ChatGPT has been a game-changer for education. Students now frequently use Generative Artificial Intelligence to complete assignments, but concern is growing about how this affects their academic integrity and critical thinking.
Michelle Cheong is a Professor of Information Systems in Education at the Singapore Management University. By evaluating ChatGPT’s performance in spreadsheet modelling, her latest research provides important insights into how educators can redesign student assessments to enhance learning at different cognitive levels.
Read the original research: https://ink.library.smu.edu.sg/sis_research/10172/
Hello and welcome to Research Pod! Thank you for listening and joining us today.
Generative Artificial Intelligence, that’s ‘GAI’, has been a game-changer for education. Tools like ChatGPT can now produce human-like responses to questions on most subjects. Just three years after its launch, a 2025 study from the UK’s Higher Education Policy Institute found that 88% of students had used GAI in the last year to help them complete assignments.
Not surprisingly, educators are finding it more challenging to stay ahead of the technology – and of their students – not least in spotting when students pass off AI-generated responses as their own. Some critics focus their concern on the ethics of GAI, arguing that it threatens academic integrity. Others worry that it inhibits students’ own critical thinking and writing skills, and point out that its answers can sometimes simply be wrong.
Few academic studies have yet quantified how GAI affects student learning. New research from Singapore Management University provides a timely insight into how assessments can be redesigned to make them GAI-resistant and ensure that students achieve the right learning outcomes.
The work has been led by Michelle Cheong, Professor of Information Systems in Education.
Dr Cheong’s research is based on an evaluation of ChatGPT’s performance in solving hypothetical business problems using spreadsheet modelling. Version 3.5 of ChatGPT was used, as it’s freely available to students. Spreadsheet modelling was chosen as the assessment task because outcomes based on numbers and logical reasoning can be analysed more objectively than text.
Two spreadsheet modelling quizzes, each with five linked questions, were posed to ChatGPT. The first quiz focused on financial calculations to find the best deal when buying a computer, given the different discounts on offer. The second quiz used Monte Carlo simulation to determine how many new COVID-19 infections could be expected among people queuing next to one another for a long time, and what could be done to reduce the number of new infections.
To ensure that ChatGPT couldn’t refer to its training data for answers, the quiz questions were original and hadn’t been posed online before the research took place.
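To give a flavour of what that second quiz involves, here is a minimal Python sketch of a Monte Carlo simulation of new infections in a queue. It is our own illustration, not the study’s model, and the infection and transmission probabilities are made-up placeholders.

```python
import random

def simulate_new_infections(queue_length=50, p_infectious=0.05,
                            p_transmit=0.30, trials=10_000):
    """Estimate the expected number of new infections in a single-file queue.

    All parameters are hypothetical placeholders, not values from the study:
      p_infectious: chance that a person in the queue is already infectious
      p_transmit:   chance that an infectious neighbour passes on the virus
    """
    total_new = 0
    for _ in range(trials):
        infectious = [random.random() < p_infectious for _ in range(queue_length)]
        new_infections = 0
        for i, already_infectious in enumerate(infectious):
            if already_infectious:
                continue  # only count people who were not infectious to begin with
            left = i > 0 and infectious[i - 1]
            right = i < queue_length - 1 and infectious[i + 1]
            # Each infectious neighbour is an independent chance of transmission.
            caught_it = (left and random.random() < p_transmit) or \
                        (right and random.random() < p_transmit)
            if caught_it:
                new_infections += 1
        total_new += new_infections
    return total_new / trials

print(f"Expected new infections: {simulate_new_infections():.2f}")
# Lowering p_transmit (for example, by spacing the queue out) reduces the estimate.
```

In a spreadsheet, the same idea would typically be expressed with random-number functions and many repeated trial rows, which is the kind of model the quiz asks for.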
The questions for each quiz were categorised into different learning levels according to the revised Bloom’s Taxonomy, which focuses on students’ intellectual skills and their developing ability to solve problems. First developed in the 1950s, Bloom’s Taxonomy is a recognised framework that divides learning objectives into six increasingly complex levels of cognition; it was later revised into a two-dimensional model.
The levels comprise (1) knowledge and (2) comprehension, for example being able to explain and summarise content. Next come (3) application and (4) analysis, such as being able to apply and analyse the information gained in a different situation. The highest levels of cognitive ability are (5) evaluation and (6) creation – being able to make judgements about content and use knowledge gained to inform and create new work and ideas.
Four different prompt engineering settings were also used in the study. ‘Prompts’ are the instructions given to GAI models to guide how they respond to questions. The way prompts are given, or ‘engineered’, provides context and helps ChatGPT know what kind of information we’re looking for.
Basic or so-called ‘Zero-Shot-Baseline’ prompts mimic our first attempt at a question, without adding any prompt components, setting any hyperparameters or providing a worked example. For example, ‘How many apples can I buy with $10 if each apple costs $0.50? The answer is.’
‘Zero-Shot-Chain-of-Thought’ prompts are still basic, but they involve two stages and should get better results. In stage one, a trigger phrase, ‘Let’s think step by step’, is added to the prompt to extract reasoning steps from ChatGPT; in stage two, the Chain-of-Thought reasoning steps extracted are added to the prompt. For example, ‘How many apples can I buy with $10 if each apple costs $0.50? Let’s think step by step.’ ChatGPT will provide the Chain-of-Thought, which could be ‘To compute the answer, divide $10 by $0.50.’ The second prompt then becomes ‘How many apples can I buy with $10 if each apple costs $0.50? To compute the answer, divide $10 by $0.50. The answer is.’
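As a rough sketch of that two-stage flow in code, here is a minimal Python example. It assumes the OpenAI Python client and the ‘gpt-3.5-turbo’ model as stand-ins for the ChatGPT 3.5 web interface used in the study, and the helper function name is our own, not from the paper.

```python
from openai import OpenAI  # assumes the 'openai' Python package (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = "How many apples can I buy with $10 if each apple costs $0.50?"

def ask(prompt: str) -> str:
    """Send a single user prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Stage 1: elicit the reasoning steps with the 'Let's think step by step' trigger.
reasoning = ask(f"{QUESTION} Let's think step by step.")

# Stage 2: append the extracted Chain-of-Thought and ask for the final answer.
answer = ask(f"{QUESTION} {reasoning} The answer is")

print(answer)
```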
‘One-Shot’ prompts are more sophisticated. The prompt includes one solved example of a similar problem. For example, ‘How many oranges can I buy with $5 if each orange costs $1? The answer is 5. How many apples can I buy with $10 if each apple costs $0.50? The answer is.’
Finally, ‘One-Shot-Chain-of-Thought’ prompts include a solved example of a similar problem and also the thought process we want the GAI model to follow. For example, ‘How many oranges can I buy with $5 if each orange costs $1? The answer is 5, by dividing $5 by $1. How many apples can I buy with $10 if each apple costs $0.50? The answer is.’
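To make the differences between the four settings easy to compare, the sketch below assembles one prompt string per setting for the apples question. The templates paraphrase the examples above; they are illustrative and not the study’s exact wording.

```python
QUESTION = "How many apples can I buy with $10 if each apple costs $0.50?"
EXAMPLE = "How many oranges can I buy with $5 if each orange costs $1?"

prompts = {
    # No context, no worked example: just the question.
    "Zero-Shot-Baseline": f"{QUESTION} The answer is",

    # Stage-one trigger only; the reasoning it returns is fed back in a
    # second call, as in the two-stage sketch above.
    "Zero-Shot-Chain-of-Thought": f"{QUESTION} Let's think step by step.",

    # One solved example of a similar problem, with its answer.
    "One-Shot": f"{EXAMPLE} The answer is 5. {QUESTION} The answer is",

    # One solved example plus the reasoning we want the model to imitate.
    "One-Shot-Chain-of-Thought": (
        f"{EXAMPLE} The answer is 5, by dividing $5 by $1. "
        f"{QUESTION} The answer is"
    ),
}

for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```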
Dr Cheong’s study had two main questions. How does ChatGPT perform in spreadsheet modelling assessments using questions categorised according to the revised Bloom’s taxonomy and using different prompt settings? And, based on these results, which cognitive levels should educators focus on when redesigning assessments to help students develop higher-order thinking skills?
The study’s findings reveal that ChatGPT performed well up to certain levels of cognition, after which the quality of answers deteriorated and included many mistakes. For example, at lower levels it could explain technical facts and complete basic calculations. But accuracy decreased at higher levels: even when ChatGPT correctly identified the arithmetical function or approach needed to find a solution, its results were wrong.
Specifically, when students used the most basic ‘Zero-Shot-Baseline’ prompt settings, ChatGPT’s results were good up to level three of Bloom’s taxonomy – the levels that comprise (1) knowledge, (2) comprehension and (3) application. However, from level four upwards – analysis – ChatGPT performed less favourably and made significant errors.
When prompt settings were upgraded to add Chain-of-Thought to the basic guidance (‘Zero-Shot-Chain-of-Thought’), accuracy remained good up to level five of Bloom’s taxonomy – the level of evaluation. This suggests educators may still fail to spot ChatGPT’s input into student assignments even at this level. That spells concern for educators, as students only need to add a simple sentence, ‘Let’s think step by step’, to get high-quality responses.
The accuracy of ChatGPT’s results improved significantly for questions at levels 3 (application) and 4 (analysis) of Bloom’s taxonomy when solved examples of similar problems were added in ‘One-Shot’ prompts. However, to get significant improvements at the fifth level (evaluation), prompts had to be upgraded to ‘One-Shot-Chain-of-Thought’ prompts, which include both a solved example and the steps by which the solution is reached. Creating a solved example to include in the prompt may not be easy for students, so ‘One-Shot’ and ‘One-Shot-Chain-of-Thought’ prompts are less of a concern for educators.
Interestingly, none of the prompts tested allowed ChatGPT to achieve Bloom’s sixth and highest level of cognitive ability, which is the ability to use the information gathered to create new solutions to problems.
Based on the results, Dr Cheong makes four recommendations to help educators redesign assessments and improve students’ thinking and learning skills.
The first is that students should be allowed to use ChatGPT in class, but that educators should guide them through learning activities that help them identify errors and limitations in ChatGPT’s results. They should also ask students how they think they could improve the quality of ChatGPT’s answers, moving from lower to higher cognitive-level questions.
To help students construct knowledge and develop their prompting skills, the study also recommends that educators encourage students to spend time formulating and sharpening their ChatGPT questions and experimenting with different prompt settings. They should then analyse the results to evaluate the different responses ChatGPT provides.
The third recommendation focuses on how collaboration helps us construct knowledge. Dr Cheong suggests that assessments should include peer group projects with ChatGPT as a group member. A project could, for example, ask students to use ChatGPT to help them build a spreadsheet model to solve a complex, real-world challenge, such as when electric vehicles might replace petrol vehicles in a given country.
Last but not least, the study recommends that educators focus their attention on testing students’ higher-order thinking skills – namely skills at cognitive level four and above, where ChatGPT was less effective – to compel students to complete assessments on their own, without using GAI tools.
By following these recommendations Dr Cheong hopes both educators and students will come to understand ChatGPT’s limitations and promises, as well as how human collaboration with GAI can lead to enriched educational experiences.
So, what other conclusions can we draw from Dr Cheong’s research?
Most importantly, ChatGPT’s success as a learning tool depends on the cognitive level of the enquiry. In line with other research, the study found that the accuracy of GAI-generated results decreases as questions become more complex.
Based on the spreadsheet modelling tasks used in this study, ChatGPT’s performance improved with prompt engineering. But understanding prompt engineering, not least how to formulate solved examples to improve prompts, is an acquired skill. It demands considerable knowledge and effort from students and educators alike.
By providing empirical evidence on ChatGPT’s performance at different cognitive levels, Dr Cheong’s study breaks important new ground. Educators have work to do if they are to redesign assessments to mitigate the negative impacts of Generative AI on student learning.
This new study provides valuable insights into how this might be achieved, and its methodology can be applied to subjects other than spreadsheet modelling.
We don’t yet know whether Generative AI is a blessing or a curse. But platforms like ChatGPT are here to stay and they’ll no doubt become even more powerful over time. Not only does this have implications for the future of education and educational assessment, it also has implications for how knowledge is defined and how it is acquired.
That’s all for this episode – thanks for listening. Links to the original research can be found in the show notes for this episode. And don’t forget to stay subscribed to Research Pod for more of the latest science!
See you again soon. Also published on: https://researchpod.org/business/redesigning-student-assessment-chatgpt
The podcast is also available on Spotify, Apple iTunes, Google Podcasts and many more platforms (please use the search term “ResearchPod”).
