
Reducing human effort in rating software

Manually evaluating code is expensive. Research co-authored by SMU Associate Professor Christoph Treude explores how large language models may lessen the load in software engineering annotation.


By Alistair Jones

SMU Office of Research Governance & Administration – A dystopian future where advanced artificial intelligence (AI) systems replace human decision-making has long been a trope of science fiction. The malevolent computer HAL, which takes control of the spaceship in Stanley Kubrick's film, 2001: A Space Odyssey, is a chilling example.

But rather than being fearful of automation, a more useful response is to consider what types of repetitive human tasks could be safely offloaded to AI, particularly with the advances of large language models (LLMs) that can sort through vast amounts of data, see patterns and make predictions.

Such is the area of recent research co-authored by Christoph Treude, an Associate Professor of Computer Science at Singapore Management University (SMU). The team explores potential roles for LLMs in annotating software engineering artifacts, a process that is expensive and time consuming when done manually.

"Our work has investigated when and how LLMs can reduce – not replace – human effort in software engineering annotation," Professor Treude says of his paper Can LLMs Replace Manual Annotation of Software Engineering Artifacts?, which won the ACM SIGSOFT Distinguished Paper Award at the 22nd International Conference on Mining Software Repositories (MSR2025).

"We examined 10 cases from previous work where multiple humans had annotated the same samples, covering subjective tasks such as rating the quality of code summaries. We found that for low-context, deductive tasks, one human can be replaced by an LLM to save effort without losing reliability. However, for high-context tasks, LLMs are unreliable. 

"[In terms of impact], our study has already been cited many times as guidance on integrating LLMs into annotation workflows, though it is sometimes over-interpreted as endorsing full replacement of humans by LLMs, which we explicitly caution against."

Consistent agreement

So, what is the process of annotation in software engineering?

"This kind of annotation assigns predefined labels to artefacts when quality or meaning can’t be judged with simple metrics. For example, whether a generated code summary is accurate, adequate, concise and similar," Professor Treude says. 

"This annotated data allows researchers to evaluate tools, analyse developer behaviour, or train new models."

For the study, raters were arranged in pairs and the degree of agreement between each pair was scored.

"Their agreement reflects label reliability," Professor Treude says. "Our study compares human-human, human-model and model-model agreement to see whether LLMs behave like human raters."

Before the LLMs began labelling, the researchers provided them with a few examples of correct input-output pairs. 

"These 'few-shot' prompts act as short training examples, helping the model learn the labelling style and increase consistency. It is a widely used best practice in LLM research," Professor Treude says.

The researchers compared different large language models, including GPT-4, Claude 3.5 and Gemini 1.5. Did one stand out?

"Newer, stronger models show slightly higher consistency, but the main signal is the agreement across models. When several capable LLMs independently agree, the task is likely suitable for some automation, regardless of which specific model is used," Professor Treude says.

And for which tasks are LLMs most suitable as a substitute for human effort?

 "LLMs work well for low-context, deductive labelling, tasks with clear, pre-defined categories, such as checking if a variable name matches its value, or assessing code summary quality," Professor Treude says. 

"They struggle with high-context tasks, such as deciding whether a bug report was truly resolved. In our 10-task dataset, seven allowed the safe substitution of one human."

Situational awareness

According to the researchers, model-model agreement is a predictor of whether a given task may be suitable for LLMs.

"We found a strong positive correlation between model-model and model-human agreement. In other words, when multiple models give the same answer, they usually match human judgment as well," Professor Treude says. 

"High model-model agreement can indicate that LLMs are able to perform a task well, while low agreement is an early warning to avoid automation."

The researchers say it is unsafe to substitute LLMs for humans in high-context tasks, for example when working with static analysis warnings, which identify potential issues in source code, such as bugs, security vulnerabilities and coding standard violations.

"That task requires substantial contextual understanding: examining code changes, project history and warning semantics. Humans achieve high agreement, but models perform poorly. This demonstrates that LLMs still lack the deep situational awareness needed for complex software engineering tasks," Professor Treude says.

While humans can be replaced for selected samples, there are risks.

"High agreement doesn’t eliminate bias. LLMs may systematically err in one direction: for example, always giving higher quality scores, or reinforce patterns from their training data," Professor Treude says. 

"There’s also a methodological risk of over-generalising these findings to other qualitative methods. To some extent, LLMs can support deductive coding, but they cannot replace the reflexivity and interpretation that are central to qualitative software engineering research.

"Our view is pragmatic: use LLMs to accelerate annotation where it’s safe, not to eliminate human judgment."

Beyond the scenario

"Our study is the first to systematically evaluate LLM substitution across software engineering annotation tasks and different LLMs," Professor Treude says. 

"We introduced two data-driven predictors for task suitability and sample selection: model-model agreement (how often two LLMs independently agree) and model confidence (how sure the model is about its own label). Together, these imply a two-step workflow for deciding when LLM assistance is appropriate: first for the task, then for each sample."

But the researchers are aware that their study has limitations. 

"We focused on a narrow slice of annotation: discrete, multiple-choice labelling with pre-defined categories on pre-selected samples," Professor Treude says. 

"We didn’t study open-ended interpretive analysis, continuous ratings, or free-text outputs. We also didn’t test for demographic bias, longitudinal model drift, or prompt variation. Therefore, our results apply specifically to structured, deductive annotation."

The researchers describe their work as a first step.

"The next step is to go beyond the scenario of replacing a single annotator. We plan to investigate situations where the annotation categories are not pre-defined and where models must deal with open-ended or evolving codebooks. We are also exploring other types of software engineering tasks to test whether the same reliability patterns hold," Professor Treude says. 

"More broadly, this is part of a larger research agenda on the use of AI in science, examining how these models might assist, and where they fall short, across different qualitative methods such as thematic analysis, grounded theory, or interview coding."

