
Performance Study

Our goal was to develop a system whose errors stem from technological limitations rather than design aspects — with the aim of creating the most precise rule-based automated scoring possible today.

Kander Akinci · April 2025

We want to demonstrate how accurate rule-based automated scoring with GPT models can be while also highlighting its strengths and weaknesses. All results are generated with the GPT-4o-2024-08-06 model and may vary with other models.

Introduction

This report shows how well a rule-based system using GPT-4o can score student answers in an electrical engineering exam. The goal was to test how accurate and reliable automated scoring can be when clear rules are used. To measure performance, we looked at how often the system agreed with a teacher (interrater reliability) and how consistent the system was when scoring the same answer more than once (intrarater reliability). We used standard metrics like Kappa values and error rates to evaluate this.

The results show that the system works very well in most cases, especially when tasks are clearly defined. It can score answers with high accuracy and consistency, but some weaknesses remain — especially when the scoring instructions are unclear or when the model misunderstands the task. The system is strong in being flexible, repeatable, and able to work across subjects, but it requires careful setup and regular updates to remain accurate.

Dataset

The Electrical Engineering exam on the topic of circuit breakers consists of 13 independent open-ended tasks and was completed by 54 students. The participants come from two different grade levels: a senior class (12th grade) and a middle-level class (11th grade). While notable differences in response quality exist between these groups, linguistic variations primarily result from students' sociocultural backgrounds. The exam primarily consists of open short-text responses that require detailed feature extraction and synthesis. Two tasks involve calculations with multiple computation steps. Out of a total of 689 responses, 72 were missing (10.45%). To ensure precise and reliable evaluation, all invalid or empty responses were excluded from the analysis, preventing artificially inflated agreement rates.

Data Collection

In human ratings, the consistency of interpretation (feature extraction) and the application of the scoring scheme (evidence accumulation) pose challenges. In contrast, evidence accumulation remains identical in automated systems. However, it is necessary to clarify whether features can be reliably identified and extracted [1]. Subject matter experts are characterized by their special competence in feature extraction within their domain, yet they do not always demonstrate consistency in applying the scoring scheme [16]. This ambivalent situation, in which the teacher functions both as a source of error and as a benchmark, requires a differentiated data collection approach and the control of relevant influencing factors in order to systematically analyze scoring deviations.

Particularly when an instrument is still in its early stages, an investigation that clarifies possibilities, anomalies, and boundary conditions — a formative evaluation — is the most valuable [6]. Therefore, it is necessary to conduct post-administration analyses even for complex tasks that are scored using automated systems. The operational implementation enables a comprehensive evaluation of automated scoring across the full range of participant responses. This allows for the identification of rare and unforeseen cases that could potentially lead to errors in automated scoring [16].

Subject matter experts play a central role in this process. They are initially required to provide high-quality ratings and to operationalize the measurement intent within the scoring model. Williamson, Bejar, and Sax [16] propose five key steps for an operational evaluation:

  1. Evaluation: Determining both automated and human ratings.
  2. Deviations: Comparing automated and human ratings to identify discrepancies.
  3. Analysis: Selecting a discrepant case and investigating the cause(s).
  4. Decision: Determining whether the automated rating, the human rating, or a compromise should be used as the final rating.
  5. Re-evaluation: Reassessing all responses through the system.
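Assuming hypothetical stand-ins for the scoring and resolution functions (none of these names come from the study or any real API), steps 1 through 4 can be sketched as a single evaluation pass; step 5 corresponds to re-running the pass after the scoring model has been adjusted:

```python
# Sketch of one pass of the five-step operational evaluation.
# score_automated, score_human, and resolve are hypothetical callables.

def operational_evaluation(responses, score_automated, score_human, resolve):
    # Step 1: obtain both automated and human ratings for every response
    auto = {r: score_automated(r) for r in responses}
    human = {r: score_human(r) for r in responses}
    # Step 2: compare the ratings to identify discrepancies
    discrepant = [r for r in responses if auto[r] != human[r]]
    # Steps 3-4: investigate each discrepant case and decide on a final rating
    final = dict(auto)
    for r in discrepant:
        final[r] = resolve(r, auto[r], human[r])
    # Step 5: re-evaluation means re-running this pass on all responses
    # after the scoring model has been adjusted
    return final, discrepant
```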

A particular challenge in implementing adjustments is that not only must an immediate discrepancy be corrected, but it must also be ensured that the modification does not introduce new inconsistencies into the scoring system. Therefore, all adjustments must be experimentally tested to assess their impact on all responses before being integrated into the operational system. Furthermore, the individual application of scoring criteria by teachers can vary, as their assessments are not always entirely consistent. In some cases, this results in repeated minor adjustments to automated scoring without establishing a stable solution, potentially leading to an endless loop. Therefore, it is essential that adjustments are not only made in a targeted manner but also examined for their long-term effects [16].

Preparation Phase

Due to the aspects mentioned above and the objective of achieving increased efficiency in teacher performance through an automated scoring system, the five previously described steps are first tested using a set of four student responses. This allows the development of the scoring model within a limited scope and supports a practical model-building approach, as well as adjustments made by the teacher. The student responses are deliberately selected to cover a wide range of borderline cases and to minimize scoring deviations.

First, the teacher undergoes comprehensive training to develop a scoring model based on the designed model-based scoring approach. This scoring model serves as the foundation for assessing open-ended tasks and includes clearly defined scoring and evidence rules. Subsequently, the teacher reviews tasks that are equipped with relevant materials and contexts. Each task is pre-defined to specify which specific skills are being measured, which observations are required for assessment, how points are awarded, and how they are aggregated into a final score.

In the next step, the teacher adapts the scoring model by creating aggregation nodes and feature nodes to optimally represent the intended scoring criteria. Following this, the four selected student responses are scored by both the system and the teacher.

The system calculates the Relative Mean Absolute Error (RMAE) for each feature node across all four student responses to quantify the average deviation between human and automated scoring. Subsequently, the feature nodes are adjusted to align with the intended scoring criteria. This process is iteratively repeated until a sufficient agreement between the system's and the teacher's scores is achieved. Effectively, this process aims to close unintended gaps in interpretation.

Evaluation

After the teacher has considered exemplary student responses in the preparation phase, the remaining responses are scored by the system based on the predefined scoring model. The teacher then reviews the system's scores by critically analyzing both the underlying evidence rules and the system's reasoning, identifying potential deviations.

This approach has the advantage of allowing a certain degree of intentional interpretative flexibility within the feature node instructions. This flexibility may arise due to the non-convergent nature of the problem, the absence of precisely defined scoring standards set by the teacher, or the teacher's deliberate delegation of evaluation responsibility to the system beyond a certain level of granularity.

By systematically attempting to falsify the system's assessments, unnecessary random errors are avoided. Where feature nodes allow interpretative flexibility, scoring by the teacher independently of the system could itself introduce a random component.

This approach shifts the focus of assessment to the system, as the teacher evaluates the system's scores rather than acting as the primary scoring authority. This could be interpreted as a bias. However, empirical evidence suggests that a reverse approach, in which the teacher and system independently perform evaluations, would lead to a disproportionately higher error rate [12][16]. This assumption is further supported by findings showing that automated systems based on expert-designed scoring logic can even surpass expert performance in scoring accuracy [10][1].

Data Analysis

Notation

The number of different student responses corresponds to the number of analyzed tasks and is denoted by N, with tasks indexed as i = 1, …, N. Each node has a set number of scoring categories k, indexed as j = 1, …, k. The frequency with which a specific category j is assigned to node i is represented by n_ij, whose value is limited by the maximum number of rating repetitions, denoted as n.

Interrater Reliability

The evaluation of agreement between raters is relatively straightforward at a summary level, as it primarily involves comparing scales. However, with more complex scoring guidelines, potential discrepancies increase significantly, while at the same time, the relevance of these discrepancies is often so minimal that they do not affect the overall score [2].

For this reason, not only the feature nodes are considered, but also the top-level aggregation nodes of each scoring model within each task model. This means that all subtask sum scores are regarded as intermediate steps toward the final score. In this way, the error at the feature node level can be related to both the intermediate and final results.

Relative Mean Absolute Error (RMAE)

The Mean Absolute Error is a measure of the average deviation between the teacher's assessments and the system's assessments. It is calculated by determining the average absolute deviation between the expert's ratings p_i and the system's ratings p̂_i.

\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |p_i - \hat{p}_i|

The smaller the MAE, the more accurate the model. Since absolute differences are used, positive and negative errors do not cancel each other out. However, the MAE of a node alone does not provide a quantitative measure of rating quality because it is not normalized. The weighting M of the respective node scales this value:

\text{RMAE} = \frac{\text{MAE}}{M}

The Relative Mean Absolute Error describes the average relative deviation between the teacher's evaluation and the system's evaluation in relation to the maximum possible score.
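Under these definitions, MAE and RMAE reduce to a few lines (a minimal sketch; the function and parameter names are ours, not part of the study's implementation):

```python
def mae(human, system):
    """Mean Absolute Error between human scores p_i and system scores p_hat_i."""
    return sum(abs(p - p_hat) for p, p_hat in zip(human, system)) / len(human)

def rmae(human, system, max_score):
    """RMAE: the MAE normalized by the node's maximum possible score M."""
    return mae(human, system) / max_score
```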

Relative Total Signed Error (RTSE)

MAE provides a quantitative measure of the deviation of nodes. However, due to the loss of information regarding the direction of the error, it cannot indicate whether the nodes tend to have higher or lower values compared to the expert evaluation. To address this, the Total Signed Error (TSE) is introduced:

\text{TSE} = \frac{1}{N} \sum_{i=1}^{N} (p_i - \hat{p}_i)

Similar to the RMAE, the TSE is also normalized by the weighting M:

\text{RTSE} = \frac{\text{TSE}}{M}

The RTSE expresses the relative deviation of a node, with positive and negative deviations potentially canceling each other out.
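The signed counterparts follow the same pattern (a sketch with our own function names; note that with p_i as the teacher's score, a negative RTSE means the system scores higher on average):

```python
def tse(human, system):
    """Total Signed Error: positive when the system scores below the teacher."""
    return sum(p - p_hat for p, p_hat in zip(human, system)) / len(human)

def rtse(human, system, max_score):
    """RTSE: the TSE normalized by the node's maximum possible score M."""
    return tse(human, system) / max_score
```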

Cohen's Kappa coefficient (κ)

Cohen's Kappa (κ) is widely used as an indicator of agreement between two raters [11][14][9] and as a quality measure for automated scoring systems [5][13][17].

\kappa = \frac{P_o - P_e}{1 - P_e}

where P_o is the observed agreement and P_e is the expected agreement:

P_o = \frac{1}{N} \sum_{j=1}^{k} M_{jj}
P_e = \frac{1}{N^2} \sum_{j=1}^{k} M_{j+} \cdot M_{+j}

The Kappa value lies within [−1, 1], where κ = 1 indicates perfect agreement, κ = 0 corresponds to chance, and κ = −1 indicates complete disagreement. If Cohen's Kappa exceeds 0.75, it is considered good to excellent interrater reliability [15].
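These formulas translate directly into code once the contingency matrix M of the two ratings is built (a minimal sketch, not the study's implementation):

```python
import numpy as np

def cohens_kappa(rater_a, rater_b, k):
    """Cohen's kappa for two raters assigning categories 0..k-1."""
    M = np.zeros((k, k))
    for a, b in zip(rater_a, rater_b):
        M[a, b] += 1                   # contingency matrix of joint ratings
    n = M.sum()
    p_o = np.trace(M) / n              # observed agreement P_o
    p_e = (M.sum(axis=1) * M.sum(axis=0)).sum() / n**2  # chance agreement P_e
    return (p_o - p_e) / (1 - p_e)
```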

Quadratically Weighted Cohen's Kappa (κ_W)

A key aspect of the Kappa coefficient is that it does not take into account which alternative category a rater chooses when there is no agreement. Since an ordinal scale is used, the categories have a natural ranking. A greater discrepancy should be weighted more heavily:

\kappa_W = 1 - \frac{1 - P_o^*}{1 - P_e^*}

The weight w_jl between two categories is defined by the squared difference of their category numbers:

w_{jl} = (j - l)^2

The quadratically weighted Kappa is nearly identical to the Pearson correlation coefficient when both the rating system and the raters assign integer ratings [5][14]. It is important to note that quadratically weighted Kappa is sensitive to the number of scale points and tends to yield higher values as the number of scale points increases [4].
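The quadratically weighted variant changes only how disagreements are counted, replacing exact-match agreement with weighted observed and expected disagreement (again a sketch; scikit-learn's cohen_kappa_score with weights="quadratic" can serve as a cross-check):

```python
import numpy as np

def weighted_kappa(rater_a, rater_b, k):
    """Quadratically weighted Cohen's kappa with weights w_jl = (j - l)^2."""
    M = np.zeros((k, k))
    for a, b in zip(rater_a, rater_b):
        M[a, b] += 1
    n = M.sum()
    j, l = np.meshgrid(np.arange(k), np.arange(k), indexing="ij")
    w = (j - l) ** 2                                 # quadratic disagreement weights
    E = np.outer(M.sum(axis=1), M.sum(axis=0)) / n   # expected counts under chance
    return 1 - (w * M).sum() / (w * E).sum()
```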

Intrarater Reliability

To examine the repeatability of the assessment, intrarater reliability is calculated. The system generates a rating score for the same student response at different points in time [9]. The deviations between these ratings are then analyzed to evaluate the consistency of the assessment.

To determine the intrarater reliability, the four different assessments (N = 4) were re-evaluated six times (n = 6) by the system. Care was taken to ensure that the evaluations were independent of each other. Specifically, this means that no cached tokens were received during requests to the GPT model (cached tokens = 0). Additionally, the temperature of the model was set to 0 to make its output as deterministic as possible.
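As a sketch of how such an independent re-scoring request might be configured: the field names follow the OpenAI Chat Completions request format, the model name is the one used in this study, and the prompt contents are placeholders, not the study's actual prompts:

```python
# Illustrative request payload for one re-scoring run.
# Field names follow the OpenAI Chat Completions API; prompts are placeholders.

def build_scoring_request(system_prompt, student_response):
    return {
        "model": "gpt-4o-2024-08-06",
        "temperature": 0,  # minimize sampling variance between repeated runs
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": student_response},
        ],
    }
```

Whether a run was truly independent of earlier ones can then be checked via the cached-token count reported in the API response's usage statistics, as done in the study.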

Quadratically Weighted Fleiss' Kappa (κ_FW)

Since this study analyzes the discrete point-based scoring of feature nodes, choosing appropriate metrics is crucial. Fleiss [18] proposed an extension of the Kappa coefficient for multiple raters. The weighted Fleiss' Kappa generalizes Cohen's Kappa for more than two raters and incorporates weighting factors:

\kappa_{FW} = \frac{\bar{P}_o - \bar{P}_e}{1 - \bar{P}_e}

where P̄_o represents the observed agreement between raters, while P̄_e denotes the expected agreement based on random category assignments.
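For illustration, the unweighted Fleiss' kappa can be computed from the count matrix n_ij defined in the Notation section; the weighted variant used in the study additionally applies the quadratic weights to the disagreement terms (a sketch, not the study's implementation):

```python
import numpy as np

def fleiss_kappa(counts):
    """Unweighted Fleiss' kappa from an (N x k) matrix of category counts n_ij.

    Each row is one response, each column a scoring category; every row sums
    to the number of repeated ratings n.
    """
    counts = np.asarray(counts, dtype=float)
    N, k = counts.shape
    n = counts[0].sum()                                  # ratings per response
    p_j = counts.sum(axis=0) / (N * n)                   # overall category proportions
    P_i = ((counts**2).sum(axis=1) - n) / (n * (n - 1))  # per-response agreement
    P_bar_o, P_bar_e = P_i.mean(), (p_j**2).sum()
    return (P_bar_o - P_bar_e) / (1 - P_bar_e)
```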

Results

Interrater Reliability

The system performed 57 feature extractions for each of the 50 student responses that were not used in developing the scoring model. The following table presents the agreement metrics, calculated as the mean agreement across all 50 responses, after the teacher reviewed the system's scoring.

| Node | RMAE | RTSE | κ | κ_W | EA |
|---|---|---|---|---|---|
| F1 | 0.041 | −0.024 | 0.792 | 0.842 | 0.902 |
| F57 | 0.143 | −0.143 | 0.478 | 0.557 | 0.714 |
| Mean | 0.091 | −0.047 | 0.745 | 0.764 | 0.889 |
| Std Dev | 0.088 | 0.101 | 0.223 | 0.223 | 0.095 |
| Variance | 0.008 | 0.010 | 0.050 | 0.050 | 0.009 |
| Median | 0.056 | −0.025 | 0.791 | 0.840 | 0.908 |
| Minimum | 0.000 | −0.308 | 0.000 | 0.000 | 0.692 |
| Maximum | 0.308 | 0.171 | 1.000 | 1.000 | 1.000 |

The primary result considered is the median, as it is a robust metric that is less susceptible to outliers. The median RMAE across all feature nodes is 5.6% with a standard deviation of 8.8%, while the median RTSE is −2.5%. The unweighted kappa value (κ) is 0.791, while the weighted kappa value (κ_W) is 0.84. According to Landis and Koch [11], these values correspond to substantial to almost perfect agreement. The exact agreement (EA) is 90.8%.

Furthermore, it becomes evident that some feature extractions show deviations of up to 30.8%. The following figure illustrates how the agreement between the system and the teacher relates to the frequency of feature extractions.

[Figure] Histograms showing the distribution of RMAE (left) and κ_W (right) across all 54 feature nodes.

7 out of 54 feature nodes have no errors (RMAE = 0% and κW\kappa_W = 1) and were therefore always rated correctly according to the teacher's assessment. For 29 feature nodes, the RMAE is between 0% and 6%, while for 30 feature nodes, κW\kappa_W is between 0.8 and 1. Notably, 10 feature nodes have an RMAE ranging from over 19% to 30.8%.

Total Sum Score

The total sum score of the entire test P_0 has an RMAE of 3.7%, an RTSE of −1.8%, a κ of 0.07, and a weighted κ_W of 0.964.

| Node | RMAE | RTSE | κ | κ_W | EA |
|---|---|---|---|---|---|
| P0 | 0.037 | −0.018 | 0.070 | 0.964 | 0.073 |

The sensitivity of the kappa value to the number of scale points, as described by Brenner and Kliebsch [4], is clearly observable. Nodes at higher levels of aggregation tend to have lower kappa values, while the weighted kappa is correspondingly higher. The highest node P_0 has a total of 651 possible states. The positive correlation between Cohen's kappa and exact agreement (EA) highlights that the low kappa values are primarily due to a low number of exact matches. The high weighted kappa values indicate that differences are relatively minor. This is further supported by the low RMAE of 3.7%.

Intrarater Reliability

Overall, the median intrarater reliability of 0.876 indicates high consistency within the assessments. However, the weighted Fleiss' kappa coefficient (κ_FW) varies significantly between individual features, as reflected in the standard deviation of 0.324.

| Node | κ_FW |
|---|---|
| F1 | 1.000 |
| F57 | 0.127 |
| Mean | 0.734 |
| Std Dev | 0.324 |
| Variance | 0.105 |
| Median | 0.876 |
| Minimum | 0.000 |
| Maximum | 1.000 |

Correlation between κ_FW and RMAE

The Pearson correlation between κ_FW and RMAE is −0.3027, indicating a weak to moderate negative linear relationship. With a p-value of 0.0261, this correlation is statistically significant at α = 0.05: if there were no true relationship, a correlation at least this strong would be expected in only about 2.6% of samples.

This correlation suggests that, in general, feature nodes with lower errors tend to have higher agreement values. However, exceptions exist: some feature nodes exhibit low errors but also low agreement, while others display high errors despite high reliability.

Causes of Errors and Instability

κ_FW = high, RMAE = high

A frequently occurring error was that a certain feature was repeatedly recognized as present, even though it was not actually found in the student's response. This could be due to the fact that the feature was mentioned in the task prompt and students were merely expected to transfer it.

This indicates that the GPT model does not strongly differentiate between the student's response and the task prompt but rather aligns more closely with the model answer provided in the prompt. This highlights that GPT is a probabilistic model operating on statistical associations: it does not truly understand content and does not think consciously or reflectively. Teachers should be aware of this limitation and carefully adjust the prompting to ensure a reliable assessment.

κ_FW = low, RMAE = low

In this feature extraction process, the recognition was inconsistent but was not classified as an error. The goal was to determine whether the student's response implicitly contained the required model answer — meaning whether the student might have intended the correct answer. The issue here is that the term implicit is not clearly defined.

The system attempts to identify, based on surface features, whether the student has understood the meaning of the model answer. This type of feature extraction poses a challenge for both the system and teachers, as semantics range on a spectrum from fully explicit to fully implicit. To enable a more precise evaluation, it would first be necessary to define which meanings are explicitly present. Our study on teachers' cognition shows that these exact considerations also play a role in teachers' feature extraction, suggesting that there are no fixed rules for conducting an implicit or meaning-based evaluation.

κ_FW = low, RMAE = high

This feature extraction leads to inconsistent evaluations that are classified as errors. The error is clearly identifiable — however, in this instance, the issue is not a misdirected feature extraction but rather an incorrect execution of the instructions.

The error does not occur systematically. It is suspected that, in some cases, the model's attention is not sufficiently focused on correctly following the instructions. A possible solution would be to direct the model's attention more toward the precise execution of instructions. However, increasing focus in this area could lead to reduced attention in others, potentially introducing new errors.

Conclusion

In summary, the feature extraction by the GPT model in this class assignment deviates by a median of 5.6% from the teacher's evaluation. For feature extraction, the evaluations of the teacher and the system show almost perfect agreement, with a Quadratically Weighted Kappa (κ_W) of 0.84. When aggregating the extracted features, the error is reduced even further, resulting in an RTSE of −1.8% for the total score. This means that, on average, the system assigns 1.8% more points than the teacher.

It is important to highlight that this result is based on the model's adaptation to four student responses. Further improvement is likely if the teacher makes additional adjustments.

However, these values do not provide a general statement about the extraction capability of GPT models. It remains unclear which features teachers typically extract in different contexts. To make more reliable statements, it would be necessary to analyze a large number of representative feature extractions across various subject areas. To reach a more general conclusion, we are gradually expanding our study with additional class assignments from different subject areas.

Weaknesses

  • The automated grading system requires teachers to engage in a detailed analysis of complex error sources, where fixing one error may introduce another.
  • The underlying GPT model is continuously evolving and retrained with new data. As a result, its behavior changes over time, meaning teachers cannot create a grading model once and rely on it indefinitely. They must regularly review and reassess its functionality.
  • The grading system requires a large number of specific instructions, which can lead to complex and nested hierarchies. The formulation of these instructions does not follow clear guidelines and can be sensitive to linguistic nuances.
  • Although only a few student responses are required to develop the grading model, the accuracy of its evaluations remains uncertain in unfamiliar cases.
  • To ensure the model's quality and minimize systematic errors, ongoing evaluation is necessary. Representative exams must be analyzed to make heuristic adjustments and systematically optimize the grading model.

Strengths

  • In addition to the high satisfaction of experts with the system's evaluations, its reproducibility is very high.
  • The grading system provides extensive control and flexibility by allowing for any level of granularity. Teachers can precisely determine the level at which the evaluation takes place and how detailed the analysis should be.
  • The instructions serve as a universal control mechanism, enabling the extraction process to be directed in different ways. The extraction itself is based on linguistic definitions and is limited only by language constraints rather than technical or structural requirements.
  • The system can be used for all school subjects and domains, as the test creator can specify the required information and procedures for feature extraction.

Additional Cases

Spelling Exam

The spelling exam consists of a total of nine tasks and was completed by 47 students. Each task contains multiple response fields, resulting in a total of 1,833 answers. It consists exclusively of convergent tasks, as each has an objectively assessable solution. The tasks can be divided into three categories: the spelling section (covering capitalization, word types, and word formation — four tasks), three tasks assessing the correct use of "das" or "dass", and two tasks focusing on comma placement.

| Node | RMAE | RTSE | κ | κ_W | EA |
|---|---|---|---|---|---|
| F1 | 0.021 | 0.021 | 0.877 | 0.877 | 0.979 |
| F59 | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| Mean | 0.019 | −0.006 | 0.939 | 0.949 | 0.971 |
| Std Dev | 0.024 | 0.028 | 0.081 | 0.059 | 0.055 |
| Variance | 0.001 | 0.001 | 0.006 | 0.004 | 0.003 |
| Median | 0.021 | 0.000 | 0.956 | 0.956 | 0.979 |
| Minimum | 0.000 | −0.106 | 0.567 | 0.731 | 0.638 |
| Maximum | 0.106 | 0.043 | 1.000 | 1.000 | 1.000 |

References

  1. Bejar, I. I., Williamson, D. M., & Mislevy, R. J. (2006). Automated Scoring for Complex Assessments. ETS Research Report.
  2. Bennett, R. E., & Sebrechts, M. M. (1996). The accuracy of expert-system diagnoses of mathematical problem solutions. Applied Measurement in Education, 9(2), 133–150.
  3. Bortz, J., & Döring, N. (2006). Forschungsmethoden und Evaluation für Human- und Sozialwissenschaftler. Springer Medizin Verlag.
  4. Brenner, H., & Kliebsch, U. (1996). Dependence of Weighted Kappa Coefficients on the Number of Categories. Epidemiology, 7, 199–202.
  5. Bridgeman, B. (2013). Human Ratings and Automated Essay Evaluation. In M. D. Shermis & J. Burstein (Eds.), Handbook of Automated Essay Evaluation (pp. 221–250). Routledge.
  6. Cronbach, L. J. (1988). Five Perspectives on the Validity Argument. In Test Validity.
  7. Girard, J. (2024). Cohen's Kappa Coefficient — Formulas and MATLAB Implementation.
  8. Gwet, K. L. (2014). Handbook of Inter-Rater Reliability, 4th Edition. Advanced Analytics, LLC.
  9. Hammann, M., & Jördens, J. (2014). Offene Aufgaben codieren. In Methoden in der naturwissenschaftsdidaktischen Forschung (pp. 169–178). Springer-Verlag.
  10. Kleinmuntz, B. (1963). MMPI decision rules for the identification of college maladjustment. Psychological Monographs, 77(14), 1–22.
  11. Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
  12. Maier, U. (2015). Leistungsdiagnostik in Schule und Unterricht. Bad Heilbrunn: Verlag Julius Klinkhardt.
  13. Mizumoto, A., & Eguchi, M. (2023). Exploring the potential of using an AI language model for automated essay scoring. Research Methods in Applied Linguistics, 2(2).
  14. Rost, J. (1996). Lehrbuch Testtheorie Testkonstruktion. Hans Huber, Bern.
  15. Schmiemann, P., & Lücken, M. (2014). Validity — Misst mein Test, was er soll? In Methoden in der naturwissenschaftsdidaktischen Forschung.
  16. Williamson, D. M., Bejar, I. I., & Sax, G. (2004). Human Scoring. In Automated Scoring of Complex Tasks in Computer-Based Testing.
  17. Xiao, C., Ma, W., Xu, X. S., Zhang, K., Wang, Y., & Fu, Q. (2024). From Automation to Augmentation: Large Language Models Elevating Essay Scoring Landscape. arXivLabs.
  18. Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.