Have you ever sought a second opinion about a medical diagnosis? Or gotten multiple estimates for a costly car or house repair? Ever noticed that judges write legal “opinions” to justify their decision about a court case?

What these scenarios have in common is experts (doctors, mechanics, judges) making decisions that have important consequences for people and society. They also imply that experts can vary in their judgment, even with years of training and experience. But what does this have to do with the Duolingo English Test (DET)?

Proctors make decisions

Like all high-stakes tests, the DET is proctored by highly trained individuals who make judgments about test-taking behaviors, namely whether test takers have followed the testing rules.

Based on these proctors’ judgments, test takers may or may not receive a certified test score, which can be used for university admissions or other purposes. Proctors must decide, for instance, whether a test taker is looking away from the computer screen because they are thinking (i.e., an innocent behavior) or because they may be accessing prohibited materials (i.e., cheating).

Subjectivity among proctors can undermine the fairness and reliability of the testing process. When judgments about test-taker behavior vary widely among proctors, the resulting inconsistency can affect the validity and integrity of test results. This variability could unfairly disadvantage some test takers based on subjective interpretations of their actions during the test.

How much do proctors vary in their decision making?

When my colleagues and I explored this question in a new peer-reviewed paper, we discovered three interesting things.

First, we identified small-to-moderate amounts of variability in proctor decision making for the DET. For perspective, this variability was similar in magnitude to the variability found in medical professionals’ measurements of patients’ respiratory rates, or variability in ER nurses’ judgments about the severity of patients’ illnesses.

Second, some kinds of proctoring decisions were more variable than others. Figure 1 shows the probability that two proctors agreed on whether a test taker’s behavior was “OK” (certified test score) or constituted “Rules Broken” or “Cheating” (uncertified test score).

Figure 1. Proctoring agreement over time (by quarter).

Proctors were more likely to agree about “Cheating” behaviors (e.g., plagiarism) than they were about “Rules Broken” behaviors (e.g., looking away excessively). Given that the consequences for “Cheating” are more severe (namely, suspension from taking the DET), this result was encouraging.
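
For readers who want a feel for how category-level agreement can be quantified, here is a minimal Python sketch. It computes the proportion of specific agreement, a common agreement statistic, and is not necessarily the measure used in the paper or plotted in Figure 1. The function name, the input format, and the sample decisions are all hypothetical.

```python
def specific_agreement(pairs, category):
    """Proportion of specific agreement for one decision category.

    pairs: iterable of (decision_1, decision_2) tuples, one per test
    session reviewed by two proctors (hypothetical input format).
    Returns None if neither proctor ever chose the category.
    """
    both = 0   # sessions where both proctors chose `category`
    total = 0  # times `category` was chosen, counted across both proctors
    for first, second in pairs:
        both += int(first == category and second == category)
        total += int(first == category) + int(second == category)
    if total == 0:
        return None
    return 2 * both / total


# Made-up decisions for illustration only; not DET data.
pairs = [
    ("OK", "OK"),
    ("OK", "OK"),
    ("OK", "Rules Broken"),
    ("Rules Broken", "Rules Broken"),
    ("Rules Broken", "OK"),
    ("Cheating", "Cheating"),
    ("Cheating", "Cheating"),
]

for category in ("OK", "Rules Broken", "Cheating"):
    print(category, specific_agreement(pairs, category))
```

In this toy data, the two proctors always agree when “Cheating” is chosen but agree less often about “Rules Broken,” which mirrors the direction of the pattern described above.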

How can we reduce this variability? 

Finally, while decision making did vary among proctors, this variability decreased (and agreement increased) over time. This is shown more clearly in Figure 2 for specific test-taking behaviors, such as plagiarism and looking away excessively, the latter of which showed more than a 50% reduction in variability by one measure.

Figure 2. Variability in proctor decision making over time (by month).

We identified three likely reasons why variability decreased (and agreement increased) over time:

  1. Proctors who were identified as outliers in their decision making (for instance, judging too many test takers as “looking away”) were retrained on the best practices for proctoring.
  2. Automated and AI-based tools, such as plagiarism detection and multiple keyboard detection, led to more objective and less variable decision making; for instance, about 90% of proctors agreed with the automated plagiarism detector when it identified plagiarism.
  3. Metrics summarizing variability in decision making were closely monitored each month, with increases in variability investigated for the root cause.
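
To make the idea of an “outlier” proctor in the first item concrete, here is a small Python sketch. It flags proctors whose rate of “looking away” judgments sits far from the group median, using a robust z-score based on the median absolute deviation. This is one generic approach, not the DET’s actual procedure; the function name, the input format, the threshold of 3, and the sample rates are all assumptions made for illustration.

```python
from statistics import median


def flag_outlier_proctors(flag_rates, threshold=3.0):
    """Return IDs of proctors whose flag rate is far from the group median.

    flag_rates: dict mapping proctor ID -> share of reviewed sessions the
    proctor judged as "looking away" (hypothetical input format).
    threshold: cutoff on the robust z-score (an assumed value).
    """
    rates = list(flag_rates.values())
    center = median(rates)
    # Median absolute deviation (MAD): a spread estimate that is not
    # distorted by the very outliers we are trying to find.
    mad = median(abs(rate - center) for rate in rates)
    if mad == 0:
        return []  # no spread to measure against
    # 1.4826 rescales the MAD so it is comparable to a standard
    # deviation when the data are roughly normal.
    return sorted(
        proctor_id
        for proctor_id, rate in flag_rates.items()
        if abs(rate - center) / (1.4826 * mad) > threshold
    )


# Made-up rates for illustration only; not DET data.
rates = {"p1": 0.10, "p2": 0.12, "p3": 0.11, "p4": 0.09, "p5": 0.35}
print(flag_outlier_proctors(rates))
```

A median-based rule is used here because a single extreme proctor would inflate an ordinary mean and standard deviation and could hide the very outlier being sought.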

Each of these reasons has been turned into a key metric that the DET uses to maintain high levels of proctor agreement in decision making and to continuously monitor the quality of proctoring.

Ensuring integrity and fairness in test scores

In addition to the variability-reduction strategies identified above, the DET ensures proctors’ decisions are consistent for all test takers by (1) having clear guidelines for when a test taker has broken a test rule, (2) calibrating proctors on best practices for making decisions, and (3) having multiple proctors make decisions about the same test taker. These tactics are similar to how doctors have “Standards of Care” and how many legal systems have multiple tiers of judges making decisions (appellate courts).

Reducing variability in proctor decision making means that honest test takers are more likely, and dishonest test takers less likely, to receive a certified test score. This is critical not only for ensuring that scores are valid, fair, and credible to all stakeholders of the DET, but also for stakeholders of all high-stakes assessments, regardless of whether proctoring is done remotely or in person (at a test center).

To learn more about how subjectivity in proctor decision making is mitigated for test takers and test users, read the full paper in Educational Measurement: Issues and Practice.