When it comes to the Duolingo English Test, a question I frequently hear from stakeholders I meet is, “How valid is the test?” The “validity and reliability” of a test seem to be top of mind for test users. So, what exactly is test validity? How does the field of language assessment define it, and how can a test demonstrate that it’s a valid measure of language proficiency? Let’s delve into this together!

What is Validity in Language Assessment?

A language assessment (or test) is a tool used by many stakeholder groups to collect information about one’s language ability. To be valid, this tool needs to be systematically grounded in existing theory as well as empirical evidence. 

According to the Standards for Educational and Psychological Testing, published in 2014 by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME), test validity is the extent to which evidence and theory support the interpretations of test scores for the proposed uses of the test. In other words, a valid English test needs to demonstrate that it measures what it’s supposed to measure (i.e., English language ability) so that test users (e.g., university applicants, admissions officers, employers, etc.) can utilize its scores confidently for their purposes.

There are a few validity frameworks in language assessment that researchers and test developers have used to validate tests. Construct validity (that is, what it is that the test is trying to measure) is central to all of them. Famously, Messick (1989), and later Chapelle (2020), argued that because test scores carry meanings that reflect test takers’ language knowledge, defining the construct is essential for guiding test development and predicting test performance.

But what else beyond construct validity does an English test need to demonstrate? For Bachman and Palmer (1996), test validation studies need to be guided by test usefulness. More recently, however, the field has pivoted towards argument-based validity, where test developers need to make clear statements about proposed test score interpretations and uses (i.e., inferences we can draw from the score) and provide evidence for them (Kane, 2006).

What inferences can we make from a valid test?

What conclusions should users of a valid test be able to draw? For one, they need to be able to use the test score for their intended purpose (e.g., university admissions). They also need to be able to trust that test takers’ performance is summarized and reflected accurately and reliably in the scores.

Therefore, there are several claims that a valid language test needs to support:

  1. Test developers have analyzed the target domain of language use (e.g., university study) to create the test tasks that elicit relevant performance.
  2. The test is designed to measure language skills through tasks that appropriately reflect target language use situations (e.g., university study).
  3. The test can justifiably serve as the basis for high-stakes decisions about test takers’ future educational or employment prospects (e.g., students admitted with a certain test score go on to perform well at a university).
  4. For tests used in university admissions specifically, the scores reflect the level of language performance that test takers are likely to display in their academic studies.
  5. The test generates positive test preparation activity: its tasks encourage learners to develop the relevant language skills rather than simply “study for the test.”
  6. The scores on the test reflect varying levels of language knowledge as well as performance consistency. 

These claims fall under six categories of the test validation process identified in Chapelle (2020): domain definition, evaluation, generalization, explanation, extrapolation, and utilization of test scores.

How has the Duolingo English Test demonstrated its validity?

Through internal and external research, the Duolingo English Test has built a strong validity argument to support these claims and to justify the interpretations and uses of DET scores for decisions about admissions to English-medium universities.

Domain definition

For each task on the DET, we conduct a domain analysis to support the content validity of the test and to ensure that test-taker performance on these tasks is relevant to the target domain. For example, Goodwin et al. (2024) detailed the decisions behind designing our Interactive Writing task, identifying relevant task features and evidence types to ensure that it assesses the target writing construct.

Evaluation

The DET scoring team develops the test’s scoring models and analyzes (and refines, where needed) their performance. We also develop scoring rubrics that our human raters use to score a subset of spoken and written production tasks. We use these data to calculate how well our automated scoring models agree with human scoring (spoiler alert: the agreement is very strong, with Pearson correlations over 0.85). This argument is conceptually related to the construct validity of a test.
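To make that agreement check concrete, here is a minimal illustrative sketch in Python. The scores are invented for the example; this is not our production scoring pipeline or actual DET data.

```python
# Illustrative sketch only: invented numbers, not DET data or the DET scoring pipeline.
from statistics import mean, stdev

# Hypothetical scores assigned to the same eight responses by an
# automated model and by trained human raters.
machine = [112, 95, 130, 104, 88, 121, 99, 140]
human   = [110, 98, 128, 101, 90, 124, 97, 138]

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y over the product of their standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

print(round(pearson_r(machine, human), 3))  # values near 1.0 indicate strong machine-human agreement
```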

Generalization

The scoring team also monitors the reliability of DET scores (an important aspect of test validity) across test sessions, test-taker groups, and so on. The DET’s test-retest reliability coefficients are extremely high (over 0.90) for all subscores and the overall score.
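Conceptually, a test-retest reliability coefficient is simply the correlation between the scores the same test takers receive on two separate sessions. A toy sketch of that calculation, using invented numbers rather than DET results:

```python
# Toy illustration of test-retest reliability: invented scores, not DET data.
import numpy as np

# Overall scores for the same six test takers on two separate test sessions.
session_1 = np.array([105, 120, 95, 130, 110, 100])
session_2 = np.array([108, 118, 97, 127, 112, 103])

# The test-retest reliability coefficient is the Pearson correlation between sessions.
reliability = np.corrcoef(session_1, session_2)[0, 1]
print(round(reliability, 3))  # coefficients above 0.90 indicate highly stable scores
```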

Explanation

One way to demonstrate that test scores are an indicator of academic English proficiency is through their relationships with scores on other tests. To that end, Cardwell (2024) calculated correlations between DET, TOEFL iBT, and IELTS Academic scores using over 7,000 score reports, showing a consistently strong relationship among the three tests’ scores. This serves as evidence for the concurrent validity of the DET.
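As a rough sketch of what such a concurrent-validity analysis involves, one could compute a correlation matrix across the three tests. The score triples below are made up for illustration; they are not the data analyzed in Cardwell (2024).

```python
# Illustrative only: invented score triples, not the data from Cardwell (2024).
import pandas as pd

scores = pd.DataFrame({
    "DET":       [120, 105, 135, 95, 150, 110],
    "TOEFL_iBT": [94,  80,  105, 70, 115, 86],
    "IELTS":     [7.0, 6.0, 7.5, 5.5, 8.5, 6.5],
})

# Pairwise Pearson correlations; strong positive values suggest the tests
# rank test takers similarly, i.e., evidence of concurrent validity.
print(scores.corr().round(2))
```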

Extrapolation

Through our recent meta-analysis, we have demonstrated that DET scores predict test takers’ performance in university contexts just as well as other high-stakes tests do. Externally, Isbell et al. (2023), for example, also showed that the DET overall score, as well as the Production and Conversation subscores, strongly predicts test takers’ academic preparedness as perceived by professors, fellow students, and staff. This is evidence of the test’s predictive validity.
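To illustrate the logic of predictive validity under stated assumptions, here is a hedged sketch relating hypothetical test scores to hypothetical first-year grade averages (not our meta-analysis data): a correlation plus a simple least-squares line predicting later academic performance from a score.

```python
# Illustrative sketch of predictive validity: invented data, not meta-analysis results.
import numpy as np

test_scores = np.array([100, 110, 120, 130, 140, 150])      # hypothetical DET-style scores
first_year_gpa = np.array([2.8, 3.0, 3.2, 3.3, 3.6, 3.7])   # hypothetical academic outcomes

# The correlation between test scores and later academic performance
# is one common index of predictive validity.
r = np.corrcoef(test_scores, first_year_gpa)[0, 1]

# A simple least-squares line gives the predicted GPA for a given score.
slope, intercept = np.polyfit(test_scores, first_year_gpa, 1)
print(round(r, 2), round(slope * 125 + intercept, 2))  # correlation and predicted GPA at a score of 125
```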

Utilization

DET scores are used to make admissions decisions by over 5,000 English-medium universities! To understand how the test affects test takers’ English learning, we are currently running a study with external research collaborators on the DET’s washback, contributing further construct validity evidence.

Test validity is certainly a crucial aspect of language assessment. Without it, how do you know that a test accurately measures what it intends to measure and provides credible information about one’s English proficiency for a given context? We have built a solid validity argument for the DET by addressing the key components of test validation, establishing the test as a trusted high-stakes proficiency assessment around the globe.
