Take almost any standardized language proficiency test, and you'll encounter a certain structure: a section on listening, then reading, then writing, then speaking (i.e., receptive skills before productive ones). In other words, items (that is, the questions and tasks that make up the test) are grouped by skill. But have you ever wondered why tests are organized this way? And whether there's actually any evidence that this approach to test design is optimal?

These are the questions we address in a new systematic review published in the journal Language Testing.

What research says about language test ordering

To find out what empirical research says about test ordering, we reviewed 88 studies published from 1933 to 2023 that examined how the ordering of items, tasks, and sections affects test-taker performance, perceptions, and affect (i.e., emotions, feelings, and moods).

What we found was striking: despite the field's emphasis on rigorous test validation, only three of the 88 studies we reviewed directly examined ordering effects in language proficiency assessment contexts. The vast majority of research has been conducted in university classroom settings, primarily on multiple-choice exams. Given the diversity of testing contexts and test formats, this research base is narrow in scope.

For a design decision baked into virtually every high-stakes language test, the empirical foundation is remarkably thin.

Do easier questions improve language test performance?

The most studied question in the ordering literature concerns difficulty: should tests present items from easy to hard, hard to easy, or in some other configuration? The evidence, while mixed, tends to favor starting with easier items: test takers who begin with lower-difficulty items perform modestly better on average, particularly under time pressure, and report lower anxiety. These effects do not apply uniformly; they vary across groups of test takers and testing conditions, and are strongest among lower-proficiency and more anxious test takers. And when features of a test affect test-taker subgroups differently, there is potential for unfairness.

The design of the Duolingo English Test (DET) reflects the easy-to-hard ordering approach. The test begins with relatively simple vocabulary tasks, and its computer-adaptive format ensures that each test taker receives items calibrated to their proficiency level throughout: neither so difficult as to be demoralizing nor so easy as to be uninformative. This is one expression of the DET's broader commitment to the test-taker experience and to minimizing sources of test-induced anxiety.

Why do language tests have skill-based sections?

Beyond item difficulty, our review raises a more fundamental question: why do language tests group all listening items together, then all reading items, and finally writing and speaking items?

This arrangement is standard practice across the field. But we found essentially no empirical evidence that skill-based grouping leads to better measurement or fairer outcomes than, for example, alternating between task types. Section ordering in language testing appears to be driven largely by convention, intuition, and the practical constraints of paper-based and group-administered testing, not by research demonstrating that it is optimal.

The DET does not strictly group items by skill. In the first half of the test, reading and listening tasks alternate; in the second half, writing and speaking tasks alternate. This structure reflects practical and psychometric design decisions rather than a claim that alternating task types is empirically superior to skill-based grouping, because, as our review makes clear, that evidence does not yet exist. What is clear is that the conventional approach deserves scrutiny rather than an assumption of superiority.

Why language test ordering matters for fairness and validity

For admissions offices and faculty who rely on language proficiency scores, these findings point to an important consideration: design choices that seem unremarkable, like the order in which sections appear, are not neutral. They can influence test-taker experience and, potentially, how accurately scores reflect actual proficiency.

Ordering is a fairness and validity issue, not merely a logistical one. A test that introduces unnecessary anxiety or disadvantages lower-proficiency test takers through its structure adds construct-irrelevant variance — noise that makes scores less meaningful. Institutions deserve to understand how the tests they accept are designed, and to ask whether those choices are grounded in evidence.

What comes next for language test design research

Perhaps the most important takeaway from our review is how much remains unknown. The research base on test ordering in language assessment is thin, raising more questions than it answers. Those questions matter especially for computer-adaptive and digital-first tests, which have a degree of technical flexibility that paper-based tests never did. The order in which content is delivered no longer has to be fixed, opening the door to dynamic ordering approaches and to rethinking assumptions about test design that have gone largely unexamined for decades.

We hope this review will serve as a starting point for more empirical work. And if you're a researcher interested in this area, we encourage you to consider applying for one of the DET's competitive research grants or doctoral dissertation awards!
