We recently published a paper in Language Testing, a leading journal in the field, about the Duolingo English Test’s Interactive Writing task. The article shares what we’ve learned from developing, refining, and validating a new kind of writing task that asks test takers not just to respond to a prompt, but to build on their ideas across multiple stages. 

It also surfaces important lessons about how test takers respond to interactive tasks, what kinds of prompts support high-quality writing, and how automated systems can enhance human-informed design.

What we set out to test

Most writing tasks in high-stakes English assessments ask for a single response to a single prompt. The format is simple: you read a question, write your answer, and submit. But this structure doesn’t reflect how writing actually works outside of tests.

In academic and professional settings, writing is rarely a one-and-done activity. Writers often build on their ideas, respond to new information, and revise based on feedback or changing goals. A student might expand an essay after a professor’s comments. A job applicant might tailor a personal statement after learning more about a role. Even casual writing—emails, blog posts, proposals—often evolves over multiple drafts or in response to prompts from others.

We created the Interactive Writing task to bring assessment closer to that reality. Instead of a single response to a single prompt, the task presents test takers with a second, customized follow-up that pushes them to think more deeply, consider a new angle, or elaborate on what they didn’t cover the first time. It’s a small change in structure, but it has big implications for what writing assessments can capture.
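
For readers who like to see the mechanics, here is a minimal sketch of that two-stage exchange in Python. Everything in it is illustrative: the canned follow-up stands in for the customized selection described later in this post, and the function names are placeholders rather than anything from the operational test.

```python
# A minimal sketch of the two-stage exchange, for illustration only.
# select_followup here returns a canned prompt; the real task customizes
# the follow-up based on what the first response covered.

def select_followup(prompt: str, first_response: str) -> str:
    """Placeholder follow-up selection (see the theme-detection sketch below)."""
    return (
        "Think about a point you did not cover in your first answer. "
        "Describe it and explain why it matters."
    )

def run_interactive_writing(prompt: str, get_response) -> dict:
    """Run one two-stage exchange; get_response is any callable that
    returns the test taker's text (a UI handler, or input() for a demo)."""
    first = get_response(prompt)                  # stage 1: open prompt
    followup = select_followup(prompt, first)     # customized follow-up
    second = get_response(followup)               # stage 2: elaboration
    return {"prompt": prompt, "first": first,
            "followup": followup, "second": second}

if __name__ == "__main__":
    result = run_interactive_writing(
        "Some people prefer to study alone. What do you think?",
        get_response=input,  # type responses at the console for a quick demo
    )
    print(result["followup"])
```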

In our study, we tested whether this process worked the way we intended: whether it was fair, functional, and effective across a wide range of test takers.

Four things we learned

1. Theme detection can be accurate and scalable.

We trained and tested a theme detection model to identify the main ideas in test taker responses. A fine-tuned GPT-3.5 model struck the right balance between precision (not flagging ideas that weren’t there) and recall (not missing key ideas). This allowed the system to select follow-up prompts that were relevant and non-redundant, even in real time.
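
As a rough illustration of the idea (not our production pipeline), the sketch below asks a fine-tuned chat model which themes a first response already covers and then picks a follow-up about something new. The model ID, theme set, and candidate follow-ups are all hypothetical placeholders.

```python
# A hedged sketch of theme detection driving follow-up selection. The model
# id, theme labels, and candidate follow-ups are illustrative placeholders,
# not the ones used in the Duolingo English Test.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CANDIDATE_THEMES = ["cost", "convenience", "health", "environment"]
FOLLOWUPS = {
    "cost": "How does cost factor into your view? Give a concrete example.",
    "convenience": "How does convenience shape your choice? Explain.",
    "health": "What role does health play in your answer? Elaborate.",
    "environment": "How might the environment be affected? Discuss.",
}

def detect_themes(response_text: str) -> set[str]:
    """Ask a (hypothetically fine-tuned) chat model which themes appear."""
    completion = client.chat.completions.create(
        model="ft:gpt-3.5-turbo-0125:example-org::abc123",  # placeholder id
        messages=[
            {"role": "system",
             "content": "List the themes from this set that the response "
                        f"covers, comma-separated: {CANDIDATE_THEMES}"},
            {"role": "user", "content": response_text},
        ],
        temperature=0,
    )
    labels = completion.choices[0].message.content.lower()
    return {theme for theme in CANDIDATE_THEMES if theme in labels}

def pick_followup(response_text: str) -> str:
    """Choose a follow-up targeting a theme the writer has not yet covered."""
    covered = detect_themes(response_text)
    for theme in CANDIDATE_THEMES:
        if theme not in covered:
            return FOLLOWUPS[theme]
    return "Choose the point you find most important and develop it further."
```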

2. Prompt phrasing matters.

We experimented with four different ways to phrase follow-up prompts. One version, called the “free” format, led to more fluent, focused responses. This suggests that concise, flexible prompts better support test takers than those that sound overly scripted or prescriptive. Even small wording choices had measurable effects.

3. Writers elaborated in meaningful ways.

Human raters and AI models evaluated whether test takers were truly expanding on their original ideas. The results were clear: most second responses were on-topic, connected to the first, and added new information. In other words, test takers weren’t just repeating themselves—they were writing more like people do in the real world.
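
To make these checks concrete, here is one way they could be approximated with off-the-shelf tools: embedding similarity for topical relevance and connection, plus a crude lexical-novelty estimate. It’s a sketch of the idea, not the combination of human raters and AI models used in the study.

```python
# Rough proxies for "on-topic", "connected to the first response", and
# "adds new information", using the open-source sentence-transformers
# library. These are illustrative stand-ins, not the study's rating pipeline.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def elaboration_signals(prompt: str, first: str, second: str) -> dict:
    emb_prompt, emb_first, emb_second = model.encode(
        [prompt, first, second], convert_to_tensor=True
    )
    on_topic = util.cos_sim(emb_second, emb_prompt).item()   # relevance to prompt
    connected = util.cos_sim(emb_second, emb_first).item()   # builds on response 1

    # Novelty proxy: share of longer words in the second response that
    # did not already appear in the first response.
    first_words = set(first.lower().split())
    content_words = [w for w in second.lower().split() if len(w) > 3]
    new_info = sum(w not in first_words for w in content_words) / max(len(content_words), 1)

    return {"on_topic": on_topic, "connected": connected, "new_info": new_info}
```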

4. The task worked across different prompts and participants.

Across topics and proficiency levels, test takers were able to engage with the task and produce extended writing. The system selected appropriate follow-ups in real time, and did so without introducing unfairness into the test experience.

Why this research matters

Few large-scale tests attempt this kind of interactive design, especially in writing. And even fewer publish detailed validation studies in peer-reviewed journals.

This paper shows that AI can support richer writing assessment, not by replacing human input, but by enabling a test experience that adapts to the user. It also underscores the importance of prompt design, language modeling, and response analysis as core components of modern test development.

Altogether, the study supports what sociocognitive models of writing have long emphasized: writing isn’t just production; it’s also interaction. And assessment should reflect that.

We’re honored to have this work published in Language Testing, and grateful for the opportunity to contribute to the broader conversation about the future of language assessment.