Until now, large-scale speaking tests have always made a tradeoff between authentic interaction and scalable delivery. In practice, that's meant choosing between live interviews with human examiners and scripted, one-way prompts delivered by a computer.
But with the launch of Interactive Speaking, the Duolingo English Test introduces a first-of-its-kind innovation: a fully automated speaking task that simulates a real conversation, adapting to each response in real time.
This breakthrough in language assessment brings true interactivity to speaking, without sacrificing consistency, fairness, or scale.
The limits of traditional speaking tasks
High-stakes English tests have relied on two main types of speaking tasks: direct and semi-direct. Each offers advantages, but also disadvantages and tradeoffs.
Direct tasks involve a live conversation between a test taker and a human examiner. This format feels authentic—but it’s incredibly difficult to scale. It requires trained interviewers, standardized conditions, and time-consuming scoring. Even then, scoring can be inconsistent. Different interviewers may ask slightly different questions or apply rubrics differently, which raises questions about fairness, reliability, and comparability. And for many test takers, speaking to a live examiner—especially under timed, high-stakes conditions—can be a deeply stressful experience.
Semi-direct tasks eliminate the human interviewer. Instead, test takers hear a question, respond aloud, and move to the next one. These tasks are easier to administer, more scalable, and much easier to score consistently. But they come at a cost: they aren't interactive. There's no follow-up, no dialogue. The test doesn't respond to what the person says; it just queues up the next prompt. It's a sequence of monologues, not a conversation.
Both formats are based on a tradeoff: you can have authenticity (with a human), or efficiency (with a computer), but not both. And in today’s world, that’s no longer good enough.
We needed a third option. One that combines the realism of real-time interaction with the consistency and accessibility of modern technology.
A more authentic alternative—powered by AI
Interactive Speaking is built to fill that gap. It simulates a live conversation through a sequence of question-and-answer turns between the test taker and a virtual character to elicit real-world interaction skills, like topic development, contextual appropriateness, and spoken response organization.
Based on what they say, a test taker may get a follow-up question asking them to elaborate on a topic they mentioned in their previous response. We also evaluate test takers' spoken fluency, a crucial feature of successful communication.
Each test taker experiences:
- 6 to 8 conversation turns
- 35 seconds to respond to each prompt
- No prep time—just like real conversation
- Personalized follow-up questions based on their previous answers

Every turn in the conversation is dynamically selected in real time, based on what the test taker has said so far. Responses are evaluated on the fly, both for what was said and for how well it answered the question, and these evaluations in turn guide which questions are asked next.
This kind of interaction would be almost impossible to script by hand. But with generative AI, adaptive scoring models, and real-time response analysis, it becomes not just possible, but practical.
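To make the mechanics concrete, here is a minimal sketch of what real-time turn selection could look like. It is purely illustrative: the class names, the `evaluate_response` placeholder, and the mapping from rubric key ideas to pre-written follow-ups are assumptions for this example, not a description of the production system.

```python
from dataclasses import dataclass

MAX_TURNS = 8          # conversations run 6 to 8 turns
RESPONSE_SECONDS = 35  # time allowed for each spoken response

@dataclass
class Turn:
    question: str
    key_ideas: list[str]        # ideas a strong answer is expected to touch on
    follow_ups: dict[str, str]  # key idea -> pre-written, expert-reviewed follow-up

@dataclass
class Evaluation:
    covered_ideas: set[str]     # key ideas the response actually addressed
    on_topic: bool              # did the response answer the question at all?

def evaluate_response(transcript: str, turn: Turn) -> Evaluation:
    """Placeholder for the automated step: transcribe, analyze, and score the response."""
    raise NotImplementedError

def select_next_question(turn: Turn, evaluation: Evaluation) -> str:
    """Pick the next pre-written question based on what the test taker just said."""
    # Prefer a follow-up that asks the speaker to elaborate on an idea they raised.
    for idea in evaluation.covered_ideas:
        if idea in turn.follow_ups:
            return turn.follow_ups[idea]
    # Otherwise fall back to a generic, expert-written elaboration prompt.
    return "Can you tell me a bit more about that?"
```

The key point the sketch preserves is that every candidate question is pre-written and expert-reviewed; the automation only decides which one to ask next.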
What’s behind the scenes
Building this system took a multidisciplinary team effort.
Applied linguists designed hundreds of conversation topics and rhetorical patterns. AI researchers developed generation and evaluation pipelines using state-of-the-art models like GPT-4o to create each component of the task, including the questions and the question-specific rubrics used to evaluate how well test takers answer them.
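As a rough idea of what such a generation pipeline can look like, here is a hedged sketch using the OpenAI Python SDK. The prompt wording, the JSON shape, and the `generate_question_and_rubric` helper are invented for illustration, not the team's actual prompts or pipeline, and any draft produced this way still goes through human review.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_question_and_rubric(topic: str) -> dict:
    """Draft one conversation question plus a question-specific rubric for human review."""
    prompt = (
        f"Write one conversational English-speaking question about '{topic}', "
        "then list 3-5 key ideas a strong answer would mention. "
        'Return JSON: {"question": ..., "key_ideas": [...]}.'
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    draft = json.loads(response.choices[0].message.content)
    # Drafts are not used directly: human content experts review and edit every item.
    return draft

print(generate_question_and_rubric("local food traditions"))
```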
Content experts and reviewers ensured all test content was fair, clear, and inclusive across all proficiency levels. Assessment scientists, psychometricians, and AI engineers built the scoring models that interpret response quality across six subskills: fluency, grammar, pronunciation, vocabulary, coherence, and task completion. Finally, engineers and product designers integrated the system into a seamless test experience.
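For the scoring step, here is one hedged way the six subskill scores could be rolled up into a single speaking score. The equal weights and the 0-to-1 scale are assumptions made for this sketch; the actual scoring models are trained statistical models, not a simple average.

```python
SUBSKILLS = ("fluency", "grammar", "pronunciation", "vocabulary", "coherence", "task_completion")

def overall_speaking_score(subskill_scores: dict[str, float]) -> float:
    """Combine per-subskill scores (assumed to be on a 0-1 scale) into one number."""
    missing = [s for s in SUBSKILLS if s not in subskill_scores]
    if missing:
        raise ValueError(f"missing subskill scores: {missing}")
    # A simple unweighted mean; the real model is a trained scoring model, not an average.
    return sum(subskill_scores[s] for s in SUBSKILLS) / len(SUBSKILLS)

example = {
    "fluency": 0.82, "grammar": 0.75, "pronunciation": 0.80,
    "vocabulary": 0.78, "coherence": 0.70, "task_completion": 0.85,
}
print(round(overall_speaking_score(example), 2))  # 0.78
```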
Unlike some recent attempts to use LLMs for spontaneous dialogue, Interactive Speaking avoids the risks of open-ended generation by using pre-written content, carefully curated and quality-controlled. Every prompt and rubric is reviewed by human experts, and the system evaluates responses using structured rubrics to guide scoring and follow-up selection.
This balances authenticity with reliability, making it suitable for high-stakes use.
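To show what a structured rubric can mean in practice, here is a small, hedged example of checking a transcript against a question-specific rubric. The literal phrase matching below is a stand-in for the model-based analysis the test actually uses; the point is the structured, reviewable shape of the rubric and its output.

```python
def key_idea_coverage(transcript: str, rubric: dict[str, list[str]]) -> dict[str, bool]:
    """Mark which rubric key ideas appear in a response.

    `rubric` maps each key idea to example phrasings. Real evaluation uses an AI model
    rather than literal phrase matching, but it produces the same kind of structured output.
    """
    text = transcript.lower()
    return {
        idea: any(phrase.lower() in text for phrase in phrasings)
        for idea, phrasings in rubric.items()
    }

rubric = {
    "names a favorite place": ["park", "cafe", "beach"],
    "gives a reason": ["because", "the reason"],
}
print(key_idea_coverage("I love the beach because it is quiet.", rubric))
# {'names a favorite place': True, 'gives a reason': True}
```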
How does Interactive Speaking impact scores?
Adding a new question type to a high-stakes test is never just about design. It’s also about impact. Will the new item produce reliable scores? Are those scores comparable to existing measures of speaking ability? Can we trust them to reflect what test takers can really do?
Our research team explored these questions in depth. Here’s what they found:
- Scores from Interactive Speaking are highly reliable. Internal consistency was strong, with reliability estimates above 0.90—on par with the most trusted speaking assessments used today (the kind of statistic behind an estimate like this is sketched after this list).
- Interactive Speaking scores correlate strongly with other measures of speaking proficiency. They aligned closely both with Duolingo's existing speaking tasks and with human-delivered interview scores like those from IELTS. In fact, correlations with IELTS speaking scores were slightly higher for Interactive Speaking than for traditional monologic tasks.
- The scoring system works in real time. Each response is automatically transcribed, analyzed, and scored using a combination of AI and structured rubrics. This allows the system to instantly choose follow-up questions and adapt the conversation based on the test taker’s performance—without introducing inconsistency.
- AI scoring is as consistent as human raters. When researchers compared how experts and AI scored task completion, agreement levels were nearly identical. The AI identified which key ideas had been addressed in a response with the same level of accuracy and reliability as trained human raters.
- The task captures interactional skills that monologic tasks often miss. Linguistic analysis showed that Interactive Speaking responses were more involved, more personal, and more characteristic of real-life dialogue—evidence that this item captures different dimensions of language use than traditional tasks.
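To ground the figures above: internal consistency and rater agreement are commonly summarized with statistics such as Cronbach's alpha and Cohen's kappa. The sketch below shows how those are computed; treat them as illustrative stand-ins rather than the exact estimators used in the research, and note that the data here is invented toy data, not study data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Internal-consistency estimate; rows are test takers, columns are scored items."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Invented toy data: 5 test takers x 4 scored turns.
scores = np.array([
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [1, 2, 1, 1],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")

# Invented human vs. AI judgments of whether each rubric key idea was addressed.
human = [1, 0, 1, 1, 0, 1, 1, 0]
ai    = [1, 0, 1, 1, 0, 1, 0, 0]
print(f"kappa = {cohen_kappa_score(human, ai):.2f}")
```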
In short, the new item improves the quality of what’s being measured, without sacrificing reliability or fairness.
Inclusive by design
Because the test adapts to each person's responses, it can adjust difficulty in real time. This in itself is an innovation: the Interactive Speaking task is the first adaptive production (that is, writing or speaking) task in large-scale testing.
This adaptive design means everyone has the opportunity to show what they know, fairly. Test takers are never overwhelmed by questions that are too hard, and more advanced speakers are challenged with richer, more complex prompts. And because the system can flag and avoid repeat content across conversations, it also protects test security while supporting continuous content refresh.
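One hedged sketch of what that adaptive, security-aware selection could look like: filter a vetted prompt bank to a difficulty band around the current ability estimate, and never reuse a topic the test taker has already seen. The bank structure, the 0-to-1 difficulty scale, and the band width are assumptions for illustration.

```python
from dataclasses import dataclass
import random

@dataclass(frozen=True)
class BankItem:
    topic_id: str
    question: str
    difficulty: float  # assumed scale: 0 (easiest) to 1 (hardest)

def pick_next_prompt(bank: list[BankItem], ability: float,
                     seen_topics: set[str], band: float = 0.15) -> BankItem:
    """Choose a prompt near the current ability estimate, avoiding repeated topics."""
    unseen = [item for item in bank if item.topic_id not in seen_topics]
    if not unseen:
        raise ValueError("prompt bank exhausted for this test taker")
    in_band = [item for item in unseen if abs(item.difficulty - ability) <= band]
    # If nothing fits the band, fall back to the closest unseen prompt rather than repeating.
    pool = in_band or [min(unseen, key=lambda item: abs(item.difficulty - ability))]
    return random.choice(pool)
```

In a design like this, the ability estimate would be updated turn by turn from the scoring models described earlier.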
Redefining what a speaking test can be
With Interactive Speaking, we have shown that it's possible to combine the best of both worlds: the authenticity of a back-and-forth interaction, the efficiency and consistency of computer-based testing, and the innovation of generative AI, grounded in expert-reviewed design.
For learners, the experience is more natural and less stressful. For institutions, the scores offer richer insights into communicative ability. And for the field of language testing, it opens the door to a new generation of assessments—ones that are interactive, adaptive, and powered by human–AI collaboration.