In language testing, there’s an ongoing debate: how much time should test takers be given to complete a writing task? On the one hand, longer durations might better reflect real-life writing processes; on the other hand, beyond a certain point, additional time may contribute little to our assessment of test takers’ writing skills.
This question led our assessment research team to conduct a study on writing task duration and its effects on test scores, reliability, and validity. Here’s what we discovered.
Research design: Comparing 5-minute and 20-minute writing tasks
To investigate, we asked adult L2-English writers to complete writing tasks with either a 5-minute or a 20-minute limit. Responses were then scored by both human raters and an automated writing evaluation (AWE) tool, allowing us to compare how these durations affected scores, score reliability, and score validity.
We set out to answer three key questions:
- Does writing performance improve with more time?
- Does giving more time enhance the reliability and validity of writing scores?
- How do scores differ when rated by humans vs. automated tools?
Finding 1: Longer tasks produce slightly higher scores, but gains are modest
As expected, writers produced longer and slightly higher-scoring responses in the 20-minute condition than in the 5-minute one. These differences partly reflect the fact that responses from both durations were rated against the same task expectations.
In terms of length, the gains weren’t proportional to the additional time: the task duration quadrupled, but the word count only doubled, meaning the average writing rate (words per minute) roughly halved. Critically, under both conditions, test takers demonstrated the full range of writing proficiency, from beginner to expert.
Finding 2: Reliability and validity remain stable across time limits
Our study found that the reliability and validity of scores were similar across the 5-minute and 20-minute tasks. Reliability measures, which assess score consistency, showed no significant differences between durations. Similarly, criterion validity (how well these scores aligned with other standardized test results, such as IELTS) was equally robust in both the short and long conditions.
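To make these two concepts concrete, the sketch below shows one common way reliability and criterion validity can be estimated: inter-rater correlation for reliability, and correlation with an external measure (such as IELTS) for validity. The data, variable names, and noise levels are invented for illustration; the study’s exact analyses aren’t detailed here.

```python
import numpy as np
from scipy import stats

# Hypothetical data: the same 200 responses scored by two independent
# human raters, plus each test taker's external IELTS writing result.
rng = np.random.default_rng(0)
true_ability = rng.normal(0, 1, 200)
rater_a = true_ability + rng.normal(0, 0.4, 200)  # rater A's scoring noise
rater_b = true_ability + rng.normal(0, 0.4, 200)  # rater B's scoring noise
ielts = true_ability + rng.normal(0, 0.5, 200)    # criterion, with its own error

# Reliability: how consistently two independent raters rank the same responses.
reliability, _ = stats.pearsonr(rater_a, rater_b)

# Criterion validity: how well the (averaged) task score aligns
# with an external standardized measure such as IELTS.
task_score = (rater_a + rater_b) / 2
validity, _ = stats.pearsonr(task_score, ielts)

print(f"inter-rater reliability r = {reliability:.2f}")
print(f"criterion validity      r = {validity:.2f}")
```

Under this framing, the study’s finding is that both correlations came out at similar levels whether responses were written in 5 or 20 minutes.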
These findings suggest that shorter tasks, if designed effectively, can be just as reliable and valid as longer ones. This has significant implications for test design: shorter tasks could provide a faster, less stressful way to assess writing ability without compromising quality. Moreover, given a fixed time for a writing assessment (say, 20 minutes), our results suggest it would be more beneficial to administer four short tasks than one long one, allowing for responses on different topics and with different communicative purposes; the sketch below illustrates the psychometric intuition.
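One classical way to see why several short tasks can outperform a single long one is the Spearman-Brown prophecy formula, which predicts the reliability of a composite of k parallel tasks from the reliability of a single task. The study doesn’t state that it applied this formula, and the 0.70 starting value below is an assumption; this is a standard psychometric illustration, not the study’s analysis.

```python
def spearman_brown(single_task_reliability: float, k: int) -> float:
    """Predicted reliability of a composite of k parallel tasks
    (classical Spearman-Brown prophecy formula)."""
    r = single_task_reliability
    return k * r / (1 + (k - 1) * r)

# If one short task has an (assumed) reliability of 0.70, a composite of
# four such tasks, fitting in the same 20 minutes, is predicted to reach ~0.90.
for k in (1, 2, 4):
    print(f"{k} task(s): predicted reliability = {spearman_brown(0.70, k):.2f}")
```

The formula assumes the tasks are parallel (measuring the same skill with comparable precision), which is exactly what well-designed short prompts on varied topics aim for.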
Finding 3: Automated and human scoring show consistent patterns
This lack of advantage for longer durations held for both human raters and the AWE tool. In fact, automated scores tended to show slightly higher reliability and validity than human scores. This consistency across scoring methods is a positive indicator that automated scoring, when carefully calibrated, can be a reliable and accurate measure in language assessment.
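For readers curious how human-automated agreement is typically checked, a common metric in automated writing evaluation is quadratically weighted kappa (QWK), which penalizes large score disagreements more heavily than small ones. This isn’t necessarily the statistic used in our study, and the rubric bands below are invented for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal scores (0-5 rubric bands) for ten responses.
human_scores = [2, 3, 3, 4, 1, 5, 2, 4, 3, 0]
awe_scores   = [2, 3, 4, 4, 1, 5, 2, 3, 3, 1]

# Quadratically weighted kappa: 1.0 = perfect agreement,
# 0.0 = chance-level agreement; distant disagreements cost more.
qwk = cohen_kappa_score(human_scores, awe_scores, weights="quadratic")
print(f"human-AWE agreement (QWK) = {qwk:.2f}")
```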
What this means for institutions
For admissions teams and assessment designers, the results offer evidence-based guidance:
- Shorter writing tasks can produce reliable and valid scores
- Increased duration does not necessarily improve measurement precision
- Multiple short tasks may provide broader evidence than one long task
- Automated scoring can support consistent and defensible evaluation
Writing assessment design is always a balance between measurement quality, operational efficiency, and test-taker experience. This study suggests that institutions do not need to assume that longer writing tasks are inherently more valid.
Instead, task design, scoring models, and evidence-based validation matter more than duration alone.
Looking ahead: Evidence-based writing task design
The results of this study open up new possibilities for efficient, reliable writing assessments. At the DET, we’re committed to using research to inform our test design and improve the experience for test takers and institutions alike. By creating tests that are both fair and practical, we can help ensure that English proficiency assessments accurately reflect a test taker’s skills while fitting seamlessly into their lives.
Frequently Asked Questions
Does longer writing time improve reliability in language assessments?
Not necessarily. In this study, 5-minute and 20-minute writing tasks showed similar reliability estimates. Increased duration did not meaningfully improve score consistency.
Are short writing tasks valid for high-stakes language testing?
Yes. When carefully designed, shorter writing tasks demonstrated criterion validity comparable to longer tasks, including strong relationships with external measures such as IELTS scores.
How does writing task length affect writing performance?
Longer tasks produced longer and slightly higher-scoring responses. However, the performance gains were modest and not proportional to the additional time provided.
Is automated writing evaluation (AWE) as reliable as human scoring?
In this study, automated scoring demonstrated reliability and validity indices comparable to, and in some cases slightly higher than, human raters. This supports the use of calibrated automated scoring in scalable language assessment.
Should long writing tasks be replaced with multiple short tasks?
Research suggests that, within a fixed total testing time, multiple short writing tasks may provide broader evidence across topics and communicative purposes than a single extended task.
What matters more than writing task duration in assessment design?
Task design, scoring models, and validation evidence have a greater impact on score quality than duration alone. Institutions should prioritize evidence-based design decisions rather than assuming longer tasks are inherently better.