This year’s Language Testing Research Colloquium (LTRC) took place in Bangkok, Thailand, at Chulalongkorn University, under the theme “Language Assessment in Multicultural Contexts: West Meets East.” The Duolingo English Test was honored to serve as a platinum sponsor, helping bring together researchers from across the globe to examine how assessments must evolve to meet the needs of diverse populations.

We joined researchers from around the world to explore how assessments can better reflect the linguistic and cultural diversity of today’s test takers, and presented new work that delved into critical issues shaping the future of testing: fairness, contextualization, accessibility, and transparency.

Photo, from left: DET researchers Yigal Attali, Andrew Runge, Yena Park, Jacqueline Church, Alina von Davier, Will Belzak, Ramsey Cardwell, and Geoff LaFlair

AI, human judgment, and collaboration

In the symposium Beyond “In the Loop”: Human-AI Collaboration in Human-Centered Language Assessment, the DET team joined a group of leading researchers to explore how AI can support—not supplant—human expertise in assessment.

Geoff LaFlair, Alina von Davier, and Jill Burstein presented Human-centered AI for Language Assessment Development and Administration, sharing how the DET is designing systems that prioritize fairness, transparency, and human oversight at every stage of test development. Their work emphasized that human-centered design is more than a philosophy: it is an engineering principle that guides how AI tools are integrated and evaluated in real-world, high-stakes testing contexts.

Other presentations in the symposium included Eunice Jang and Liam Hannah on human-AI teaming in scoring environments, Erik Voss on strategies for making AI scoring models more interpretable, and Alistair Van Moere and Jing Wei on frameworks for achieving mutual understanding between humans and AI systems.

Together, the panel advanced a shared vision: that responsible AI in assessment must be intelligible, adaptive, and human-guided.

Can expert reviewers detect item bias?

Jacqueline Church, Will Belzak, Yigal Attali, and Yena Park presented a study that examined whether trained reviewers could reliably detect differential item functioning (DIF)—a marker of item bias—on DET test items. Their analysis compared reviewer judgments to logistic regression results, using overall DET scores as a matching proxy.
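
For readers curious about the mechanics, the sketch below shows the standard logistic regression approach to DIF screening that this kind of analysis builds on: each item response is modeled from the matching variable (overall score), group membership, and their interaction, with likelihood-ratio tests flagging uniform and non-uniform DIF. It is an illustrative sketch only, not the team's actual pipeline; the column names, simulated data, and statsmodels-based implementation are assumptions made for the example.

```python
# Minimal sketch of logistic-regression DIF screening (Swaminathan & Rogers style).
# NOT the DET team's actual analysis; column names and thresholds are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

def dif_logistic(df: pd.DataFrame) -> dict:
    """Screen one item for uniform and non-uniform DIF.

    Hypothetical columns:
      item_correct : 0/1 response to the item under review
      total_score  : overall test score, used as the matching criterion
      group        : 0/1 focal vs. reference group (e.g., gender or nationality)
    """
    # Nested models: matching only, + group (uniform DIF), + interaction (non-uniform DIF)
    m0 = smf.logit("item_correct ~ total_score", data=df).fit(disp=0)
    m1 = smf.logit("item_correct ~ total_score + group", data=df).fit(disp=0)
    m2 = smf.logit("item_correct ~ total_score + group + total_score:group", data=df).fit(disp=0)

    # Likelihood-ratio tests between the nested models
    lr_uniform = 2 * (m1.llf - m0.llf)      # evidence for uniform DIF
    lr_nonuniform = 2 * (m2.llf - m1.llf)   # evidence for non-uniform DIF

    return {
        "p_uniform": chi2.sf(lr_uniform, df=1),
        "p_nonuniform": chi2.sf(lr_nonuniform, df=1),
        "group_coef": m1.params["group"],   # direction and size of uniform DIF
    }

# Example usage with simulated data containing a small uniform DIF effect:
rng = np.random.default_rng(0)
n = 2000
total = rng.normal(0, 1, n)
group = rng.integers(0, 2, n)
p = 1 / (1 + np.exp(-(0.2 + 1.1 * total + 0.3 * group)))
demo = pd.DataFrame({"item_correct": rng.binomial(1, p),
                     "total_score": total,
                     "group": group})
print(dif_logistic(demo))
```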

They found that gender-based DIF was more easily detected than nationality-based DIF, likely due to greater global awareness of gender-related fairness issues. They also observed that image-based items were slightly easier to evaluate, possibly because visuals offer more intuitive cues for identifying bias.

“One writing prompt about academic preferences ended up favoring Chinese test takers,” noted Jacqueline. “It wasn’t something we would have flagged as problematic, but the data revealed a measurable effect.”

The findings suggest the value of human review—but also its limits—and highlight the need for more robust reviewer training and culturally diverse perspectives in content evaluation.

What do localized images really do?

Geoff LaFlair, Jacqueline Church, and Andrew Runge shared research on the impact of localized images in writing prompts. Their study tested whether culturally familiar visuals influenced test taker responses compared to more generic, globally neutral images.

The results revealed a small but statistically significant effect on perceptions—indicating that familiarity may slightly shape how test takers engage with a prompt. Participants and discussants noted that test takers might feel more personally connected to familiar images, which could encourage narrative elaboration or even translanguaging when test takers reach beyond their English proficiency to express complex ideas.

“If someone connects deeply to an image but lacks the exact English words to express it, they might go off-topic or switch registers,” said Ramsey Cardwell during the Q&A. “We need to account for that kind of engagement.”

This line of inquiry sparked further discussion about how image selection can affect test validity and how test developers can balance contextual richness with fairness and comparability.

Rethinking what accommodations should look like

In a study on extended time accommodations, Ramsey Cardwell and Will Belzak, alongside collaborators Jill Burstein and Ruisong Li, explored whether giving test takers more time actually improves outcomes across different groups.

Their findings revealed that extended time doesn't always yield better performance, a pattern that was especially pronounced during sequences of cognitively demanding tasks presented in quick succession.

The study also revealed potential inequities in accommodation access: test takers from countries with limited disability support services were less likely to request accommodations, even when they might benefit from them. For autistic test takers, performance effects were mixed, pointing to a need for more personalized and flexible accommodation models.

Their research sparked discussion on alternatives, including micro-breaks or buffer screens, which could be integrated into the DET's structure without compromising test security.

What fairness looks like in remote proctoring

In two sessions on proctoring, Will Belzak and Alina von Davier examined how automated and human-monitored proctoring systems can better reflect test taker diversity.

Their first talk unpacked DET’s tiered consequence system and emphasized the importance of transparency, particularly when evaluating gaze behavior—which may vary widely across cultures and neurotypes.

“Some gaze behaviors might appear unconventional but aren’t inherently suspicious,” Will explained. “We design our systems to allow for that nuance.”

In their second session, Proctoring Language Assessments in Multicultural Contexts, the pair looked more broadly at how global differences in test-taking behavior intersect with proctoring tools and protocols. Their message: secure does not have to mean inflexible, and fairness must be a design priority—not an afterthought.

Looking ahead

At a conference focused on multicultural contexts, the DET team brought exactly what the moment called for: transparency, curiosity, and a commitment to designing assessments that reflect the diversity of today’s learners.

We’re deeply grateful to Chulalongkorn University for hosting this year’s colloquium with such warmth and care, and to the LTRC 2025 organizing committee for orchestrating a seamlessly run, intellectually vibrant event. And our sincere thanks to ILTA for continuing to provide a venue where language assessment researchers can share their work, exchange ideas, and grow together as a community.

We left Bangkok energized by the conversations we had—and inspired by the possibilities ahead!

