When admissions teams, pathway providers, or policymakers evaluate English proficiency tests, the first reaction is often intuitive: Does this look like a legitimate language test? That instinct corresponds to face validity. Face validity refers to whether a test appears to measure the skill it is intended to assess. In high-stakes contexts, face validity can feel especially important. Tests that look unfamiliar may raise concerns about credibility, comparability with other tests, and defensibility in admissions or policy decisions, even before any technical evidence is reviewed.

These concerns are understandable. Institutions operate within complex ecosystems and must justify their assessment choices to multiple stakeholders. Tests need to be explainable and acceptable, not just technically sound. Face validity often serves as an initial signal of whether a test seems to belong in the category of “serious” language assessments.

But first impressions are not the same thing as evidence.

In this blog post, I unpack what face validity is, why it matters to stakeholders—and why it should never stand alone when evaluating the quality of a language proficiency test.

What is face validity, and why does it matter?

Face validity refers to the extent to which a test appears to measure the construct of interest. If a test asks learners to read a short passage and answer comprehension questions, many observers will conclude that it looks like a reading comprehension test. That intuitive judgement is the essence of face validity: does this assessment seem relevant and appropriate for the target skill? 

Importantly, face validity is subjective. It reflects the perceptions of test takers, admissions officers, or other stakeholders, rather than systematic analysis. Even so, it plays a meaningful role in applied settings. Assessments that lack face validity can undermine test takers’ confidence, reduce engagement, and create resistance among score users. In educational or institutional contexts, high face validity can support acceptance of scores among admissions officers and educators. It matters to real people, not just assessment designers. If stakeholders perceive a test as credible and relevant, they’re more likely to trust its outcomes. From a practical standpoint, it’s often the first hurdle a test must clear.

The limits of face validity

Here’s the catch: face validity is inherently superficial. It evaluates how a test looks, not how it functions as a measurement tool. Unlike more rigorous forms of validity evidence, such as content, construct, and criterion-related (including predictive) evidence, face validity doesn’t rely on empirical data or statistical analysis.

In language assessment, this distinction matters. Stakeholders’ impressions can be shaped by familiarity, norms, and expectations, rather than by strong evidence of performance (e.g. Green, 2020). For example, a test with questions that appear to measure English skills might still fail to produce scores that relate to important outcomes like academic success or communicative ability. And because perceptions vary across cultures and contexts, judgements about face validity can differ widely.

This is not to suggest that face validity should be ignored—it should be acknowledged and examined because it tells us how a test is likely to be perceived and understood by stakeholders, and where misunderstandings around test communication may arise. But it should not be treated as the primary—or sole—basis for evaluating a language proficiency test.

Why face validity alone is not enough

A recent article I co-authored in the ELT Journal takes up this very issue in the context of research on English language proficiency testing. In our Readers Respond piece, we note a common flaw: studies that lean heavily on stakeholders’ perceptions and draw broad conclusions about test quality from how tests are perceived rather than how they perform.

When research bases claims about an assessment’s effectiveness on face validity evidence alone, it risks overlooking the inferences that score users ultimately care about (Chapelle, 2012). Validity in testing is not a single property but a body of evidence supporting how test scores are interpreted and used. 

Perceptions matter, but they do not tell us whether scores predict academic success, align with instructional goals, or function consistently across populations. In our article, we discuss how relying solely on stakeholder impressions can lead to weak or misleading conclusions about a test’s validity, especially when those impressions are not triangulated with performance data or broader psychometric evidence. 

Building stronger validity arguments in language assessment

Stronger validity arguments for language assessments situate face validity within a broader framework of evidence. Face validity can signal that stakeholder groups recognize and accept a test’s relevance, but it does not demonstrate whether the test consistently measures the intended construct. 

Robust validity arguments draw on multiple sources, complementing surface-level impressions with empirical evidence: test performance data, relationships with external criteria, and analyses of how scores function across diverse populations and contexts. This approach respects stakeholders’ perceptions while maintaining scientific rigor.
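To make the contrast with surface impressions concrete, here is a minimal, purely illustrative Python sketch of the kind of criterion-related and subgroup evidence a fuller validity argument might draw on. The data are simulated and the criterion (a hypothetical first-year GPA) is invented for illustration; this is not an analysis of any real test.

```python
# Illustrative sketch (simulated data): one strand of criterion-related
# evidence is the relationship between test scores and an external outcome,
# here a hypothetical first-year GPA for the same test takers.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)

# Simulated proficiency scores and GPAs; the GPA is generated with a
# modest positive relationship to scores purely for demonstration.
scores = rng.uniform(60, 160, size=200)
gpa = np.clip(1.5 + 0.012 * scores + rng.normal(0, 0.4, size=200), 0, 4.0)

r, p = pearsonr(scores, gpa)
print(f"Criterion-related evidence: r = {r:.2f} (p = {p:.3g})")

# Evidence about consistent functioning across populations: compare the
# score-criterion relationship for two (hypothetical) subgroups.
group = rng.choice(["A", "B"], size=200)
for g in ("A", "B"):
    r_g, _ = pearsonr(scores[group == g], gpa[group == g])
    print(f"Group {g}: r = {r_g:.2f}")
```

None of this tells us how the test looks to stakeholders; that is exactly why perception data and empirical evidence need to sit alongside each other in a validity argument.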

Face validity is a useful starting point. It helps ensure tests feel right to those who take and use them. But it shouldn’t be mistaken for the final verdict on test quality. Confident and fair decisions about language proficiency tools require evidence that goes well beyond the surface. 

If you’re interested in these issues, I invite you to check out our Readers Respond piece in the ELT Journal and join the conversation about what evidence really matters in language assessment.

