Psychometric Considerations

Introduction

Psychometrics is the science of psychological assessment. A primary goal of EdInstruments is to provide summary information on measures and assessments so that you, the potential user of a given tool, are equipped to locate further detail and determine whether the tool is appropriate for your context and intended use case. Psychometric topics, including the evidence available to support valid and reliable interpretations of an instrument’s results for a particular intended use, are essential to consider when evaluating how well an instrument is likely to measure a construct. The sections below describe how validity and reliability are evaluated, along with additional properties worth considering when selecting instruments. Much of this information draws from the 2014 Standards for Educational and Psychological Testing, which are available free of charge at: https://www.testingstandards.net/.

Validity

Validity is a topic of great importance that is often misunderstood. In the context of measurement, validity can be thought of as the extent to which theory and evidence support the intended interpretations of scores for their proposed uses.

Among the most important points to keep in mind regarding validity is that validity evidence is always tied to a specific interpretation of assessment results for an intended use. Per the 2014 Standards (p. 11), “It is incorrect to use the unqualified phrase ‘the validity of the test’”. That is, validity is not a property of instruments themselves. Instead, we build a unified argument for the valid application of an instrument’s scores that is supported by multiple sources of evidence (described below).

When determining whether an instrument produces results that are valid for your desired use, we recommend conducting validation studies with your specific target population. If that is not possible, verify that previous validity evidence was generated using one or more samples representative of your target population. Even if prior research supports the valid and reliable use of a measure’s results in a particular context, adapting it for a new group or purpose requires additional validation to evaluate whether that evidence still holds in the new context.

The Standards for Educational and Psychological Testing emphasize that validity is a unitary concept, with multiple sources of evidence contributing to an overall judgment about the validity of a measure’s results. These sources of evidence include:

Evidence Based on Test Content: This source of evidence examines the relationship between the content of the test and the construct it is intended to measure. To what extent does the test measure facets of the construct as defined or expected by prevailing theory? For example, is the content of the items aligned with the skills or knowledge areas they aim to assess? Expert review is often crucial to evaluate whether an instrument’s content reflects the intended construct comprehensively and accurately.
Evidence Based on Response Processes: This source of evidence evaluates the cognitive, emotional, or behavioral processes that respondents engage in when answering the items. Are the processes elicited by the test consistent with the intended construct? For example, if a math test requires reading complex language, the validity of its results may be compromised for individuals with limited language proficiency. Another example might be a survey instrument designed to evaluate students’ self-efficacy. If the survey is written in such a way that students from different backgrounds understand a given item quite differently from each other, their responses and resulting self-efficacy scores may not be comparable.
Evidence Based on Internal Structure: This source of evidence examines how well the relationships among responses to test items align with the construct’s theoretical structure. Has the assessment developer conducted a factor analysis or other type of dimensionality study to reveal that an instrument’s items cluster together in ways reflecting the intended subdomains or facets of the targeted construct(s)? For example, do items measuring a broader construct such as student Engagement group together as expected into multiple subskill factors (e.g., Behavioral, Cognitive, Emotional)? What evidence is provided to demonstrate that an instrument’s empirical results reflect its underlying conceptual framework?
Evidence Based on Relations to Other Variables: This type of evidence evaluates the extent to which test scores are related to other measures or outcomes in theoretically expected ways. This encompasses:
- Convergent and divergent evidence: Convergent and divergent evidence is evaluated by assessing the degree to which a measure’s scores are related to theoretically similar (convergent) or different (divergent) constructs. This is typically done using statistical approaches such as estimating correlation coefficients, regression analyses, etc. For example, scores on a new measure of mathematical reasoning might be expected to correlate more strongly with established math achievement tests (convergent evidence) and less strongly with measures of reading comprehension.
- Test-criterion relationships: This refers to the degree to which test scores are related to relevant external criteria, usually either concurrently (i.e., at approximately the same time) or predictively (i.e., at a future time). For example, an assessment designed to measure competencies indicative of a high school student’s readiness for college might be expected to show predictive evidence of its scores correlating strongly with subsequent college GPA or graduation rates. Concurrent evidence might involve demonstrating a strong correlation between the college readiness measure and the student’s high school grades.
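
As a minimal sketch of how convergent and divergent evidence might be examined, the Python example below computes Pearson correlations between a new measure and two comparison measures. The scores are hypothetical and exist only for illustration; the function and variable names are assumptions, not part of any instrument or of the Standards. A real validation study would use a far larger, representative sample and would typically report confidence intervals as well.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for six students (illustrative only, not real data).
new_math_reasoning = [12, 15, 9, 20, 17, 11]
established_math = [48, 55, 40, 70, 62, 45]
reading_comprehension = [30, 33, 31, 32, 28, 34]

# Convergent evidence: expected to be a strong correlation.
r_convergent = pearson_r(new_math_reasoning, established_math)
# Divergent evidence: expected to be a weaker correlation.
r_divergent = pearson_r(new_math_reasoning, reading_comprehension)

print(f"convergent r = {r_convergent:.2f}, divergent r = {r_divergent:.2f}")
```

With these made-up numbers, the correlation with the established math test is much stronger than the correlation with reading comprehension, which is the pattern one would hope to see for a measure of mathematical reasoning.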

Threats to a measure’s valid interpretation can include the measure exhibiting limited content or construct coverage, and its results containing construct-irrelevant variance. For instance, interpretations of an instrument’s results are called into question when its items inadvertently measure skills, knowledge, or contextual factors beyond the intended constructs (e.g., requiring cultural knowledge to respond correctly to math problems).

Here are some questions to consider when evaluating the evidence presented to support valid interpretations of an instrument’s results for a given use with a specific population:

Is the instrument’s content supported by theory and prior research?
Was there expert review to ensure comprehensive content coverage?
Is there evidence that items were reviewed for sensitivity to characteristics such as gender, race/ethnicity, culture, linguistic complexity, or other relevant factors?

Potential instrument users should seek evidence that the measure functions equitably across diverse groups. This is usually evaluated via statistical approaches to examine differential item functioning or measurement invariance across groups of interest.

Reliability

Reliability can be defined as the degree to which an instrument’s scores are free of random measurement error for a given group. Evidence for reliability provides information about the consistency of a measure. For example, one might evaluate the consistency of scores across repeated administrations or among multiple sources of information (e.g., raters) on the same measure. Assessment developers often share statistical reports on the different types of reliability for their measures. These types of reliability coefficients include:

Internal Consistency: This examines how strongly related individuals’ responses are across items within a measure. Researchers often report Cronbach’s alpha (α), which summarizes how strongly items intended to measure the same construct covary. A high value of alpha suggests a high proportion of common variance among the items. You may also see other measures of internal consistency such as omega (ω) or rho (ρ) reported by instrument developers.
Test-Retest: This evaluates the stability of test scores over time. Researchers generate this type of evidence by administering the same instrument in the same population at multiple time points. For example, a measure of an adult’s height should exhibit very high levels of test-retest reliability, yielding consistent results over time.
Alternate-Forms: This assesses the consistency of scores across different forms or versions of the measure. For example, a researcher might administer two different forms with theoretically equivalent item sets as a pre-test and post-test. The item-level functioning of alternate forms can also be evaluated using approaches such as Item Response Theory or Factor Analysis.
Inter-Rater: This examines the extent to which different observers or raters produce consistent scores when evaluating the same behavior or response using the same measure. For example, if you are assessing classroom interactions via direct observations, strong inter-rater reliability demonstrates that differences in scores between raters primarily reflect actual differences in observed interactions rather than inconsistencies between raters.
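
To make the internal consistency coefficient more concrete, here is a minimal Python sketch of the standard Cronbach’s alpha formula: alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The response data are hypothetical, and the function name is an assumption made for illustration. In practice, analysts usually rely on an established statistics package rather than hand-written code.

```python
from statistics import variance  # sample variance (n - 1 denominator)

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # regroup scores by item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical responses: 5 respondents x 4 items on a 1-5 scale (not real data).
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

In this toy data set, respondents who score high on one item tend to score high on the others, so alpha is high. A sample this small is far too limited to support any real conclusion about an instrument.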

For all types of reliability, higher values indicate that the measure produces more consistent results. However, reliability statistics are not sufficient on their own to justify the interpretation or use of a measure’s results. They should be evaluated alongside evidence for validity to argue convincingly that a measure’s results are both consistent and appropriate for their intended use.

Intended Interpretations and Uses

In addition to evaluating evidence for the validity and reliability of a measure’s scores, potential users should consider their intended interpretations and uses carefully. Examples of issues to consider include identifying the target population, proposed purposes and contexts for use, and potential implications or consequences of score interpretations (and any decisions made on their basis).

Key questions to examine include:

What is the intended age range? Ensure the instrument’s content and results reporting are aligned with the expected developmental and cognitive characteristics of the population it is intended to assess.
Is the definition of the construct being measured consistent with your intended interpretation and use? Confirm that the construct aligns with the theoretical framework and practical application for your context.
Is there a specific intended population or sample under consideration? Determine whether there is evidence that an instrument produces valid and reliable results for the specific group or groups with which you intend to use it.
Is there evidence of fairness across groups? Examine whether measurement model parameters are consistent across demographic groups, indicating the instrument functions similarly among people from different backgrounds. Examples of such groups might include people of different genders or racial and ethnic backgrounds, individuals with disabilities, and English learners.
Are scores from the instrument associated (correlated) with other important variables? Are scores predictive of future student success? Do they relate to other similar or different instruments as intended?

The 2014 Standards emphasize that decisions about testing should also include consideration of the consequences of test use. This includes both intended positive outcomes such as improved instruction or equitable opportunities, and potential negative outcomes such as unintended biases or harm to specific groups of test-takers.

Analyses such as those designed to detect differential item functioning (DIF) and measurement invariance can provide critical evidence about whether an instrument’s scores function equivalently across groups. When such evidence is lacking, there is a greater risk that an instrument’s scores could unfairly advantage or disadvantage people from a particular background.
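
One widely used DIF screening technique is the Mantel-Haenszel procedure, which compares the odds of answering an item correctly between a reference group and a focal group after matching test-takers on total score. The Python sketch below computes the Mantel-Haenszel common odds ratio under stated assumptions: the counts are hypothetical, the function name is illustrative, and a full analysis would also include a significance test and an effect-size classification. It is offered as one example of a DIF analysis, not as the method any particular instrument developer used.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across matched score levels.

    strata: list of tuples (ref_correct, ref_incorrect,
                            focal_correct, focal_incorrect),
            one tuple per total-score level.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_ok, ref_bad, focal_ok, focal_bad in strata:
        n = ref_ok + ref_bad + focal_ok + focal_bad
        if n == 0:
            continue  # skip empty score levels
        numerator += ref_ok * focal_bad / n
        denominator += ref_bad * focal_ok / n
    return numerator / denominator

# Hypothetical counts for one item at three total-score levels (not real data).
strata = [
    (20, 30, 18, 32),  # low scorers
    (35, 15, 33, 17),  # middle scorers
    (45, 5, 44, 6),    # high scorers
]

odds_ratio = mantel_haenszel_odds_ratio(strata)
print(f"Mantel-Haenszel odds ratio = {odds_ratio:.2f}")
```

A value near 1 suggests that matched test-takers from both groups have similar odds of answering the item correctly. Values well above 1 suggest the item favors the reference group, and values well below 1 suggest it favors the focal group.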

Additional Considerations for Use by School Practitioners

Practitioners use assessments for three primary purposes:

Screening students to identify those who may need additional support;
Diagnosing specific skill strengths and deficits to inform instructional approach, either for a full class or for students needing supplemental and more intensive support; and
Progress monitoring to assess rate of progress and make decisions about when a change in instructional approach is needed, for students receiving supplemental and intensive support.

When considering whether to use a given instrument, practitioners should keep the following considerations in mind.

Screening Assessments Are Not Diagnostic

As schools seek to reduce the amount of time students spend being assessed in favor of instruction, there is often a desire to select efficiently administered assessments and use them for multiple purposes. Most commonly, this occurs when screening assessments are adopted for diagnostic purposes because of their purported alignment with curriculum and use. However, screening and diagnostic assessments are designed for fundamentally different purposes: a screener is built to flag which students may need additional support, whereas a diagnostic assessment is built to pinpoint specific skill strengths and deficits. It is important for practitioners to use assessments aligned with their intended purpose.

Accessibility and Accommodations

School practitioners should consider issues of accessibility and accommodations for students being assessed. Some students may require additional time, for example. For timed mathematics tests, however, applying an extra-time accommodation may compromise the integrity of the scores. Similarly, for tests in which the tested construct is students’ ability to perform computations, calculators may not be appropriate. Educators need this kind of information to determine which accommodations can be provided while still maintaining the integrity of the results. Some assessments specify the accessibility features available (e.g., items are read aloud by the assessor and the examinee may respond with gestures such as pointing, which can permit scoring of responses from individuals who may have limited expressive abilities). When assessing students with low-incidence disabilities (e.g., visual impairment, hearing loss, significant cognitive impairment) and moderate to severe disabilities, extensive modifications may be needed to make established assessments accessible to these students, and those modifications may render the resulting scores of little value.

Timing and Scope of Criterion Measures

School systems often seek to be more efficient in their administration of assessment measures. As a result, there is often a drive to adopt a single assessment system that spans as many grade levels as possible. However, not all measures have strong reliability and validity evidence across grades, and as a consequence the scores from those measures may not provide sufficient evidence for the school’s needs. For example, assessment systems that span early elementary school and later elementary school or even middle school have often not been validated at each grade level.

Do You Want to Learn More?

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association. Available at: https://www.testingstandards.net/

Flake, J. K., & Fried, E. I. (2020). Measurement schmeasurement: Questionable measurement practices and how to avoid them. Advances in Methods and Practices in Psychological Science, 3(4), 456-465. Available at: https://journals.sagepub.com/doi/10.1177/2515245920952393