Psychometric Considerations

Psychometrics is the science of psychological assessment. A primary goal of EdInstruments is to provide information on crucial psychometric topics, including validity and reliability (essential concepts of evaluation that indicate how well an instrument measures a construct), as well as additional properties worth considering when selecting a measurement instrument.

Validity is a topic of great importance that is often misunderstood; broadly, it concerns the accuracy of a measure. In the context of measurement, validity is the extent to which theory and evidence support the intended interpretations of scores for proposed uses.

There are several types of validity, or measures of accuracy, to consider when evaluating an instrument for use. These include:

  • Content validity - to what extent do the instrument’s items cover the facets of a given construct?
  • Convergent validity - to what extent do scores that should in theory be correlated actually correlate? (A brief illustration follows this list.)
  • Discriminant validity - to what extent does the measure capture a construct that is distinct from those captured by other measures?
  • Predictive and concurrent validity - to what extent do scores predict a future (or concurrent) measure or outcome?
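To make the convergent and discriminant questions concrete, here is a minimal sketch that correlates simulated scores from a hypothetical new reading instrument with an established reading measure and with a math measure. The variable names, sample size, and data are illustrative assumptions only, not output from any instrument in EdInstruments.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300  # hypothetical sample of students

# Simulated "true" abilities: reading and math are only modestly related
reading_ability = rng.normal(size=n)
math_ability = 0.4 * reading_ability + rng.normal(scale=0.9, size=n)

# Observed scores = ability + measurement error
new_reading_score = reading_ability + rng.normal(scale=0.5, size=n)
established_reading = reading_ability + rng.normal(scale=0.5, size=n)
math_test = math_ability + rng.normal(scale=0.5, size=n)

# Convergent evidence: the new instrument should correlate strongly with an
# established measure of the same construct.
r_convergent = np.corrcoef(new_reading_score, established_reading)[0, 1]

# Discriminant evidence: it should correlate more weakly with a measure of a
# different construct.
r_discriminant = np.corrcoef(new_reading_score, math_test)[0, 1]

print(f"convergent r = {r_convergent:.2f}, discriminant r = {r_discriminant:.2f}")
```

In a real review these correlations would come from published validity studies rather than simulation; the point is simply the expected pattern of high convergent and lower discriminant correlations.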

The greatest threats to validity include weak or limited content or construct coverage and construct-irrelevant variance. Questions one may consider to examine this concept include:

  • Is the instrument content supported by theory and prior research?
  • Was there expert review to support the content coverage?
  • Is there evidence that instrument items were reviewed by experts for sensitivity to gender, race/ethnicity, linguistic complexity, culture, or other characteristics relevant to your intended uses?

Weak content and construct validity can often result in biased assessments if, for example, a word problem assumes cultural knowledge but fundamentally asks about a mathematical construct. Students should not receive higher or lower scores for reasons beyond the knowledge or skills being assessed (Pentimonti et al., 2019b). Users should examine whether information is available on differential item functioning or measurement invariance.

When determining whether an instrument is valid for your desired use, the instrument should have been previously validated using a sample that is representative of your population of interest (Pentimonti et al., 2019a). At a high level, instruments can only be supported by validity evidence for particular interpretations and uses with particular populations. Even when prior research supports the use of a given measure, adapting it for a new group of students or a different purpose can render that research moot, or at least less relevant.

Reliability evaluates the consistency of a measure. Developers will often share statistical reports on several types of reliability, including:

  • Internal consistency reliability - how related are individuals’ responses across items on a measure? Researchers often report Cronbach’s alpha, which broadly examines how correlated responses are across items that theoretically capture the same construct (a simple computation of this and related reliability statistics is sketched after this list).
  • Test-retest reliability - how stable are individual responses to items over time? A researcher may want to measure whether a construct changes over time and ensure that any differences between pre- and post-measures are attributable to events occurring between administrations (e.g., an intervention) rather than natural noise in the measure. A measure of an adult’s height should have strong test-retest reliability: readings taken a year apart will be nearly identical.
  • Inter-method or “parallel forms” reliability - how stable are individual responses to item groups across formats? For example, you might want to measure a similar construct on a pre- and post-survey without replicating the exact same questions. An estimate of inter-method reliability for the two sets of questions indicates whether they reliably measure the same underlying construct.
  • Inter-rater reliability - how similarly will two (or more) observers rate an observed behavior or response? For example, if you wish to evaluate classroom interactions, you want to make sure different classroom observers code similar behaviors in the same way, so that any differences in observation scores reflect differences in classroom interactions rather than differences in observers’ interpretations.
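To make these reliability statistics concrete, here is a minimal sketch, assuming simulated item responses and observer codes, that computes Cronbach’s alpha, a test-retest correlation, and Cohen’s kappa for inter-rater agreement. The sample sizes, noise levels, and helper functions are illustrative assumptions, not values from any particular instrument.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of item scores."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def cohens_kappa(rater_a: np.ndarray, rater_b: np.ndarray) -> float:
    """Cohen's kappa for two raters assigning the same set of categories."""
    categories = np.union1d(rater_a, rater_b)
    observed = np.mean(rater_a == rater_b)
    expected = sum(np.mean(rater_a == c) * np.mean(rater_b == c) for c in categories)
    return (observed - expected) / (1 - expected)

rng = np.random.default_rng(0)

# Internal consistency: 200 students, 6 items that all tap one construct
trait = rng.normal(size=(200, 1))
items = trait + rng.normal(scale=0.8, size=(200, 6))
print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")

# Test-retest: correlate total scores from two administrations
time1 = items.sum(axis=1)
time2 = time1 + rng.normal(scale=3.0, size=200)  # retest with measurement noise
print(f"test-retest r: {np.corrcoef(time1, time2)[0, 1]:.2f}")

# Inter-rater: two observers coding 100 classroom segments into 3 categories
true_codes = rng.integers(0, 3, size=100)
rater_a = np.where(rng.random(100) < 0.9, true_codes, rng.integers(0, 3, size=100))
rater_b = np.where(rng.random(100) < 0.9, true_codes, rng.integers(0, 3, size=100))
print(f"Cohen's kappa: {cohens_kappa(rater_a, rater_b):.2f}")
```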

In addition to evaluating an instrument’s reliability and validity, users must consider the intended purpose of the instrument (i.e., for whom it is intended, the proposed purposes and contexts of use, the implications for students, with whom scores should be shared, and the spectrum of positive and negative consequences of these decisions) and whether this matches their own intended use. These considerations fall under the category of “consequential evidence”: whatever the technical qualities of a measure, ensuring that its use leads to beneficial outcomes for students, or at least does no harm, should be paramount.

Evidence that scores function comparably across groups should, at a minimum, include differential item functioning (DIF) analyses or, better yet, tests of measurement invariance such as multigroup confirmatory factor analyses.

Questions one may consider to examine this concept include:

  • What is the intended age range?
  • Is there an intended (sub-)population under consideration?
  • Is the definition of the construct being measured consistent with your intended interpretation, and does it support your intended use?
  • Is there evidence that the measurement model parameters are consistent across groups, i.e., that there is no bias across gender, English learner (EL) versus non-EL status, or race/ethnicity? (A light-weight DIF check is sketched after this list.)
  • Is there evidence that scores from the instrument are associated (correlated) with other important variables?
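As a rough illustration of the DIF analyses mentioned above, the sketch below runs a common logistic-regression DIF screen on simulated data. The group labels, effect sizes, and the use of simulated ability in place of an observed total score are assumptions made purely for illustration, not a prescription for any specific instrument.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000

# Simulated data: overall ability, a grouping variable, and one item that is
# harder for group 1 even at the same ability level (uniform DIF).
ability = rng.normal(size=n)
group = rng.integers(0, 2, size=n)      # e.g., EL (1) vs. non-EL (0)
logit = 1.2 * ability - 0.8 * group     # the -0.8 term is the injected DIF
item_correct = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Logistic-regression DIF screen: predict item correctness from the matching
# variable (here, simulated ability stands in for the total score) and group
# membership. A significant group effect after conditioning on the matching
# variable flags potential uniform DIF.
X = sm.add_constant(pd.DataFrame({"ability": ability, "group": group}))
fit = sm.Logit(item_correct, X).fit(disp=0)
print(f"group effect = {fit.params['group']:.2f}, p = {fit.pvalues['group']:.3g}")
```

In practice, the matching variable would be the observed total score (or a purified version of it), and flagged items would be reviewed by content experts rather than removed automatically.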

Practitioners use assessments for three primary purposes:

- Screening students to identify those who may need additional support;
- Diagnosing specific skill strengths and deficits to inform instructional approach, either for a full class or for students needing supplemental and more intensive support; and
- Progress monitoring to assess rate of progress and make decisions about when a change in instructional approach is needed, for students receiving supplemental and intensive support.

When considering whether to use a given instrument, practitioners should keep the following considerations in mind.

Screening Assessments Are Not Diagnostic
As schools seek to reduce the amount of time students spend being assessed in favor of instruction, there is often a desire to select efficiently administered assessments and use them for multiple purposes. Most commonly, this occurs when screening assessments are adopted for diagnostic purposes because of their purported alignment with curriculum and use. However, screening and diagnostic assessments are constructed for fundamentally different purposes: screeners are brief and broad, while diagnostic assessments probe specific skills in depth. It is important for practitioners to use assessments aligned with their intended purpose.

Accessibility and Accommodations
School practitioners should consider issues of accessibility and accommodations for students being assessed. Some students may require additional time, for example. For timed mathematics tests, however, applying an extra-time accommodation may compromise the integrity of the scores. Similarly, for tests in which the tested construct is students’ ability to perform computations, calculators may not be appropriate. Educators need this information to determine which accommodations can be provided while still maintaining the integrity of the results. Some assessments specify the accessibility features available (e.g., items are read aloud by the assessor and the examinee may respond with gestures such as pointing, which can permit scoring of responses from individuals with limited expressive abilities). When assessing students with low-incidence disabilities (e.g., visual impairment, hearing loss, significant cognitive impairment) or moderate to severe disabilities, extensive modifications may be needed to make established assessments accessible, which may render the resulting scores of little value.

Timing and Scope of Criterion Measures
School systems often seek to be more efficient in their administration of assessment measures. As a result, there is often a drive to adopt a single assessment system that spans as many grade levels as possible. However, not all measures have strong reliability and validity evidence across grades, and as a consequence their scores may not provide sufficient evidence for the school’s needs. For example, assessment systems that span early elementary and later elementary or even middle school frequently have not been validated at each grade level.

If you are interested in learning more about the validity of scoring approaches, please see the study cited below:

Kuhfeld, M., & Soland, J. (2020). Avoiding bias from sum scores in growth estimates: An examination of IRT-based approaches to scoring longitudinal survey responses. Psychological Methods. Advance online publication. https://doi.org/10.1037/met0000367