Combined Measures Beat Single Measures
Another finding that emerges from research on screening is that approaches that combine several measures have better screening accuracy than approaches based on single measures. Both Foorman et al. (1998) and O'Connor and Jenkins (1999) found improved classification accuracy from combining scores on several measures.
Validity-Types of Evidence
An efficient screening procedure is one that obtains significant information about many individuals in very little time. The efficiency requirement leads screening researchers to identify reading-related traits, design brief assessments of those traits, and evaluate their potential utility as screens. Screening researchers report two types of validity evidence: criterion validity and classification accuracy.
Criterion Validity. Studies of criterion validity correlate performance on the candidate screening measure with performance on an established reading measure, administered concurrently, at a future time, or both. The strength of the correlation between the candidate measure and the established measure provides evidence of the new measure's validity. But simply documenting a relationship between a screening measure and later reading ability, or even documenting that a particular trait accounts for unique variance in reading ability, provides at best weak evidence regarding the utility of a measure for screening purposes. In fact, correlations between screening measures and criterion tests need not be particularly high as long as the screening measure distinguishes individuals who perform poorly from those who perform satisfactorily on the criterion measure. Including more challenging measurement content (e.g., word-level or text-level tasks) could serve to distinguish among middle, strong, and very strong readers on a future criterion test (thereby increasing the correlation between the new measure and the criterion measure), but unless the challenging items are particularly sensitive to individual differences in the low and middle skill ranges (i.e., distinguish between poor and satisfactory performers), including them is not an improvement.
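The point that a screen can classify well without correlating highly with the criterion can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the ten scores are invented (they are not data from any study cited here), and the names `pearson_r`, `CRITERION_CUT`, and `SCREEN_CUT` and their values are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Invented scores for ten students (illustration only).
screen = [1, 2, 2, 1, 6, 5, 6, 5, 6, 5]
criterion = [70, 75, 72, 78, 100, 130, 105, 125, 95, 135]

CRITERION_CUT = 85  # hypothetical: a criterion score below this is a poor outcome
SCREEN_CUT = 3      # hypothetical: a screen score at or below this is flagged

poor = [c < CRITERION_CUT for c in criterion]
flagged = [s <= SCREEN_CUT for s in screen]

r = pearson_r(screen, criterion)
# Poor readers the screen missed (false negatives).
missed = sum(p and not f for p, f in zip(poor, flagged))
# Satisfactory readers the screen flagged (false positives).
over_flagged = sum(f and not p for p, f in zip(poor, flagged))

print(f"r = {r:.2f}, false negatives = {missed}, false positives = {over_flagged}")
```

In this toy data the screen separates the four poor readers from the six satisfactory readers without error, yet the correlation is only moderate (about 0.73), because the screen does not rank the satisfactory readers in the same order as the criterion.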
Classification Validity. The strongest evidence for the validity of a screening measure derives from classification analyses. These analyses determine the accuracy (sensitivity and specificity) of a screen in distinguishing students who perform satisfactorily on a future criterion measure from those who do not. Criterion validity studies are informative for identifying measures that hold potential as screens, but classification studies are the sine qua non of screening research.
Speece et al. (2003) illustrate the relative value of classification analysis over correlational or regression approaches in evaluating the measurement content of screening tests. They examined the utility of several potential end-of-kindergarten screening measures in relation to several end-of-Grade 1 criterion measures, including Woodcock-Johnson (WJ) Letter-Word Identification (Word ID), WJ Word Attack (WA), and oral reading fluency. Kindergarten phonemic awareness (phonemic blending and phonemic elision) accounted for significant variance in all three criterion measures, but its sensitivity was poor, ranging from 43% to 67%. In the best case, phonemic awareness failed to identify 33% of the students who performed below criterion at the end of first grade (i.e., false negatives).
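How sensitivity and specificity are computed in a classification analysis can be sketched as follows. This is a minimal illustration and not the procedure of any cited study: the function name `classification_accuracy` is hypothetical, and the twenty students are invented, chosen only so that the sensitivity matches the 67% best case mentioned above.

```python
def classification_accuracy(flagged, poor_outcome):
    """Return (sensitivity, specificity) from two parallel lists of booleans.

    flagged[i]      -- True if the screen marked student i as at risk
    poor_outcome[i] -- True if student i later fell below the criterion
    """
    pairs = list(zip(flagged, poor_outcome))
    true_pos = sum(f and p for f, p in pairs)           # at risk, correctly flagged
    false_neg = sum((not f) and p for f, p in pairs)    # at risk, missed
    true_neg = sum((not f) and (not p) for f, p in pairs)
    false_pos = sum(f and (not p) for f, p in pairs)    # over-identified
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity

# Invented cohort: 6 students below criterion, 14 at or above it.
poor_outcome = [True] * 6 + [False] * 14
# The screen flags 4 of the 6 poor readers and 3 of the 14 satisfactory readers.
flagged = [True] * 4 + [False] * 2 + [True] * 3 + [False] * 11

sens, spec = classification_accuracy(flagged, poor_outcome)
false_negative_rate = 1 - sens

print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}, "
      f"false negatives = {false_negative_rate:.0%}")
```

A sensitivity of 4/6 (about 67%) means that 2 of the 6 students who later performed below criterion were never flagged, which is the 33% false-negative rate described above.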
Cut-Points and Cross-Validation.
Designing effective screening tools is an empirical process. The task of the researcher is to determine whether cut-points can be found on the screen that clearly distinguish between satisfactory and unsatisfactory outcomes on the criterion measure. This is accomplished by working backward from the criterion measure to the screening measure. There is always a trade-off between sensitivity (reducing false negatives) and specificity (reducing false positives). Because intervention researchers place a greater premium on sensitivity than on specificity, they select cut-points to limit false negatives. This was the strategy used by Foorman et al. (1998) and O'Connor and Jenkins (1999). The post hoc nature of the process guarantees acceptable sensitivity. Problems arise when attaining sensitivity produces unacceptable specificity (i.e., too many false positives, or over-identification). Because selecting cut-points in a post hoc fashion in effect guarantees desirable levels of sensitivity, this approach usually produces higher sensitivity levels than approaches where cut-points are selected more arbitrarily (e.g., using the 25th percentile of the screening measure to divide risk status). Speece and Case (2001) and Speece et al. (2003) employed the latter approach. These different approaches to marking cut-points must be considered when comparing the sensitivity of different screening measures.
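The contrast between a post hoc cut-point and a fixed percentile cut-point can be sketched in a few lines. The sketch rests on assumptions: the sixteen scores are invented, the helper names `sens_spec` and `post_hoc_cut` are hypothetical, and the "25th percentile" is approximated as the score of the student at the boundary of the bottom quarter. It is not the procedure used in any of the cited studies.

```python
def sens_spec(scores, poor, cut):
    """Flag students scoring at or below `cut`; return (sensitivity, specificity)."""
    tp = sum(1 for s, p in zip(scores, poor) if p and s <= cut)
    fn = sum(1 for s, p in zip(scores, poor) if p and s > cut)
    tn = sum(1 for s, p in zip(scores, poor) if not p and s > cut)
    fp = sum(1 for s, p in zip(scores, poor) if not p and s <= cut)
    return tp / (tp + fn), tn / (tn + fp)

def post_hoc_cut(scores, poor, target_sensitivity):
    """Work backward from known outcomes: return the lowest observed score
    that, used as a cut-point, reaches the target sensitivity."""
    for cut in sorted(set(scores)):
        sens, spec = sens_spec(scores, poor, cut)
        if sens >= target_sensitivity:
            return cut, sens, spec
    raise ValueError("target sensitivity cannot be reached")

# Invented screen scores; the first 5 students had poor criterion outcomes.
scores = [3, 5, 6, 8, 12, 7, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19]
poor = [True] * 5 + [False] * 11

# Post hoc: choose the cut-point that catches every poor reader.
cut_a, sens_a, spec_a = post_hoc_cut(scores, poor, target_sensitivity=1.0)

# Fixed rule: score of the student at the bottom-quarter boundary.
cut_b = sorted(scores)[len(scores) // 4 - 1]
sens_b, spec_b = sens_spec(scores, poor, cut_b)

print(f"post hoc cut   = {cut_a}: sensitivity {sens_a:.2f}, specificity {spec_a:.2f}")
print(f"percentile cut = {cut_b}: sensitivity {sens_b:.2f}, specificity {spec_b:.2f}")
```

In this toy data the post hoc cut-point reaches full sensitivity only by flagging 4 of the 11 satisfactory readers, while the percentile cut-point flags just one of them but misses 2 of the 5 poor readers. This is the trade-off described above, and it shows why sensitivity figures obtained under the two approaches are not directly comparable.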
The danger in generalizing sensitivity and specificity results from studies using a post hoc procedure for setting cut-points is that the cut-points may not hold up in cross-validation. When O'Connor and Jenkins (1999) applied cut-points derived from their first cohort to subsequent cohorts, they did not achieve the same level of sensitivity as they had obtained with their first cohort. Caution is warranted in using cut-scores that have not been cross-validated.
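Cross-validating a cut-point amounts to deriving it on one cohort and then applying it, unchanged, to another. The following sketch shows the idea under stated assumptions: both cohorts are invented, the helper names `sensitivity` and `derive_cut` are hypothetical, and nothing here reproduces the data or method of O'Connor and Jenkins (1999).

```python
def sensitivity(scores, poor, cut):
    """Share of poor-outcome students whose screen score is at or below `cut`."""
    poor_scores = [s for s, p in zip(scores, poor) if p]
    return sum(1 for s in poor_scores if s <= cut) / len(poor_scores)

def derive_cut(scores, poor, target):
    """Post hoc rule: lowest observed score whose sensitivity reaches `target`."""
    for cut in sorted(set(scores)):
        if sensitivity(scores, poor, cut) >= target:
            return cut
    raise ValueError("target sensitivity cannot be reached")

# Invented cohorts; in each, the first 4 students had poor criterion outcomes.
cohort1_scores = [2, 4, 5, 7, 6, 8, 9, 10, 11, 12]
cohort1_poor = [True] * 4 + [False] * 6
cohort2_scores = [3, 5, 8, 9, 6, 7, 10, 11, 12, 13]
cohort2_poor = [True] * 4 + [False] * 6

# Derive the cut-point on cohort 1, where the outcomes are already known.
cut = derive_cut(cohort1_scores, cohort1_poor, target=1.0)
sens_derivation = sensitivity(cohort1_scores, cohort1_poor, cut)

# Apply the same cut-point, without adjustment, to cohort 2.
sens_validation = sensitivity(cohort2_scores, cohort2_poor, cut)

print(f"cut = {cut}: cohort 1 sensitivity {sens_derivation:.2f}, "
      f"cohort 2 sensitivity {sens_validation:.2f}")
```

The cut-point is perfect on the cohort it was fitted to by construction, but in the second invented cohort two poor readers score above it and are missed. Only the second figure says anything about how the cut-point would behave with new students.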

