Test-Retest Reliability

The correlation between scores when the same person takes the same test on different occasions, indicating measurement stability.

Test-retest reliability is a psychometric index expressed as the correlation coefficient (typically Pearson's r or the intraclass correlation coefficient, ICC) between scores obtained when the same individuals complete the same test at different time points. Values approaching 1.0 indicate stable measurement that reflects true individual ability rather than random error. For cognitive tests, r = 0.70 is commonly considered the minimum for practical utility, 0.80 or higher is good, and 0.90 or higher is excellent. Practice effects and diurnal variation are the primary factors that can reduce apparent reliability in cognitive assessment.
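As a rough illustration of the definition above, the sketch below computes Pearson's r for paired scores from two sessions and maps it onto the bands quoted in the text. The function names and the sample reaction times are hypothetical, chosen only for the example, and the ICC is not covered here.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between paired scores from two sessions."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    # Sum of cross-products and sums of squares of the deviations.
    cov = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    ss_1 = sum((a - mean_1) ** 2 for a in first)
    ss_2 = sum((b - mean_2) ** 2 for b in second)
    if ss_1 == 0 or ss_2 == 0:
        raise ValueError("scores in a session must not all be identical")
    return cov / sqrt(ss_1 * ss_2)


def reliability_label(r):
    """Map a coefficient onto the bands quoted in the text."""
    if r >= 0.90:
        return "excellent"
    if r >= 0.80:
        return "good"
    if r >= 0.70:
        return "minimum for practical utility"
    return "below the practical minimum"


# Hypothetical median reaction times (ms) for five people, two sessions.
session_1 = [250, 270, 300, 310, 280]
session_2 = [255, 265, 310, 300, 290]
r = pearson_r(session_1, session_2)
print(round(r, 2), reliability_label(r))
```

Five people is far too few for a stable estimate; the sample only shows the shape of the calculation.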

Statistical Foundation

Test-retest reliability corresponds directly to the definition of reliability in Classical Test Theory (CTT). CTT models each observed score as the sum of a true score and an error term: X = T + E. The reliability coefficient is the proportion of observed score variance attributable to true score variance, Var(T) / Var(X). In the test-retest method, the correlation between two administrations estimates this coefficient. The retest interval complicates the estimate, however: too short an interval inflates the correlation through memory effects, while too long an interval deflates it through genuine changes in ability. For cognitive tests, intervals of 1-4 weeks are generally recommended to balance these competing concerns.
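The link between X = T + E and the retest correlation can be checked with a small simulation. This is a sketch under idealized assumptions that are not part of the original text: normally distributed true scores and errors, independent errors across sessions, and no memory effects or ability change. The means and standard deviations are illustrative values.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)


def simulate_retest(n_people=5000, true_sd=30.0, error_sd=15.0, seed=0):
    """Return (theoretical reliability, simulated test-retest correlation)."""
    rng = random.Random(seed)
    # T: each person's stable true score (assumed mean of 300 ms).
    true_scores = [rng.gauss(300.0, true_sd) for _ in range(n_people)]
    # X = T + E, with a fresh, independent error draw in each session.
    session_1 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    session_2 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    # Reliability = Var(T) / Var(X) = Var(T) / (Var(T) + Var(E)).
    theoretical = true_sd ** 2 / (true_sd ** 2 + error_sd ** 2)
    return theoretical, pearson_r(session_1, session_2)


theoretical, estimated = simulate_retest()
print(f"theoretical {theoretical:.3f}, simulated {estimated:.3f}")
```

With these values the theoretical reliability is 900 / 1125 = 0.80, and the simulated correlation should land close to it. Real retest data depart from this whenever the independence assumptions fail.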

Factors Affecting Reliability

Multiple factors reduce cognitive test reliability. Practice effects are the most common concern, particularly the substantial improvement from the first to the second administration as participants learn task strategies and lose novelty-related anxiety. Diurnal variation (circadian fluctuations in arousal), day-to-day variation (differences in sleep quality, stress, and caffeine intake), and motivational fluctuations all introduce error variance. Too few trials increase measurement error because they sample the performance distribution inadequately. Test length and number of trials are tied to reliability through the Spearman-Brown prophecy formula: lengthening a test by a factor k raises a reliability of r to kr / (1 + (k - 1)r), so doubling a test with reliability 0.70 yields roughly 0.82. Standardizing testing conditions across sessions minimizes extraneous variance.
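The Spearman-Brown relationship mentioned above can be written in a few lines. The function name and the example reliabilities are illustrative, and the formula assumes the added trials are parallel to the existing ones (same difficulty and error variance).

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length by length_factor.

    length_factor = 2 means doubling the number of trials;
    length_factor = 0.5 means halving it.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    k = length_factor
    return k * reliability / (1.0 + (k - 1.0) * reliability)


# Doubling a test whose reliability is 0.70.
print(round(spearman_brown(0.70, 2), 2))  # 0.82
# Halving the same test.
print(round(spearman_brown(0.70, 0.5), 2))  # 0.54
```

The gains shrink as reliability rises: doubling a test that already sits at 0.90 moves it only to about 0.95.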

Application to Bench Score Interpretation

Correctly interpreting Bench results requires understanding each test's reliability characteristics. Reaction time tests generally show high reliability (r = 0.80-0.90) when scored as the median of 20 or more trials, whereas single-trial scores have low reliability and poorly represent individual ability. To judge whether a score change reflects genuine improvement rather than measurement noise, use the standard error of measurement (SEM), derived from the reliability coefficient as SEM = SD × √(1 - r). A single observed score carries a 95% band of ±1.96 × SEM. Because a change score combines the error of two administrations, the threshold for a real change is larger: 1.96 × √2 × SEM, often called the minimal detectable change. For a test with reliability 0.85 and an SD of 30 ms, the SEM is approximately 11.6 ms, so a single score is uncertain by about ±23 ms, and changes smaller than about 32 ms cannot be reliably distinguished from random fluctuation.
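The arithmetic in this example can be reproduced with a short sketch. It uses the standard CTT relation SEM = SD × √(1 - r). The function names are hypothetical, and the √2 factor in the change threshold is a commonly used convention that assumes independent errors of equal size in both sessions.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * sqrt(1.0 - reliability)


def single_score_band(sd, reliability, z=1.96):
    """Half-width of the band around one observed score (z * SEM)."""
    return z * standard_error_of_measurement(sd, reliability)


def minimal_detectable_change(sd, reliability, z=1.96):
    """Threshold for a difference between two observed scores.

    Both scores carry error, so the standard error of the
    difference is sqrt(2) * SEM.
    """
    return z * sqrt(2.0) * standard_error_of_measurement(sd, reliability)


# Values from the text: reliability 0.85, SD 30 ms.
print(round(standard_error_of_measurement(30, 0.85), 1))  # 11.6
print(round(single_score_band(30, 0.85), 1))              # 22.8
print(round(minimal_detectable_change(30, 0.85), 1))      # 32.2
```

The 1.96 × SEM figure describes the uncertainty around a single score; comparing two sessions calls for the larger √2-scaled threshold.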