Statistical Foundation
Test-retest reliability corresponds directly to the definition of reliability in Classical Test Theory (CTT). CTT models each observed score as the sum of a true score and an error term: X = T + E. The reliability coefficient is the proportion of observed-score variance attributable to true-score variance, Var(T) / Var(X). In the test-retest method, the correlation between two administrations estimates this coefficient, provided that true scores are stable between sessions and the errors of the two sessions are uncorrelated. The measurement interval complicates both assumptions: too short an interval inflates the correlation through memory effects, while too long an interval deflates it through genuine changes in ability. For cognitive tests, intervals of 1 to 4 weeks are generally recommended as a balance between these competing concerns.
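The link between X = T + E and the test-retest correlation can be illustrated with a small simulation. This is a minimal sketch under idealized assumptions (stable true scores, independent normally distributed errors, no practice effect); the function names, the mean of 300 and the SD values are hypothetical choices for illustration, not figures from any real test.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_test_retest(n_people=20000, true_sd=30.0, error_sd=15.0, seed=0):
    """Simulate two administrations under CTT and compare the
    test-retest correlation with the theoretical reliability."""
    rng = random.Random(seed)
    # T: each person's stable true score (assumed mean of 300, illustrative).
    true_scores = [rng.gauss(300.0, true_sd) for _ in range(n_people)]
    # X = T + E, with a fresh, independent error drawn for each session.
    session_1 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    session_2 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    # Reliability = Var(T) / Var(X) = Var(T) / (Var(T) + Var(E)).
    theoretical = true_sd ** 2 / (true_sd ** 2 + error_sd ** 2)
    empirical = pearson_r(session_1, session_2)
    return theoretical, empirical


if __name__ == "__main__":
    theoretical, empirical = simulate_test_retest()
    print(f"theoretical reliability: {theoretical:.3f}")
    print(f"test-retest correlation: {empirical:.3f}")
```

With these settings the theoretical reliability is 900 / 1125 = 0.80, and the simulated correlation should land close to it. Adding a memory effect or a shift in true scores between sessions would pull the two values apart, which is the interval problem described above.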
Factors Affecting Reliability
Multiple factors reduce the reliability of cognitive tests. Practice effects are the most common concern, particularly the substantial improvement from the first to the second administration as participants learn task strategies and novelty-related anxiety fades. Diurnal variation (circadian fluctuations in arousal), day-to-day variation (differences in sleep quality, stress, and caffeine intake), and fluctuations in motivation all add error variance. Too few trials increase measurement error because the performance distribution is sampled inadequately. The Spearman-Brown prophecy formula quantifies how test length and trial count relate to reliability: a test lengthened by a factor k, using comparable trials, has predicted reliability k × r / (1 + (k - 1) × r), so doubling a test with reliability 0.70 raises the predicted value to about 0.82. Standardizing testing conditions across sessions minimizes extraneous variance.
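The Spearman-Brown relationship between test length and reliability can be sketched as follows. The function names are hypothetical, and the formula assumes the added trials are parallel to the existing ones (same true score, same error variance), which real trial sets only approximate.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length by `length_factor`
    (2.0 means twice as many trials), per the Spearman-Brown formula."""
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


def length_factor_needed(current, target):
    """Factor by which the test must be lengthened to move from the
    `current` reliability to the `target` reliability (formula inverted)."""
    if not (0.0 < current < 1.0 and 0.0 < target < 1.0):
        raise ValueError("reliabilities must be strictly between 0 and 1")
    return (target * (1.0 - current)) / (current * (1.0 - target))


if __name__ == "__main__":
    # Doubling a test whose reliability is 0.70.
    print(round(spearman_brown(0.70, 2.0), 3))        # 0.824
    # How much longer must that test be to reach 0.90?
    print(round(length_factor_needed(0.70, 0.90), 2))  # 3.86
```

The second call shows the diminishing return: moving from 0.70 to 0.90 takes almost four times as many trials, not twice as many.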
Application to Bench Score Interpretation
Correctly interpreting Bench results requires understanding each test's reliability characteristics. Reaction time tests generally show high reliability (r = 0.80-0.90) when scored as the median of 20 or more trials, but single-trial scores have low reliability and represent individual ability poorly. To judge whether a score change reflects genuine improvement rather than measurement noise, use the standard error of measurement (SEM), computed from the reliability coefficient as SEM = SD × sqrt(1 - r). A single observed score carries a 95% band of ±1.96 × SEM. A change score combines the error of two measurements, so the 95% threshold for a difference between two administrations is 1.96 × sqrt(2) × SEM rather than 1.96 × SEM. For a test with reliability 0.85 and an SD of 30 ms, the SEM is approximately 11.6 ms, so a single score is uncertain by about ±23 ms, and a retest change smaller than about 32 ms cannot be reliably distinguished from random fluctuation.
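The SEM arithmetic can be checked with a short sketch. The function and key names are hypothetical. The sketch reports two thresholds: 1.96 × SEM, which describes the 95% band around a single observed score, and 1.96 × sqrt(2) × SEM, the stricter value commonly used for the difference between two scores because each score contributes its own error. Treating the second as the threshold for retest changes is an assumption of this sketch and rests on equal, independent errors in both sessions.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def change_thresholds(sd, reliability, z=1.96):
    """Return the SEM plus two 95% thresholds (for the default z = 1.96)."""
    sem = standard_error_of_measurement(sd, reliability)
    return {
        "sem": sem,
        # Band around one observed score.
        "single_score_band": z * sem,
        # Threshold for a difference between two scores (assumes equal,
        # independent error in both sessions).
        "difference_threshold": z * math.sqrt(2.0) * sem,
    }


if __name__ == "__main__":
    result = change_thresholds(sd=30.0, reliability=0.85)
    for name, value in result.items():
        print(f"{name}: {value:.1f} ms")
    # sem: 11.6 ms
    # single_score_band: 22.8 ms
    # difference_threshold: 32.2 ms
```

Raising reliability shrinks both thresholds: at r = 0.90 with the same SD, the SEM falls to roughly 9.5 ms.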