Generalizability and dependability of behavioral assessment methods: A comparison of systematic direct observation and Direct Behavior Ratings

Date of Completion

January 2009


Education, Tests and Measurements|Education, Educational Psychology|Psychology, Psychometrics




Given the increased focus on accountability, the assessment tools used to gather information about student behavior must be technically sound. However, although the literature has paid much attention to the psychometric properties of academic assessment instruments (e.g., achievement tests, curriculum-based measurement [CBM]), examination of behavioral assessment methods is still in its relative infancy. Systematic direct observation (SDO) has generally been regarded as the "gold standard" in behavioral assessment; however, it has been suggested that Direct Behavior Rating (DBR) may be a more feasible way to collect adequate amounts of formative data. A need has emerged to weigh the psychometric benefits and practical utility of these existing methods in order to better inform decisions about which method to use. Generalizability theory (GT) was therefore used in the current study to determine and compare the number of DBR and SDO recordings needed to obtain a dependable estimate of student behavior.

Participants included 2 teachers and 12 students within the same kindergarten classroom in the Northeastern United States. Individual student engagement levels were simultaneously rated by 2 teachers using DBR and 2 external observers using SDO. GT was then used to estimate the proportion of rating variance attributable to the facets of person, rater, day, and rating occasion, as well as any relevant interactions. Although both methods were found to be equally sensitive to intra-individual differences in academic engagement, the influences of rater and time differed between SDO and DBR data. Analyses of the SDO data revealed that a minimal percentage of rating variability was attributable to differences between raters, whereas a substantial proportion of rating variance was explained by changes in student behavior over time. In contrast, a large proportion of the variance in DBR recordings was attributable to rater-related effects, suggesting the need for training if DBR data are to be examined across raters. Limitations of the current study, as well as recommendations for research and practice, are discussed.
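The question of how many recordings are needed for a dependable estimate can be illustrated with a small decision-study calculation. The sketch below is not the analysis used in the study: it assumes a simplified, fully crossed person x rater x occasion design (the study also included a day facet), and the variance components in the example call are hypothetical values chosen only for illustration. The function names and the 0.80 target are likewise assumptions.

```python
def coefficients(var, n_raters, n_occasions):
    """Return (generalizability coefficient, dependability coefficient Phi)
    for a fully crossed person x rater x occasion design.

    `var` maps each effect to its estimated variance component:
    p, r, o, pr, po, ro, and pro_e (three-way interaction plus residual).
    """
    n_ro = n_raters * n_occasions
    # Relative error: only effects that interact with persons.
    rel_error = var["pr"] / n_raters + var["po"] / n_occasions + var["pro_e"] / n_ro
    # Absolute error adds the main effects of the facets and their interaction.
    abs_error = rel_error + var["r"] / n_raters + var["o"] / n_occasions + var["ro"] / n_ro
    g_coef = var["p"] / (var["p"] + rel_error)
    phi = var["p"] / (var["p"] + abs_error)
    return g_coef, phi


def occasions_needed(var, n_raters, target=0.80, max_occasions=100):
    """Smallest number of occasions whose Phi reaches `target`, or None."""
    for n in range(1, max_occasions + 1):
        if coefficients(var, n_raters, n)[1] >= target:
            return n
    return None


# Hypothetical variance components (NOT taken from the study).
example = {"p": 40.0, "r": 5.0, "o": 2.0, "pr": 10.0, "po": 20.0, "ro": 1.0, "pro_e": 22.0}
print(coefficients(example, n_raters=1, n_occasions=5))
print(occasions_needed(example, n_raters=2))
```

Under this formulation, a large rater-related component (as reported for DBR) keeps the dependability coefficient low no matter how many occasions are added for a single rater, whereas a large person-by-occasion component (as reported for SDO) can be reduced by averaging over more occasions.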