Single-response and multiple-choice response formats on a sixth-grade mathematics test: Evidence of construct validity

Date of Completion

January 1996


Education, Mathematics|Education, Tests and Measurements|Education, Elementary|Education, Educational Psychology




This study investigated evidence of construct validity for single-response (SR) and multiple-choice (MC) response formats on a sixth-grade mathematics test. Two test forms, each consisting of 20 stem-equivalent items in either a single-response or multiple-choice format, were created to measure computation, pattern recognition, one- and two-step problem solving, and geometry. A total of 1,680 Connecticut public school sixth-grade students were administered a test form on each testing occasion. The forms were randomly assigned to examinees by spiraling them into each classroom, and the counterbalanced test-retest design produced four equivalent groups of examinees.

The stem-equivalent items were more difficult and more discriminating when the SR format was used; however, there was no difference in the estimated reliability coefficient between the two test forms. Item analyses based on item response theory (IRT) indicated that the two forms were essentially unidimensional, involved only minimal guessing, and fit a two-parameter logistic IRT model. Confirmatory factor analysis showed that both forms fit the hypothesized four-factor structure based on the four content areas.

The correlational analyses indicated that (a) the SR format rank-ordered examinees more consistently, (b) item stability tended to be stronger when the same response format was used but also tended to be a function of content, and (c) there were no differences between the correlations of the response formats with the three external measures of student achievement.

The generalizability study results indicated that the total scores for the two response formats were stable. Additionally, there was evidence of convergent validity across the four objective-level scores. The multitrait-multimethod analyses provided further evidence of construct validity.
That is, there was significant evidence of convergent validity and a sufficient lack of evidence of discriminant validity or method variance. Finally, a number of items functioned differentially between the response formats: comparison of the item characteristic curves for two subgroups of examinees indicated that differences in subgroup performance could be attributed to the response format used.
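The two-parameter logistic (2PL) model and the item characteristic curves referred to above can be sketched as follows. This is an illustrative sketch only: the item parameters below are hypothetical, not taken from the study, and the D = 1.7 scaling constant is the conventional normal-ogive approximation, assumed here for familiarity.

```python
import math

def icc_2pl(theta, a, b, D=1.7):
    """Two-parameter logistic item characteristic curve: the probability
    of a correct response given ability theta, item discrimination a,
    and item difficulty b. D = 1.7 is the usual scaling constant."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical stem-equivalent item: an SR version that is harder and
# more discriminating than its MC counterpart, consistent with the
# pattern the study reports (parameter values are invented).
theta = 0.0                              # an examinee of average ability
p_mc = icc_2pl(theta, a=0.8, b=-0.2)     # easier, less discriminating MC item
p_sr = icc_2pl(theta, a=1.2, b=0.4)      # harder, more discriminating SR item
# The harder SR item yields the lower probability of success (p_sr < p_mc);
# plotting icc_2pl over a range of theta for two subgroups is how the
# differential functioning comparison above would be visualized.
```

At theta equal to the difficulty b, the 2PL curve passes through 0.5 regardless of a; the discrimination parameter only controls how steeply the curve rises around that point.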