Implications of Small Samples for Generalization: Adjustments and Rules of Thumb

Society for Research on Educational Effectiveness

Policy-makers are frequently interested in understanding how effective a particular intervention may be for a specific (and often broad) population. In many fields, particularly education and social welfare, the ideal form of these evaluations is a large-scale randomized experiment. Recent research has highlighted that sites in these large-scale experiments are typically not randomly sampled from the population, making generalizations difficult. A problem not addressed by this literature is the effect of "small" sample sizes on generalization. This paper addresses three questions regarding the effect of small sample sizes on: (1) assessments of generalizability; (2) rules of thumb for covariate balance; and (3) properties of estimators and estimation strategies. The authors compare results from rare-events logistic regression (RE) and standard logistic regression to determine if and when small-sample corrections matter. The study investigates these issues for sample sizes ranging from 30 to 70 clusters and for studies that are cluster-randomized or multi-site (random block) in design. Data examined were drawn from a cluster randomized controlled trial (Konstantopoulos, Miller, and Van der Ploeg, 2013) that was designed to study the effect of Indiana's benchmark assessment system on student achievement in mathematics and English Language Arts (ELA) based on annual Indiana Statewide Testing for Educational Progress-Plus (ISTEP+) scores. Fifty-six K-8 schools volunteered to implement the system in the 2009-10 school year. Of these, 34 were randomly assigned to the state's benchmark assessment system while 22 served as controls. Data from the experiment were supplemented by data on all of the other K-8 schools in the state of Indiana, which were used to define the inference population.
Based on simulation results, findings include: (1) the absolute standardized mean differences (|SMD|s) for the RE-logits were typically much smaller than those for the logits and, more importantly, were in line with the |SMD|s for the individual covariates; (2) the degree of imbalance between a sample and population is much larger under random sampling than would be expected from the rules of thumb commonly used in propensity score methods; and (3) the problem of small sample sizes limiting the number of equal-population strata possible in generalization is likely to arise simply from a change in random samples. Propensity score matching methods can be used to improve the generalizability of findings from randomized experiments with non-probability samples, but adjustments and new rules of thumb are necessary when applying these methods in this context. Tables and figures are appended.
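To make the two quantities in these findings concrete, the following is a minimal Python sketch, not the authors' code. It assumes that |SMD| is the absolute difference between the sample and population means divided by the population standard deviation, and that equal-population strata are formed by cutting the population's propensity scores into k groups of equal size. The function names (abs_smd, strata_counts) and the toy numbers are hypothetical and are not taken from the Indiana data.

```python
from bisect import bisect_left
from statistics import mean, pstdev


def abs_smd(sample, population):
    """Absolute standardized mean difference between sample and population.

    Assumption: the difference in means is scaled by the population
    standard deviation (other scalings are possible).
    """
    sd = pstdev(population)
    if sd == 0:
        raise ValueError("population standard deviation is zero")
    return abs(mean(sample) - mean(population)) / sd


def strata_counts(sample_scores, population_scores, k):
    """Count sample units falling in each of k equal-population strata.

    Strata are defined on the population's scores so that each holds
    (roughly) the same number of population units.
    """
    ordered = sorted(population_scores)
    n = len(ordered)
    # Upper boundary (largest population score) of each of the first k-1 strata.
    cuts = [ordered[(i * n) // k - 1] for i in range(1, k)]
    counts = [0] * k
    for score in sample_scores:
        # A score belongs to the first stratum whose upper boundary is >= score.
        counts[bisect_left(cuts, score)] += 1
    return counts


# Hypothetical toy data: one covariate and one set of propensity scores.
population_covariate = [1, 2, 3, 4, 5]
sample_covariate = [4, 5]
population_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
sample_scores = [0.15, 0.18, 0.55, 0.95]

print(abs_smd(sample_covariate, population_covariate))
print(strata_counts(sample_scores, population_scores, k=5))
```

In the toy example, some strata receive no sample units at all, which illustrates how a small sample can limit the number of usable equal-population strata.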

Descriptors: Generalization, Program Effectiveness, Sample Size, Computation, Evaluation Methods, Regression (Statistics), Comparative Analysis, Randomized Controlled Trials, Benchmarking, Student Evaluation, Mathematics Achievement, Language Arts, Academic Achievement, Reading Achievement, Scores, Elementary Secondary Education, Scoring

Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; Fax: 202-640-4401; e-mail: inquiries[at]; Web site:

Author: Tipton, Elizabeth; Hallberg, Kelly; Hedges, Larry V.; Chan, Wendy


