Lessons Learned on Evaluating Innovative School Models
November 6, 2015

There’s nothing that has animated our work more over the last five years than designing a model that can reliably deliver measurable learning gains for kids.
The most recent independent evaluation from Professor Douglas Ready at Teachers College, Columbia University showed that students across all partnership schools, on average, made gains 47% above the national average. But while these results were deemed “quite promising,” the study stressed that the student outcomes “could not be attributed to Teach to One without the use of an experimental or quasi-experimental design.”
Fair enough. That’s how research works.
But how does a model provider get to the point where it can demonstrate student gains attributable to its innovations in schools it doesn’t actually manage?
Over the last five years, we’ve gained a deeper understanding of the nuances and complexities of rigorous evaluation. We hope our experiences (and missteps) in articulating results can help inform future model designers, as well as the broader field, on how best to measure impact.
A Simple Question?
The core question in evaluating a model like Teach to One: Math is seemingly simple:
How did participating students perform in comparison to how they would have performed without the model?
But answering this question fairly, accurately, and transparently can be unpredictably complicated for several reasons.
1. Which evaluation instrument to use?
State tests focus on the standards for students’ assigned grade level. Personalized learning models like Teach to One: Math focus on meeting students where they are and enabling them to accelerate their learning from that point. That means students may spend time during the school year working on skills and concepts from earlier and later grade levels.
As a result, grade-agnostic adaptive assessments like NWEA’s Measures of Academic Progress (MAP) are likely to reflect a more accurate picture of student growth than state assessments, particularly for students who enter well below or above grade level. However, state assessments based on grade-level standards are more widely used, are higher stakes, and are the basis for school (and teacher) accountability.
2. What exactly is the counterfactual?
Predicting how students would have performed in different learning environments is not easy. Comparison students with similar incoming attributes, such as proficiency levels or family income, are still learning at different schools with different teachers, curricula, and school contexts.
One way to establish a counterfactual in education evaluation is through randomization. Charter school lotteries, for example, provide a useful construct for comparing the outcomes of students admitted to a charter school with those of students who applied but were not admitted.
But when it comes to school models, randomization can be trickier:
Models are often implemented for all students within a school. School leaders may be less inclined to randomize which students receive the model and which students form the control group, given the disruption and the questions it can raise with parents.
It is also possible to randomize across schools: researchers pre-select a set of willing schools, pair off those that are most similar, and randomly assign one school in each pair to treatment (see the sketch below). But this requires districts with a significant number of willing schools (there are only so many of those), and it requires control schools to administer the same assessments and surveys as the treatment schools, often with little incentive to do so. It’s not an approach that’s easy to pull off.
Studies that use other ways to establish the counterfactual (e.g., quasi-experimental designs or comparisons to national averages) may be easier to conduct, but they are less likely to yield robust causal estimates.
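To make the school-level approach concrete, here is a minimal sketch of pairing similar schools and flipping a coin within each pair. The school records, covariates, and similarity measure are all hypothetical placeholders, not drawn from any actual study.

```python
import random

# Hypothetical school records; field names and values are illustrative only.
schools = [
    {"name": "School A", "enrollment": 420, "pct_proficient": 0.31},
    {"name": "School B", "enrollment": 450, "pct_proficient": 0.33},
    {"name": "School C", "enrollment": 610, "pct_proficient": 0.52},
    {"name": "School D", "enrollment": 590, "pct_proficient": 0.49},
]

def similarity(s1, s2):
    """Toy distance over two covariates; smaller means more similar."""
    return (abs(s1["enrollment"] - s2["enrollment"]) / 100
            + abs(s1["pct_proficient"] - s2["pct_proficient"]) * 10)

# Greedily pair off the most similar remaining schools.
remaining = schools[:]
pairs = []
while len(remaining) >= 2:
    base = remaining.pop(0)
    match = min(remaining, key=lambda s: similarity(base, s))
    remaining.remove(match)
    pairs.append((base, match))

# Within each pair, flip a coin to pick the treatment school.
random.seed(2015)  # fixed seed so the assignment is reproducible
for a, b in pairs:
    treated, control = (a, b) if random.random() < 0.5 else (b, a)
    print(f"Treatment: {treated['name']}  |  Control: {control['name']}")
```

Real studies would use richer covariates and more careful matching, but the basic logic is the same: similarity determines the pairs, and chance alone determines which school in each pair gets the model.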
3. What schools to count?
School model designs are often compromised when they meet the realities of actual students, teachers, and schools. Schedules change. Staffing numbers change. Student enrollment fluctuates. And teachers may simply make choices in how they use the model that run counter to its original design.
Some models may also enable schools to make choices that will impact overall results. We’re exploring ways to give partner schools the ability to determine where individual student learning progressions are aimed. Some will want to focus exclusively on academic growth; others on state tests; and still others may try to do a bit of both. Those options all have implications for overall results.
Disaggregating one group of schools from another can provide a more accurate picture of model effectiveness, but it can also look like cherry-picking the schools with optimal operating conditions.
4. When to do an evaluation?
We’ve been committed to evaluating and publicly disclosing the impact of our work since the very first implementation of School of One back in 2009. Nearly every program has undergone some form of third-party evaluation, and that transparency has helped us better understand our impact while building trust with key stakeholders (schools, funders, etc.).
But we’ve also learned that opening our work up to public, third-party evaluations early in the development of the model resulted in stale recommendations and some public relations risk. A major focus in our earliest years was on logistics and operations, as we tried to understand what it would take to bring personalized learning to life.
We used feedback from participating teachers, our own observations, and formative assessments as the basis for iteration. The quantitative results were less helpful because we had iterated on the model several times during the evaluation period. And qualitatively, by the time third-party evaluations were completed, most of their recommendations were old news and had already been addressed. As the design of Teach to One: Math continues to evolve, this remains true.
Sadly, some early evaluations can be damaging. Back in 2012, the Research Alliance for New York City Schools published an evaluation after one year of implementation of the first in-school implementation of School of One. The evaluators completed a thoughtful evaluation, highlighting the fact that, “given the early stage of [School of One]’s development and implementation and the limited number of schools that have piloted the program, this evaluation cannot reach definitive conclusions about [School of One]’s effectiveness.”
The NY Daily News had no interest in the caveats and instead published an article giving Joel Klein an “F” for one of his signature accomplishments. Several supporters published rebuttals in Forbes, Education Week, and the NY Daily News, and the study’s original author responded on WNYC. But some journalists cite that article to this day when writing about our work, even though far more current evaluation data is available.
Rigorous, third-party evaluation is critical to understanding what works. But model designers, funders, and policymakers should understand the implications of third-party evaluations done too early.
The Evolution of Results Reporting
Since launching New Classrooms, we’ve opted to publish annual reports (2013 Annual Report and 2014 Annual Report) with the state test scores and MAP scores for every grade in every school we serve. That’s in addition to the studies from Professor Ready at Teachers College from 2012-13 and 2013-14, which compared our students’ MAP gains to those of students nationally with similar starting points.
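To illustrate the kind of comparison those studies describe, here is a minimal sketch of measuring observed MAP gains against typical national gains for students with similar starting scores. The norm table, score bands, and student values below are hypothetical placeholders, not actual NWEA norms or New Classrooms data.

```python
# Hypothetical norm table: starting RIT band -> typical national gain (illustrative).
national_norm_gain = {
    (200, 210): 9.0,
    (210, 220): 7.5,
    (220, 230): 6.0,
}

# Hypothetical (fall score, spring score) pairs for a group of students.
students = [
    (204, 216),
    (213, 222),
    (221, 226),
]

def expected_gain(start_score):
    """Look up the typical national gain for a student's starting score."""
    for (low, high), gain in national_norm_gain.items():
        if low <= start_score < high:
            return gain
    raise ValueError(f"No norm band for starting score {start_score}")

observed = sum(spring - fall for fall, spring in students)
expected = sum(expected_gain(fall) for fall, spring in students)

# A figure like "gains 47% above the national average" is simply the ratio of
# observed growth to expected growth, minus one.
print(f"Observed gain relative to expected: {observed / expected - 1:.0%}")
```

The actual analyses are considerably more sophisticated, but the underlying idea is the same: each student’s growth is judged against what students nationwide with a similar starting point typically achieve.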
As we move into the next phase of our growth, we are beginning to explore how best to measure our impact and report on results. Here’s our current thinking:
We are currently in the midst of a three-year study, part of our most recent i3 grant with Elizabeth Public Schools, that will evaluate the effect of Teach to One: Math on student achievement. The study will use a quasi-experimental research design rather than a randomized controlled trial (RCT).
We’ll continue to report all state test and MAP data for every school in the network that uses Teach to One: Math as their core academic program.
We’ll also continue to aggregate MAP scores across the network, though we are exploring how best to do so in ways that are focused on schools whose implementations are designed to optimize student growth regardless of assigned grade level.
We are still trying to figure out whether and how best to aggregate results on state tests across the network, and would welcome feedback you may have along those lines. Feel free to send your ideas to info@newclassrooms.org.