Eyes Versus Numbers
Posted: September 20th, 2012 | Author: Michael Goldstein
Look at these numbers.
• The average number of X is down from 15.2 in 2011 to 14.7 in 2012.
• The average number of Y is 75 this year, 74 last year.
• Z is 23 this year and 21 in 2011.
• ZZ is up from 48 to 51.
So what do you think? Is there a large shift? Or do the numbers look about the same?
To me, they look about the same.
Now I’ll tell you what the numbers are: NFL officiating.
• The average number of penalties per game is down from 15.2 to 14.7.
• On player-safety calls, such as roughing the passer; unnecessary roughness, including hitting defenseless players; and face-mask or horse-collar violations, the calls are nearly even: 75 this year, 74 last.
• Instant replay reviews are way up, an increase of 16. But the percentage of reversals is way down: 23 this year out of 62 as opposed to 21 of 46 in 2011.
• Defensive pass interference and illegal contact penalties are up, but only from 48 to 51, surprising because of the hubbub raised on the airwaves about the lack of such calls.
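For anyone who wants to check the arithmetic behind "about the same," here is a minimal sketch that turns the counts in the list above into rates. The variable and function names are made up for illustration; the only data used are the figures quoted above.

```python
# Replay figures quoted above, as (reversals, reviews) per season.
replay = {"2011": (21, 46), "2012": (23, 62)}


def reversal_rate(reversals, reviews):
    """Share of replay reviews that ended in a reversal."""
    return reversals / reviews


def relative_change(old, new):
    """Change from old to new, as a fraction of old."""
    return (new - old) / old


for season, (reversals, reviews) in replay.items():
    rate = reversal_rate(reversals, reviews)
    print(f"{season}: {reversals} of {reviews} reviews reversed ({rate:.1%})")

# Penalties per game and pass-interference/illegal-contact calls, 2011 vs 2012.
print(f"Penalties per game: {relative_change(15.2, 14.7):+.1%}")
print(f"DPI and illegal contact: {relative_change(48, 51):+.1%}")
```

The reversal rate works out to roughly 46% in 2011 and 37% in 2012, while penalties per game move by only a few percent.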
I find this interesting.
NFL referees have been locked out by the league. So this year there are replacement refs.
The narrative among players, coaches, and fans is that the replacement refs are missing a lot of calls. But the replacement refs seem to issue penalties at roughly the same rate.
In fact, on this metric — how many instant judgements are overturned later by video replay — the replacement refs seem to be doing better than the regular refs. Their calls are reversed less often.
So who is right? The numbers or the observers?
This is relevant to whether teachers should be measured, in part, based on student gains on tests. The Gates-funded MET studies showed that trained adult observers — using various scorecards — were pretty bad at “evaluating” teachers.
Or more specifically, these observers struggle to tell the difference between teachers whose kids do very well on the exams, and teachers whose kids do quite badly. The numbers and the “eyes” don’t line up well.

Yeah, but some calls are not reviewable (like penalty flags), and non-calls have an impact on games too. In fact, there is much outrage about how games are getting out of control because of non-calls and bad calls.
Speaking as a teacher, fewer demerits and detentions may not yield a better result in the end. Or perhaps a better way to look at it is to think about whether the RIGHT calls are being made. If too many holding calls are being made (and anyone who knows football knows that there is holding on every single play), but they aren’t calling out face-masks, the total number of penalties will be the same.
I think the real issue at this point is that the sample size is too small. You can’t judge a classroom teacher’s overall success by what happens in September; same is true with the refs.
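One rough way to put a number on the sample-size worry is a two-proportion z-test on the replay figures from the post (23 reversals in 62 reviews this year, 21 in 46 last year). This is only a sketch: the choice of test is an assumption made here for illustration, it treats each review as an independent trial, and it says nothing about non-calls or unreviewable flags.

```python
import math


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided tail probability under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Reversals out of reviews: 2011 (regular refs) vs 2012 (replacement refs).
z, p_value = two_proportion_z(21, 46, 23, 62)
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

With counts this small, the p-value comes out well above the usual 0.05 cutoff, so the drop in the reversal rate could easily be noise.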
Here’s a question about the Gates study: Did they just look at whole group instruction? Or did they look at things like tutoring, one-on-one non-academic relationship building, other interactions, as well as planning, use of data, etc.? If they just looked at whole group, then it’s pretty easy to spot the issue…great teaching is about all those things in concert.
Mike,
My understanding of the research–not just the MET studies–is that observers’ ratings correlate fairly highly with VAM at the two tails of the distribution (very high performers and very low performers) but not well in the middle. But then again, I’m not up on all the latest studies. VAM itself is not good at distinguishing between middle-of-the-distribution teachers either, since the scores are very unstable outside the two tails.
The question of which is right (numbers or observers) is not a simple one. Any bit of data about teaching and learning (test score or observational score) is a very small sample from which people too often draw unreasonable inferences. The 30-40 items on a test serve as a proxy for a very broad domain of understanding.
I think in addition to thinking about the limitations of observations–and there are many–we also need to remember the limitations of testing. I heard from a friend that an education reformer of note arranged to observe the classes of a few of the organization’s teachers with the very highest VAM scores. What s/he saw really shocked him/her. A couple of the classes were truly, s/he felt, utterly joyless places. I know at MATCH, you do think a lot about the importance of joy. But of course, it’s not something that can show up in VAM.
Hi Paul,
The MET studies by Gates did not look at “Tutoring, one-on-one non-academic relationship building, other interactions, as well as planning, use of data.” Just normal classroom.
Does Brooke observe you doing the other teacher tasks? While I know of several schools that review teacher lesson plans, I don’t know of any that actually observe teachers providing the tutoring or doing the 1-on-1 relationship building.
Hi Ed,
The MET teacher observation correlations were in the 0.2 range to VAM. And lower, often, depending on rubric. The Ferguson “student surveys” were closer to 0.4 correlation with VAM, which surprised many.
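To give a feel for what correlations of that size mean, here is a small sketch using the standard rule that the square of a correlation is the share of variance the two measures have in common. The 0.2 and 0.4 values are the approximate figures mentioned above, not exact MET results, and the function name is made up for illustration.

```python
def shared_variance(r):
    """Fraction of variance two measures share, given their correlation r."""
    return r ** 2


# Approximate correlations with VAM mentioned above.
for label, r in [("classroom observations", 0.2), ("student surveys", 0.4)]:
    print(f"{label}: r = {r}, shared variance = {shared_variance(r):.0%}")
```

So a correlation of 0.2 corresponds to about 4% shared variance, and 0.4 to about 16%.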
Nah, of course not, at least not formally. Do you guys do that? It seems like tutoring is such a huge part of your model that you would want a lot of accountability in those settings.
But let’s say someone came to my class…I would hope they would say I was good at giving a whole class lesson. TNTP thought so last year, and the data from MCAS after the fact backs it up (100% A+P; 66% A). Or does it? What percentage of those great results are a product of whole class instruction and what percentage happened in other settings? Can’t say. But I can say I was proud of my whole class work last year.
But this year (to a greater extent than in recent years) I’m finding that my kids are not leaving class where I want them. But I wonder if an observer would know that. Behavior isn’t off, I’m doing a better job delivering the point, kids seem engaged, work in class isn’t terrible – but exit tickets and/or HW are atrocious, so clearly I haven’t done a good enough job teaching them.
What would the observer see though? Perhaps they might see a greater level of frustration, but perhaps not.
There are a million reasons why this is happening – new curriculum, different group of kids, etc. If I weren’t doing all the other stuff outside of class, I’m not sure most of them would be learning even the scant amount they are getting. That’s not where I want them…I think tutoring should be a stop-gap, not where all of the work is happening. I want my whole class lessons to get at least 80% of them over the hump. Right now, it’s more like 20%.
I’ve been practicing the non-whole class stuff for almost a decade now, so I’m good at it and I trust that I can get the kids where they need to be by the end of the year even if I can’t turn around the whole class part. But that’s not good for the kids and it’s not good for me.
I think this post is holding up only slightly better than your David Ortiz analysis.