What does lesson observation research actually say?

Buffy took seven complete series before the First Evil was finally defeated when Spike’s amulet channeled the power of the Sun into the Hellmouth and Sunnydale High School collapsed into a hole that makes the VW swallowing Buckinghamshire effort seem pretty tame. It’s looking as though it might take more like a mere seven months for the research on reliability of lesson observations, unleashed by Rob Coe, to do the same for the graded lesson observations that have stalked the corridors of our own schools, devouring innocent teachers, for many a year.

I have never believed that teacher effectiveness could be judged on three graded lesson observations per year; I cannot see how Ofsted inspectors can believe that the teaching charade they view during an inspection gives them much useful information about the quality of teaching and learning in a school; and I think that basing PRP decisions on individual lesson observations comes close to breaking employment law. I will be happy to see these worst excesses of the system swept away, and if that’s the end of graded observations entirely, well, maybe it’s a price worth paying. But if we want to measure teacher effectiveness (I’ll leave the argument about whether we do or not for another time), how are we going to do it now?

My first suggestion is that, if the MET project that Coe has been referring to is good enough research to justify binning graded lesson observations then, given that MET stands for Measures of Effective Teaching, it should be good enough research to suggest how we might validly and reliably measure just that. The culminating findings make the following points:

  • It is definitely possible to measure teacher effectiveness. Teachers were assessed and then pupils were assigned randomly and the earlier assessment was used to predict student outcomes. Those teachers who had been identified as more effective did have better student outcomes on average.
  • There are some subtleties to this, however. My interpretation is that, for any one individual teacher, student achievement gains in a particular year may not match their assessed level of effectiveness. That means there is no guarantee that a teacher identified as particularly effective will not have a year with poor outcomes, but the original measurement of effectiveness is solid.
  • “Estimates of teachers’ effectiveness are more stable from year to year when they combine classroom observations, student surveys, and measures of student achievement gains than when they are based solely on the latter.” I presume this is because of the noise in the student achievement gains.

So this leaves me thinking that, if we want to assess teacher effectiveness, we can do so, using a combination of VA, student surveys, and lesson observations. It’s tempting to think that having several years of VA data would average out the noise and make that the stand-out indicator, but it’s crucial to realise that the whole point of the randomisation in this research was that, without it, there was no way to decide whether differences in student outcomes were due to teachers or due to other factors – in other words, other factors do matter. This research definitely does not suggest that we can just ignore which classes a teacher has worked with and rely on VA.
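To make the "noise" point concrete, here is a toy simulation. It is a sketch of my own and not anything taken from the MET reports: the noise levels in NOISE_SD are invented, the function names are hypothetical, and it assumes all three measures track one underlying "true" effectiveness with independent errors, which is a big simplification. Under those assumptions it compares how well year one predicts year two for value-added alone against the average of all three measures.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# ASSUMED noise levels (standard deviations), invented for illustration only.
NOISE_SD = {"value_added": 1.5, "observation": 1.5, "survey": 1.0}


def year_to_year_stability(n_teachers=2000, seed=0):
    """Return (r_value_added_only, r_composite) across two simulated years."""
    rng = random.Random(seed)
    va = ([], [])         # value-added scores for year 1 and year 2
    composite = ([], [])  # mean of all three measures for each year
    for _ in range(n_teachers):
        # Hypothetical fixed "true" effectiveness for this teacher.
        true_effect = rng.gauss(0, 1)
        for year in (0, 1):
            # Each measure is the true effect plus its own independent noise.
            scores = {
                name: true_effect + rng.gauss(0, sd)
                for name, sd in NOISE_SD.items()
            }
            va[year].append(scores["value_added"])
            composite[year].append(statistics.fmean(scores.values()))
    return pearson(*va), pearson(*composite)


if __name__ == "__main__":
    r_va, r_comp = year_to_year_stability()
    print(f"value-added alone: {r_va:.2f}, composite: {r_comp:.2f}")
```

With these made-up numbers the composite should come out more stable from year to year, simply because independent errors partly cancel when averaged. Note what the toy model leaves out: it has no non-random allocation of classes, so it says nothing about the bias that the MET randomisation was designed to deal with.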

As soon as I start to think about transferring all this to a typical English school, with its busy teachers and SLT, small and sometimes imperfect data-sets, and varied classes, I find myself in strong agreement with Tom Sherrington’s blog post “How do I know how good my teachers are?” I don’t think there will ever be a perfect measure but we can have a pretty good stab at it. And lesson observations are part of this.

My second question is about what that research actually says about the reliability of graded lesson observations. Coe’s figures have been widely circulated. I’m not going to dispute them but I am going to query whether it’s possible to generalise those findings to our current system and comment on what the MET project says about making observations (which they are suggesting are important in assessing teacher effectiveness) more reliable. That’s for another post, coming soon.


Graded Lesson Observations: Defibrillation or a Stake through the Heart?

An observer enters your classroom. Is this person your HoD, the assistant head with responsibility for T&L, an Ofsted inspector, or a demon who has occupied a corpse and is coming to suck your blood? A fair number of commentators have recently suggested the latter and have been sharpening words, and presumably a variety of sticks, with a view to dispatching said vampires to the demon dimensions. Like Rupert Giles, Robert Coe from Durham University CEM (possibly a pseudonym for the Watchers Council) has been quietly dispensing the wisdom of the ancients (sorry, academics), guiding the Slayers in their quest. But is the graded lesson observation really the personification of evil, or does it have a soul worth saving?

Wilshaw’s Westminster Education Forum speech on 7th November 2013 included the line: “Which ivory towered academic, for example, recently suggested that lesson observation was a waste of time – Goodness me!” Does Wilshaw need to pay more attention to the ivory towered ones? Is his organisation trying to perform a task as fundamentally uncertain as measuring the combined momentum and position of a sub-atomic particle; is it engaged in a legitimate assessment technique but doing it in a slightly crap way; or is the Ofsted Christmas party actually a masquerade ball of orgiastic hedonism where innocent teachers are dragged to be ripped asunder in a feeding frenzy of unimaginable gore?

In ITT, observations are a big part of how we assess the progress of trainees. It doesn’t feel as though the judgements we make are unreliable; over the course of a number of observations, we would feel confident that an accurate picture of a trainee’s teaching was being drawn. Are we deluding ourselves when we reflect on this practice; are we even capable of reflection…

If you pick up Robert Coe’s blog entry on this you’ll see that he is linking to two pieces of research. The first is the massive (and massively well-funded – thanks Bill & Melinda) MET project. Now, I make no claims to either the academic clout of Robert Coe or expertise in this area, but reading the MET policy and practice brief I can see where Coe’s figures are coming from, but not his conclusion that observations are unreliable to the point of worthlessness as a measure of teacher performance. The MET project seems to me to be making suggestions about how to improve the reliability of observations, not concluding that they are good only for a staking. Of course, like Wilshaw, anyone involved in a project called “Measures of Effective Teaching” may be somewhat biased towards the idea that it is actually possible to measure such a thing, and continued research funding may even depend on that outcome, but the MET project is looking at a range of ways to measure teacher effectiveness and I can’t see why, if they were looking at data that suggested observations were a waste of time, they wouldn’t say so and recommend a system based on other measurement methods.

Strong, Gargani & Hacifazlioğlu (2011) is the other piece of research. It’s behind a paywall but for good papers there’s often an academic somewhere who has breached their institution’s copyright rules and posted it somewhere helpful. In interpreting the results, it’s important to appreciate that, of the three experiments, two involved judging teachers on the basis of two-minute clips of whole-class teaching (chosen to avoid any behavioural management incidents!). The third experiment did involve observations of videos of whole lessons, but using a complex observational protocol – the CLASS tool – that seems to weight student engagement and various other, dare I say it, constructivist ideals quite strongly. Coe is right to state that the ability of observers to pick good teachers in these experiments was in the same league as Buffy’s ability to pick good boyfriends, but he leaves out a crucial point which I think I’d better quote.

“This analysis showed that a small subset of items produced scores that accurately identified teachers as either above or below average. All of these items were from the instructional domain. They included clearly expressing the lesson objective, integrating students’ prior knowledge, using opportunities to go beyond the current lesson, using more than one delivery mechanism or modality, using multiple examples, giving feedback about process, and asking how and why questions.”

The final point made in the paper is that “This… has motivated us to undertake development of an observational measure that can predict teacher effectiveness.”

So I’m not sure that Coe has it right on this evidence. Yes, we all (ITT, Ofsted, and school leaders) need to recognise that sloppy observation procedure and training will lead to meaningless judgements. Yes, using graded observations for staff development may be a bit like burning witches to improve their chances at the last judgement. Yes, value-added data may be a better, or even the best, method for judging the effectiveness of a teacher and/or their teaching. But, in ITT where value-added data does not exist, I think my colleagues and I really ought to be bringing some of the academic clout of our Faculty to bear on using research like this to develop a model for lesson observation that delivers reliable outcomes. I’ll let you know how we get on, and give you a shout if we need any stake holders.