A Statistical Battleground

Given that I’ve just launched into a deeper reading of John Hattie’s book, Visible Learning, I’m taking a keen interest in the current statistical battle rumbling away on Twitter. I’ve come across two bloggers, both as far as I can tell with statistical backgrounds, who are making the strongly-worded point that the Effect Size is a statistical technique being applied incorrectly in education and other social science research.

According to @Jack_Marwood “The Effect Size should be used to check whether an experiment will have enough data from which to draw valid conclusions before the experiment takes place. In most educational research, the Effect Size is used to compare different methods of teaching or outcomes of a change in educational process. There is no justification for this.”

The view of @OllieOrange2 is that “Mathematicians don’t use it”.

Whether this is damning or not is debatable. From the arguments, as far as I can follow them, it seems as though using the Effect Size to work out the relative difference between the means of two data sets might not achieve the level of perfection mathematicians strive for, but that with some provisos, it is roughly what education research wants to know. It would be really helpful if a whole bunch of independent statisticians weighed in with an informed opinion (one that looked at the big picture rather than obsessing over the mathematical niceties) but since that’s not very likely, I think there is a fair weight of academic thought in favour of the Effect Size as an imperfect but useful measure. I would also claim to have followed the statistical arguments as far as they go (although I wouldn’t presume to be able to spot errors in the presentation of these) and my view is that the case for the provisos is pretty clear but the case for the damnation of the Effect Size has not been sustained. So then this becomes a list of those provisos.

First there is the difference between the means in a pre-, post-test design, and the difference between the means in an intervention and control group design. OllieOrange2 highlights this issue. I can see that if the time scale is quite long then this matters because the effect size of a programme evaluated over a year that had zero influence on achievement could be either about 0.40 (typical for one year of schooling) or 0.00 depending on which methodology is used. OllieOrange2 is implying that (a) this will be a big discrepancy for lots of studies and (b) Hattie hasn’t noticed the difference. However, I can’t see how a pre-, post-test design over a year could show anything unless it was compared to a control or there was some kind of regression analysis to pull out the effect of the variable being studied from the background, so I’m not sure it’s a problem. For short-scale studies the two methodologies would converge. It would be nice to know whether or not Hattie had thought about this though.
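To make the 0.40-versus-0.00 point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the scores are made up, the `cohens_d` helper is my own name for a pooled-SD standardised mean difference (the posts under discussion don't say which variant of the Effect Size they have in mind), and the "year of schooling" is modelled crudely as the same gain for every student.

```python
from statistics import mean, stdev

def cohens_d(treated, comparison):
    """Standardised mean difference (treated minus comparison), pooled SD."""
    n_t, n_c = len(treated), len(comparison)
    pooled_var = ((n_t - 1) * stdev(treated) ** 2
                  + (n_c - 1) * stdev(comparison) ** 2) / (n_t + n_c - 2)
    return (mean(treated) - mean(comparison)) / pooled_var ** 0.5

# Made-up pre-test scores for two groups with the same mean (50).
intervention_pre = [35, 45, 50, 55, 65]
control_pre = [38, 44, 50, 56, 62]

# Assumption: a normal year of schooling lifts every student by 0.40 of the
# intervention group's pre-test SD, and the programme itself adds nothing.
yearly_gain = 0.40 * stdev(intervention_pre)
intervention_post = [score + yearly_gain for score in intervention_pre]
control_post = [score + yearly_gain for score in control_pre]

# Same programme, same students, two different study designs.
d_pre_post = cohens_d(intervention_post, intervention_pre)
d_vs_control = cohens_d(intervention_post, control_post)

print(f"pre/post design:         d = {d_pre_post:.2f}")
print(f"intervention vs control: d = {d_vs_control:.2f}")
```

With these numbers the pre-, post-test comparison comes out at about 0.40 and the comparison with the control group at about zero, even though the programme did nothing in either case. Shrink `yearly_gain` towards zero (a short study) and the two figures converge.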

In reading Visible Learning I think that a bigger issue may be the difference between research evaluating interventions, and research comparing pre-existing situations. Hattie is very clear that “almost everything works” and uses the mean of all the Effect Sizes from all the meta-analyses to state that 0.40 is generally the bar that should be set before influences are judged. This is a Hawthorne Effect at work where students, and perhaps more significantly, teachers respond to the novelty of the intervention by pulling a few stops out. It makes a lot of sense for an intervention but some influences are different. As a blatant example, the researchers correlating birth weight with achievement cannot possibly have influenced any embryos, and if the achievement data was taken from existing records (as it presumably was) then they cannot have influenced the achievement scores either. So for birth weight, the full 0.54 applies – there isn’t some kind of 0.40 Hawthorne Effect – and this applies to a number of other influences too. The two types of influences are not separated, and the difference doesn’t seem to be mentioned in Visible Learning.

The age of the students being studied makes a big difference. The graph in this post shows this very clearly, but I quite like the height example I’ve just thought up – the Effect Size of 6″ heels on height is bigger for someone 4’6″ compared to someone 5’8″ (reference to height comes from Hattie but the high heels are all mine). Hattie clearly synthesises meta-analyses without adjusting for this. Possibly there is enough random variety of student ages in the original studies to compensate a bit but it’s a clear limitation of Hattie’s work.

Homogeneity of the students being studied is also significant. This is because the Effect Size is relative to the SD so if the students are closer in achievement then the SD is smaller so the Effect Size becomes bigger. This again is a clear limitation, particularly where the original studies by their nature focused on restricted groups.
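A quick sketch of the homogeneity point, again in Python and again with made-up scores. The `cohens_d` helper is a hypothetical name for a pooled-SD standardised mean difference, and the identical 5-point gain for every student is an assumption chosen to keep the arithmetic obvious.

```python
from statistics import mean, stdev

def cohens_d(treated, comparison):
    """Standardised mean difference (treated minus comparison), pooled SD."""
    n_t, n_c = len(treated), len(comparison)
    pooled_var = ((n_t - 1) * stdev(treated) ** 2
                  + (n_c - 1) * stdev(comparison) ** 2) / (n_t + n_c - 2)
    return (mean(treated) - mean(comparison)) / pooled_var ** 0.5

raw_gain = 5  # the same raw improvement, in test points, for both studies

# A mixed-ability group: scores spread widely around 50.
mixed_control = [30, 40, 50, 60, 70]
mixed_treated = [score + raw_gain for score in mixed_control]

# A restricted group (say, one set): scores bunched tightly around 50.
narrow_control = [45, 48, 50, 52, 55]
narrow_treated = [score + raw_gain for score in narrow_control]

d_mixed = cohens_d(mixed_treated, mixed_control)
d_narrow = cohens_d(narrow_treated, narrow_control)

print(f"mixed-ability group: d = {d_mixed:.2f}")
print(f"restricted group:    d = {d_narrow:.2f}")
```

The identical 5-point gain gives an Effect Size of roughly 0.32 in the mixed group and roughly 1.31 in the restricted group, purely because the SD in the denominator is smaller.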

Dylan Wiliam has made the point that trying to alter something that responds well to teaching will tend to produce a larger effect size. Having said that, if working hard at something doesn’t have much effect because it doesn’t respond well to teaching, that’s quite a good reason for leaving it alone and spending the effort on something more effective. Given the calibre of the author I may be missing something, but his argument is set out in the lengthy comment quoted in this post (and the reply in the comments), and in this .ppt from the Presentations page on his website. I’m sure I have also read a full paper on this but I can’t find it at the moment.

Before finishing, I think this paper (that I can’t find) on the limitations of the Effect Size is probably the best criticism I’ve read. It seems more balanced than the recent blog posts and less focused on mathematical issues which may not matter too much at the level at which education research operates. In contrast, Rob Coe’s CEM and EEF briefings describe the advantages of using the Effect Size.

As I’ve been writing this short post, to clarify my thinking, and maybe take stock of a project that might be a waste of time, I’ve stumbled across several other relevant pieces on the subject. Neil Brown’s review of Visible Learning is good, and he also reported briefly on Robert Coe’s ResearchEd2013 debate with OllieOrange2. This post by Leafstrewn is an early criticism and references the “Norwegian debate”, which is reported in this (by me) unpronounceable post. EvidenceIntoPractice is always a mine of useful information on research, and its post on issues with meta-analyses is no exception.

And subsequent to this post, there has been a very important and potentially significant debate on using Value-Added Models to measure educational effectiveness. I think a lot of the research on which Visible Learning is based will be using different ways of assessing outcomes, partly because VAM are quite a new approach. However, I think it’s worth flagging up here as it’s clearly related, particularly to current work like the EEF projects. The most comprehensive review against using VAM as an evidence-base for policy I’ve seen is a Washington Post article, with a perhaps more politically-aware statement from the American Statistical Association.
