## Sunday, February 03, 2013

### Regression to the Mean Is Not an Explanation!

Daniel Kahneman treats regression to the mean as a form of explanation. (See *Thinking, Fast and Slow*, pp. 178-183.) He also says that when we see regression to the mean, what we are seeing "does not have a causal explanation" (p. 178).

I say both these contentions are nonsense. (Once again, let me put in my usual caveat: I greatly admire Kahneman's work as an experimental psychologist. But here he is doing philosophy of science: he has left his area of expertise and is forwarding ideas which [I contend] he cannot defend, and which certainly cannot be decided by any experiment. And I will also note that citing regression to the mean is fine as a way of justifying a prediction.)

Why is regression to the mean not an explanation of any empirical fact? Because it is a tautology, and tautologies never explain empirical phenomena. In particular, in this case, regression to the mean always holds, because something won't be the mean unless it is regressed to. Let us say an NBA player begins his career shooting 20% on three pointers. Then he "gets hot" and begins shooting 40%. If he later falls back to 20% shooting, then he has regressed to the mean. But if he continues shooting 40% from then on, for many years, he simply will move his mean: and subsequent deviations from that shooting percentage will regress to the new mean, or... they will again move the mean.

When incomes began to rise at the beginning of the Industrial Revolution, someone (Malthus, perhaps) might say, "Fine, but they will regress to their mean." Two hundred years later, we simply have a new mean for incomes. And now, we can say incomes will regress to that mean... or a new mean will be established. It is simply contained in the definition of a mean that either events "regress" to it, or it will no longer be the mean. Nothing at all is explained about what is going on by citing "regression to the mean."

Further, seeing a regression to the mean does not mean there is no causal explanation for what is occurring. If the 20% shooter regresses to 20%, the explanation is, "He is only skilled enough as a three-point shooter to hit 20% on his threes in the NBA, although he may have streaks of 'good luck' in which he shoots better." (And if he doesn't, the explanation is, perhaps, "He worked on his shot a lot." Or, perhaps, "He had a bad streak to start out, but he was always skilled enough to shoot 40%.")
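The first of those explanations can be put in the form of a small simulation. What follows is only a sketch under stated assumptions: every simulated player has the same fixed skill of 20%, and the sample sizes, the 40% "hot" cutoff, and the name `simulate_shooters` are all made up for illustration. It picks out the players who happened to start hot and then looks at how they shoot afterwards.

```python
import random

def simulate_shooters(n_players=5000, skill=0.20, early=20, later=200,
                      hot_cutoff=0.40, seed=0):
    """Return (early share, later share) averaged over players who started hot.

    Assumption: every player has the same fixed skill, so any hot start
    is luck and nothing else.
    """
    rng = random.Random(seed)
    hot_early, hot_later = [], []
    for _ in range(n_players):
        early_made = sum(rng.random() < skill for _ in range(early))
        if early_made / early >= hot_cutoff:
            # Only the players who "got hot" are followed up.
            later_made = sum(rng.random() < skill for _ in range(later))
            hot_early.append(early_made / early)
            hot_later.append(later_made / later)
    if not hot_early:
        return None  # nobody happened to start hot in this run
    return (sum(hot_early) / len(hot_early),
            sum(hot_later) / len(hot_later))

result = simulate_shooters()
if result is not None:
    early_share, later_share = result
    print(f"hot starters, early attempts: {early_share:.3f}")
    print(f"same players, later attempts: {later_share:.3f}")
```

In this toy setup the later share sits near 20% because that is the skill written into the simulation. The fall back toward 20% is produced by the fixed skill plus luck, which is the causal story given above.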

Kahneman's experimental work is fantastic: I absolutely love his cleverness at devising experiments to unearth the facts he presents. But the philosophical extrapolations he makes from that work are often faulty.

1. "Why is regression to the mean not an explanation of any empirical fact? Because it is a tautology, and tautologies never explain empirical phenomena. In particular, in this case, regression to the mean always holds, because something won't be the mean unless it is regressed to. Let us say an NBA player begins his career shooting 20% on three pointers. Then he 'gets hot' and begins shooting 40%. If he later falls back to 20% shooting, then he has regressed to the mean. But if he continues shooting 40% from then on, for many years, he simply will move his mean: and subsequent deviations from that shooting percentage will regress to the new mean, or... they will again move the mean."

No, that's not what is meant (I feel like I can speak to this since I've been reading the phrase in this context since I was 16 years old). What is meant is either that players regress to the average of all players or regress to the average of all players of their type. For instance, in baseball if a catcher has a season where he plays all 162 games in his age 36 season, "regressing to the mean" means we should expect him to play far fewer games the following season. In the sense of statistical expectation, it has a ton of empirical content. All player projection systems exhibit a significant component of regression to the mean and they are more accurate for it.

1. Ryan, I took a single player as a simple example. Of course I understand the concept in terms of populations too -- see my remarks on average income. And everything I say applies just the same to that situation.

"All player projection systems exhibit a significant component of regression to the mean and they are more accurate for it."

You missed the part where I said, "Of course, this is very important for predictions"?!

2. So, I write "Regression to the mean is not an *explanation* of any empirical facts. But of course, it is very important for predicting them."

Ryan writes back, "Boy are you wrong, Gene! Regression to the mean is very important for predicting empirical facts!"

3. I flip a coin ten times and it comes up heads nine times. I flip it 90 more times and in all it comes up heads 54 times. Am I not explaining anything if I say that is because of regression to the mean?

4. Of course not. That is just a tautology, as I note above! You might be explaining something if you spoke of a tendency inherent in the coin and the act of flipping to produce roughly half heads and half tails. That tendency explains both the mean and the regression to the mean.
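The arithmetic of the coin example can be checked directly. This is a minimal sketch: the counts come from the comment above, while the variable names, the fair-coin probability of 0.5, and the helper `simulate_later_share` are assumptions added for illustration.

```python
import random

# Figures from the comment: 9 heads in the first 10 flips, 54 heads in all 100.
first_heads, first_flips = 9, 10
total_heads, total_flips = 54, 100

later_heads = total_heads - first_heads   # 45
later_flips = total_flips - first_flips   # 90

early_share = first_heads / first_flips      # 0.9
later_share = later_heads / later_flips      # 0.5
overall_share = total_heads / total_flips    # 0.54

def simulate_later_share(n_flips=90, p_heads=0.5, seed=0):
    """Share of heads in a fresh batch of flips of an (assumed) fair coin."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p_heads for _ in range(n_flips))
    return heads / n_flips

print(early_share, later_share, overall_share)
print(simulate_later_share())
```

The later 90 flips came up heads exactly half the time. Nothing pulled them back toward 50%: the early streak was simply diluted by ordinary flips.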

"Regression to the mean" certainly did not *cause* your coin to behave as it did: it is a statistical tautology, not a causal power!

5. Okay, here is a different example. One early metric in sabermetrics was Catcher Earned Run Average, or CERA. What you did was measure the ERA of the pitchers when the catcher was catching. This was supposed to capture game calling ability, pitch framing, etc. However, it was quickly determined by others that the year-to-year persistence in the statistic was entirely an artifact of the persistence in the quality of pitchers. When you controlled for that, the year-to-year persistence disappeared. Some catchers still did better than other catchers in individual years after controlling for it, but that wasn't because of skill. It wasn't because of anything. It was sheer random disturbance. There is no reason to ask "why" or "what caused" it, because there was no causation to speak of. The catchers just happened to catch better quality starts than what those pitchers normally pitched.

**Note that with different methods in the last two years they've been able to better measure catcher defense, but CERA was always wrong.
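The CERA pattern described above can be mimicked with a toy model. This is a hedged sketch, not a reconstruction of any actual sabermetric study: the ERA figures, the noise levels, the number of simulated catchers, and the names `pearson` and `simulate_cera` are all invented for illustration. Each catcher's raw CERA is a persistent pitching-staff quality plus independent yearly noise, with no catcher skill at all.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate_cera(n_catchers=5000, seed=1):
    """Return (raw, adjusted) year-to-year correlations of simulated CERA."""
    rng = random.Random(seed)
    raw1, raw2, adj1, adj2 = [], [], [], []
    for _ in range(n_catchers):
        staff = rng.gauss(4.0, 0.5)   # persistent staff quality (assumed values)
        noise1 = rng.gauss(0.0, 0.3)  # year-1 random disturbance
        noise2 = rng.gauss(0.0, 0.3)  # year-2 random disturbance, independent
        raw1.append(staff + noise1)
        raw2.append(staff + noise2)
        # "Controlling for" pitcher quality leaves only the noise.
        adj1.append(noise1)
        adj2.append(noise2)
    return pearson(raw1, raw2), pearson(adj1, adj2)

raw_corr, adj_corr = simulate_cera()
print(f"raw CERA year-to-year correlation:      {raw_corr:.2f}")
print(f"adjusted CERA year-to-year correlation: {adj_corr:.2f}")
```

Under these assumptions the raw statistic looks persistent from year to year, and that persistence vanishes once staff quality is removed.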

2. This is one of those situations where you can shift the error to wherever you please.

In the baseball situation, for example, I could say that the more correct thing to look at would be the statistics within a season, because the assumption behind "regression to the mean" is that there is a Gaussian distribution whose mean we already know.

I would call the problems you point to a confusion of different means. But it's a matter of taste.