Variance, the misleader!

Hello World,

Whenever you look at a phenomenon and want to know how variable it is, you probably reach for the variance. It’s a natural measure of dispersion, but it has a number of shortcomings that might mislead you and make you misinterpret your data.

First of all, unless the phenomenon you look at is normally distributed, variance is but one of the many possible ways of assessing dispersion. It is the second-order moment about the mean of the distribution, and in the normal distribution, all central moments (moments about the mean) are functions of the variance. For example, the fourth central moment is 3 times the squared variance (which is why the kurtosis of a normal distribution is 3). When the distribution is not normal, all moments of even order are measures of dispersion, and they are not necessarily related to one another. So in a non-Gaussian phenomenon, you may have a small variance but disproportionately large chances of observing very large deviations from the mean… That changes your perception of the variability, wouldn’t you say?
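
To make that concrete, here is a minimal sketch. The "spiky" distribution below is a made-up example (it is 0 almost all the time and ±10 once in a hundred draws), and `central_moment` is just a helper name chosen for illustration. It has the same variance as a standard normal, yet a very different fourth central moment and very different tails.

```python
import math

def central_moment(values, probs, order):
    """Exact central moment of a discrete distribution."""
    mean = sum(v * p for v, p in zip(values, probs))
    return sum(p * (v - mean) ** order for v, p in zip(values, probs))

# Hypothetical "spiky" phenomenon: usually 0, rarely +10 or -10.
values = [-10.0, 0.0, 10.0]
probs = [0.005, 0.99, 0.005]

spiky_var = central_moment(values, probs, 2)  # about 1
spiky_m4 = central_moment(values, probs, 4)   # about 100

# A normal distribution with the same variance has a fourth
# central moment of 3 * variance**2.
normal_m4 = 3 * spiky_var ** 2

# Probability of landing at least 5 standard deviations from the mean.
threshold = 5 * math.sqrt(spiky_var)
spiky_tail = sum(p for v, p in zip(values, probs) if abs(v) >= threshold)
normal_tail = math.erfc(5 / math.sqrt(2))  # P(|Z| >= 5), Z standard normal

print(f"variance:           spiky={spiky_var:.3f}  normal={spiky_var:.3f}")
print(f"4th central moment: spiky={spiky_m4:.1f}  normal={normal_m4:.1f}")
print(f"P(|dev| >= 5 sd):   spiky={spiky_tail:.2e}  normal={normal_tail:.2e}")
```

Same variance, but the spiky phenomenon throws up "5 standard deviation" events thousands of times more often than the normal one would.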

Second, and even more often ignored than the first point, variance is the second-order moment about the mean… and in a sample its estimate might, well, mislead you. Indeed, even if you’re careful enough to use an unbiased estimator for variance (the sum of the squared differences between each point in the sample and the sample mean, divided by “the number of points in your sample minus one”), that mean may itself be varying!
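
For reference, the estimator described in that parenthesis can be sketched as follows. The sample values are invented for illustration, and `unbiased_variance` is a hypothetical helper name; Python's standard `statistics.variance` uses the same "n minus one" divisor.

```python
import statistics

def unbiased_variance(sample):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - 1)

# Made-up sample, purely for illustration.
sample = [2.0, 4.0, 4.0, 5.0, 7.0]

by_hand = unbiased_variance(sample)
from_stdlib = statistics.variance(sample)  # also divides by n - 1

print(by_hand, from_stdlib)
```

Note that everything here hinges on one single sample mean, which is exactly the assumption that breaks when the mean drifts.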

As soon as you suspect that what you observe could be a function of time and/or another variable (time series, most economic observations, and so on), you might well be misled by variance. Think of a phenomenon where the response you observe is a linear function of the time elapsed since a certain initial time. Even if your responses are perfectly aligned (one might consider them as being exactly at their mean, every time, but with a mean which changes linearly with time), you’ll still get a sample mean (the value of the phenomenon at about the “middle time”) and a nonzero variance. The steeper that time-change is, the bigger the variance will be: it grows with the square of the slope. But still, there’s no dispersion at all about “where the observations should be”!
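
A small sketch of that situation, assuming eleven equally spaced observation times and two arbitrary slopes (all the numbers and the `line` helper are invented for illustration). Both series sit exactly on their line, so there is nothing random in them at all.

```python
import statistics

# Hypothetical observation times 0, 1, ..., 10.
times = list(range(11))

def line(slope, intercept=2.0):
    """Noise-free responses lying exactly on a straight line."""
    return [intercept + slope * t for t in times]

gentle = line(1.0)
steep = line(3.0)

var_gentle = statistics.variance(gentle)
var_steep = statistics.variance(steep)

print(f"slope 1: sample mean={statistics.fmean(gentle)}, variance={var_gentle}")
print(f"slope 3: sample mean={statistics.fmean(steep)}, variance={var_steep}")
print(f"ratio of variances: {var_steep / var_gentle}")
```

Tripling the slope multiplies the variance by nine, even though neither series has any scatter around its own line.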

The same goes, in a less obvious way, when the relation is not linear, or when the second variable is not time but another (often poorly measured) phenomenon. You may have almost no dispersion about the “good values” and still a huge variance.

What to do, then? Well, the first step would be to find a decent model of your data, possibly by way of least-squares regression (you regress your observations on time, or on the other variables you suppose are at play), linear or nonlinear. Polynomial approximations may help you if the linear model is not satisfactory. Then you can use (1 – R²) times the variance as a decent estimate of how intrinsically dispersed/variable your variable really is, R² being the coefficient of determination of your regression. The variance times R² is the part of the variability that comes from the “changing means” and really shouldn’t be called “variance”.
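
Here is one way the recipe could look for the simplest case, a straight-line fit on time. It is only a sketch: the data are invented (a line with slope 3 plus a small deterministic wiggle of ±0.5), and `fit_line` is a hand-rolled ordinary least-squares helper written for this example, not a library function.

```python
import statistics

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up data: a linear trend in time plus a small alternating wiggle.
times = list(range(10))
wiggle = [0.5, -0.5] * 5
observed = [2.0 + 3.0 * t + e for t, e in zip(times, wiggle)]

slope, intercept = fit_line(times, observed)
fitted = [intercept + slope * t for t in times]

# Coefficient of determination of the regression.
mean_y = statistics.fmean(observed)
ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))
ss_tot = sum((y - mean_y) ** 2 for y in observed)
r_squared = 1 - ss_res / ss_tot

total_variance = statistics.variance(observed)
intrinsic = (1 - r_squared) * total_variance  # dispersion about the trend
trend_part = r_squared * total_variance       # due to the "changing mean"

print(f"R^2              = {r_squared:.4f}")
print(f"raw variance     = {total_variance:.2f}")
print(f"(1 - R^2) * var  = {intrinsic:.3f}")
print(f"R^2 * var        = {trend_part:.2f}")
```

On this toy series the raw variance is dominated by the trend, while the (1 – R²) part stays small, in line with the ±0.5 wiggle that was actually put in.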

That’ll hopefully help you avoid being misled by variance too often!

Freedom!
