
I’ve been thinking a little bit about covariance matrices recently, particularly how to estimate them from sample data. A typical problem in finance is to estimate a covariance matrix for a universe of stocks from a finite sample of returns data. The problem is that the number of parameters quickly becomes large, and can completely dwarf the number of samples you have available.

For example, consider trying to estimate the correlation of the stocks in the S&P 500 from a year of daily returns data — overall that’s about $500 \times 250 = 125,000$ pieces of information, to estimate a total of $\tfrac{1}{2} \times 500 \times 501 = 125,250$ parameters. Oops. As if that wasn’t bad enough, you are building a matrix with $500$ rows out of only $250$ columns — this means that your matrix is going to be of rank $250$ at best, and possibly lower. In particular, it won’t be invertible. If your application requires an invertible covariance matrix — say, if you’re building a Markowitz portfolio — then you’re already out of luck.
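The rank problem is easy to see numerically. The following is a minimal sketch using synthetic Gaussian returns as a stand-in for real S&P 500 data (the random data, the seed and the variable names are all assumptions made for illustration). Note that subtracting each stock's mean return costs one more dimension, so the rank is at most 249 here rather than 250.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_days = 500, 250

# Synthetic i.i.d. returns standing in for real data:
# one row per stock, one column per trading day.
returns = rng.normal(scale=0.01, size=(n_stocks, n_days))

# np.cov treats each row as a variable, so this is a 500 x 500 matrix.
sample_cov = np.cov(returns)

n_observations = returns.size                 # 500 * 250
n_parameters = n_stocks * (n_stocks + 1) // 2  # distinct entries of a symmetric matrix
rank = np.linalg.matrix_rank(sample_cov)

print(n_observations, n_parameters)
print(rank, "out of", n_stocks)
```

Because the rank falls short of 500, the sample matrix has zero eigenvalues and cannot be inverted.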

Fortunately there are several techniques that can be employed to improve the estimation of the covariance matrix, and ensure that it has full rank and is invertible. The simplest of these is called shrinkage, which is a fancy name for something simple: we transform to the correlation matrix, scale that toward the identity matrix, and then transform back into covariance space:

$\Sigma_{\rm new} = \lambda \Sigma_{\rm sample} + (1-\lambda) I$

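for some $0\leq\lambda< 1$, where $\Sigma_{\rm sample}$ here denotes the sample correlation matrix. Every eigenvalue of $\Sigma_{\rm new}$ is then at least $1-\lambda > 0$, which ensures that the correlation matrix, and hence the covariance matrix, is invertible. (Taking $\lambda = 1$ just returns the sample matrix unchanged, so it doesn't help.) Over the next few posts I’m going to look into some more advanced techniques, including techniques that allow practitioners to incorporate their prior beliefs about market structure into the model.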
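Here is a minimal sketch of that recipe in Python. The function name `shrink_covariance`, the choice $\lambda = 0.5$ and the synthetic returns are all assumptions made for illustration, and the code assumes every stock has strictly positive sample variance. It rejects $\lambda = 1$, since that would hand back the singular sample matrix untouched.

```python
import numpy as np

def shrink_covariance(sample_cov, lam):
    """Shrink the correlation matrix toward the identity, then map back to covariance."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must satisfy 0 <= lam < 1")
    std = np.sqrt(np.diag(sample_cov))       # per-stock standard deviations
    scale = np.outer(std, std)
    corr = sample_cov / scale                 # covariance -> correlation
    shrunk_corr = lam * corr + (1.0 - lam) * np.eye(len(std))
    return shrunk_corr * scale                # correlation -> covariance

# Toy example: 20 stocks but only 10 days of synthetic returns,
# so the sample covariance matrix is rank deficient.
rng = np.random.default_rng(1)
returns = rng.normal(scale=0.01, size=(20, 10))
sample_cov = np.cov(returns)
shrunk_cov = shrink_covariance(sample_cov, lam=0.5)

print(np.linalg.matrix_rank(sample_cov), np.linalg.matrix_rank(shrunk_cov))
```

Shrinking in correlation space leaves the diagonal alone, so each stock keeps its sample variance and only the off-diagonal terms are pulled toward zero.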

Everyone knows that movie sequels nearly always suck, and that trilogies are the worst of the bunch. If you want a sure-fire way to ruin a good movie, then make two sequels.

But how does this piece of folk knowledge really hold up? I decided to investigate this afternoon, and work out quantitatively (using science) just how bad the second and third movies in a trilogy are.

My first task was to make a list of trilogies — using a combination of asking my friends on Twitter and my officemates, I got a list of 38 well known trilogies, including The Godfather, Jaws, Star Wars (I-III and IV-VI) and Die Hard. I then visited the IMDB page for each movie and recorded its aggregate score out of ten. These could then be plotted on a graph, to indicate visually just how much sequels suck:

Ratings out of ten for 38 movie trilogies.

One unsurprising result is that the second movie in a trilogy tends to be worse than the first, and the third movie tends to be worse than both of those. The average rating for the first movie was 7.58, as compared to 6.82 for the second movie and 6.38 for the third.

Something a bit more surprising is that the second and third movies in a trilogy aren’t uniformly worse — instead there tends to be more of a spread in quality, whereby second movies are much more variable in quality than first movies, and third movies are even more so. This can be seen in the standard deviations of the ratings: 0.73 for the first movie, 1.2 for the second movie and 1.4 for the third.

Finally, as with any rule, there are exceptions. The most striking is Sergio Leone’s Man With No Name trilogy, comprising A Fistful of Dollars, For a Few Dollars More, and The Good, The Bad and the Ugly. Each of these movies was better than the last, with the third movie getting an average rating of 9.0, the highest of any third movie in a trilogy.

Similarly, The Lord of the Rings maintained high ratings throughout, with the third movie being rated the best of the trilogy, although in this case the effect isn’t as dramatic. It’s noticeable that with both these exceptions, the three movies were conceptualized as a trilogy, planned that way from the beginning, rather than the second and third movies being made to cash in on the success of the first. Perhaps there is something to artistic merit after all.

I can make the full data set available on request, if anyone thinks they can do something cool with it.