Friday, June 12, 2009

Zipf's Law

In response to a comment I received for the post "What's on the Radio?, Part 2," yes, the data is largely Zipfian.

Zipf's Law is an empirical law stating that, in any collection of natural language utterances, the frequency of a word is inversely proportional to its frequency rank. This means that in a body of text such as all the words contained in this blog, the most frequent word ("the") should appear about twice as often as the second most frequent word and about three times as often as the third; the second most frequent word should in turn appear about twice as often as the fourth; and so on.
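Written as a formula, the same statement looks like the sketch below. The symbols are my own notation, not part of any standard statement of the law: f(r) is the frequency of the word at rank r, and C is an unknown constant that depends on the size of the text.

```latex
f(r) \approx \frac{C}{r}
\quad\Longrightarrow\quad
\frac{f(1)}{f(2)} \approx \frac{C/1}{C/2} = 2,
\qquad
\frac{f(2)}{f(4)} \approx \frac{C/2}{C/4} = 2
```

Taking logarithms of both sides gives log f(r) ≈ log C - log r, which is a straight line with slope -1 when log frequency is plotted against log rank.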

Something that has intrigued statisticians for decades is the fact that many types of data besides natural language data can also be well approximated by Zipfian distributions. The test to see if some data is Zipfian is to plot the logarithm of the frequency against the logarithm of the rank. If the resulting data points tend to fall along a straight line, the data follows a power law, and if the slope of that line is close to -1, the data is Zipfian in the classic sense.
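For anyone who wants to run the test numerically rather than by eye, here is a minimal sketch in Python. It is not the method used to produce the plot in this post; the function name `loglog_slope` and the sample numbers are made up for illustration, and the fit is an ordinary least-squares line through the log-log points.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    # Rank 1 goes to the largest value; zeros are dropped because log(0) is undefined.
    ordered = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(f) for f in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical values, constructed to be exactly proportional to 1/rank.
sample = [1200.0 / rank for rank in range(1, 51)]
print(round(loglog_slope(sample), 3))
```

A slope near -1 is what the classic form of the law predicts. Real data rarely follows the line over its whole range, so it is worth looking at the plot as well as the fitted number.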

In this case, which is only analogous to the natural language domain, frequency of words is replaced by airtime of composers. We see in the scatter plot above that the data is in fact Zipfian for a large range of values.
