Lyric Analysis - Rolling Stone Top 100 Artists
Also check out a similar analysis on Reddit's Default Subreddit comments
After seeing this analysis of the lyrics used by hip hop artists, I was curious to see how these numbers stacked up against artists in other genres. So this is what I've got so far ...
Rolling Stone Magazine have compiled a list of their top 100 musical artists of all time. I figure that's probably a pretty good place to start: it's an interesting list with a wide range of artists, and some of the entries make for good reading.
Using that list as a base, I pulled in a discography for each of those artists from LyricWikia, using their API. I wasn't able to collect a discography for all the artists, but I got most of them.
I then went on to collect the lyrics for those songs from Chartlyrics (I would encourage you to consider contributing to their free community). Their API appeared to be rate limited to 1 request every 20 seconds, so I ended up scraping the lyrics web pages instead. I couldn't get all the songs for all the artists, but I got enough to work from.
Below is a visualisation of the lyrics I collected. I've filtered out the artists for whom I collected fewer than 10,000 words. Click an artist's lexical density bar on the left and a bubble chart with that artist's top N words is presented; N can be adjusted to show more or fewer words. There is also a set of radio buttons allowing you to filter out sets of words. "None" includes all words in the artist's top list. "Common" filters out all words in Wikipedia's 100 most common words list (just base words, not lexemes). "Pronouns" filters out pronouns. The right set of bars shows the number of times each word appears when the top 100 words for all artists are combined. The bar colour changes to match the colour of the word in the bubble chart.
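As a rough illustration of how the top N words and the "Exclude" filters could be computed, here is a minimal Python sketch. It is not the code behind the visualisation: the function name top_words, the tokenising regular expression, and the two word sets are all assumptions made for the example, and the word sets are small illustrative subsets rather than the full lists.

```python
import re
from collections import Counter

# Illustrative subsets only (assumption): the page filters on Wikipedia's
# full list of the 100 most common words and on a fuller set of pronouns.
COMMON_WORDS = {"the", "be", "to", "of", "and", "a", "in", "that", "have", "i"}
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "me", "him", "her", "us", "them"}

# Maps each radio button label to the set of words it removes.
EXCLUSIONS = {"none": set(), "common": COMMON_WORDS, "pronouns": PRONOUNS}


def top_words(lyrics, n=10, exclude="none"):
    """Return the n most frequent words as (word, count) pairs."""
    if exclude not in EXCLUSIONS:
        raise ValueError(f"unknown filter: {exclude!r}")
    excluded = EXCLUSIONS[exclude]
    # Lowercase the text and keep simple words, allowing one apostrophe
    # so that contractions such as "don't" stay whole.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", lyrics.lower())
    counts = Counter(word for word in words if word not in excluded)
    return counts.most_common(n)


if __name__ == "__main__":
    sample = "I love you and you love me"
    print(top_words(sample, n=2, exclude="pronouns"))
```

Raising or lowering n corresponds to the "Number of words" control, and the exclude argument corresponds to the radio buttons.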
Lexical density seems like a reasonably good indicator of the vocabulary used by each artist. It differs from the measure used in the original hip hop analysis, but I think it allows for a good comparison between these various artists. As a point of reference from the hip hop data, Ice T, pretty much in the middle of the pack, has a lexical density of (4431/35000)*100 = 12.66%. The lowest entry is DMX at 9.18%, and the highest is Aesop Rock at a whopping 21.12%.
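The Ice T figure suggests the calculation is unique words divided by total words, times 100. Assuming that reading is right, a minimal Python sketch of the measure might look like the following. The function names and the tokenising rule are hypothetical, chosen for the example, and are not taken from the code behind this page.

```python
import re


def density_from_counts(unique_words, total_words):
    """Unique words as a percentage of total words."""
    if total_words == 0:
        return 0.0
    return unique_words / total_words * 100


def simple_lexical_density(lyrics):
    """Approximate density for a block of lyrics (assumed tokenisation)."""
    # Lowercase the text and keep simple words, allowing one apostrophe
    # so that contractions such as "don't" count as a single word.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", lyrics.lower())
    return density_from_counts(len(set(words)), len(words))


if __name__ == "__main__":
    # Reproduces the Ice T reference figure quoted above.
    print(round(density_from_counts(4431, 35000), 2))
    print(simple_lexical_density("la la la love"))
```

Because longer texts tend to repeat more words, a ratio like this falls as the word count grows, so it compares most fairly between artists with a similar number of collected words.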
Note: I've just realised that what I've calculated here is not a true lexical density; it's a very simplified approximation. I'm working towards implementing a true lexical density, but that may take a bit of time. Until then, I think the calculation I've got here still provides a good measure to compare these artists. Here's a similar analysis that only looks at the first 20,000 words from each artist and doesn't try to calculate lexical density.
So that's what I've got at the moment. It's a work in progress and I've got a few more ideas about what I can do with this set of data. I'll see where it goes next.
(Note: Tool and NOFX do not appear in Rolling Stone's list. They were used in some early test cases for the lyric collection process and I've decided to keep them in there.)
Let me know what you think, moc.liamg@timmorton