Looking Inside Eminem’s Lyrics – part 1

I started analyzing Eminem’s lyrics. My initial question was: what are the most common words Eminem uses in his lyrics? I collected the names of all 287 of his songs from this link. Then I collected the lyrics using the Python libsongtext library, which fetches lyrics from lyricwiki.org. My Python code managed to gather lyrics for 223 of those 287 songs. After gathering the lyrics, I had to process the text. Normally in language processing tasks we do chunking, lemmatization, stemming and spell checking. I used the NLTK library for most of this, plus enchant for the spell check. Actually, I had to avoid lemmatization, as it was chopping off lots of interesting words that I wanted to keep in their existing form. I also created a banned-words list, as Eminem uses a lot of words like ‘na’, ‘wa’ and ‘oh’, which don’t carry much meaning. Then I used NLTK’s word frequency method (FreqDist) to find the frequency of each word. The top 20 words were:

[(u'like', 1375), (u'get', 1049), (u'got', 907), (u'shit', 740), (u"'cause", 729),
(u'know', 701), (u'back', 674), (u'fuck', 671), (u'eminem', 593), (u'go', 557),
(u'see', 514), (u'one', 497), (u'say', 476), (u'never', 430), (u'bitch', 428),
(u'man', 428), (u'let', 422), (u'time', 411), (u'come', 392), (u'think', 361)]
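
The counting step can be sketched with the standard library alone. This is only a minimal illustration in Python 3, not the NLTK-based script at the end of the post: the tokenizer regex, the tiny stop-word and banned-word sets, the sample line and the helper name `top_words` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, tiny stand-ins for the real lists (the post uses NLTK's
# English stop words and a longer hand-made banned list).
STOP_WORDS = {"i", "the", "a", "you", "it", "to", "and"}
BANNED_WORDS = {"na", "wa", "oh", "yeah"}


def top_words(text, n=20):
    """Return the n most common words after dropping stop and banned words."""
    # Lowercase, then keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS and t not in BANNED_WORDS]
    return Counter(kept).most_common(n)


# Made-up sample line, only to exercise the function.
sample = "Oh yeah, I got it like you got it, na na, I like it"
print(top_words(sample, n=2))
```

The result has the same shape as the list above: (word, count) pairs sorted by count.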

And yeah, apparently Eminem curses a lot in his songs, as you can see in the plot below from the words “shit” (rank 4), “fuck” (rank 8) and “bitch” (rank 15).

Frequency of top 20 words

The word ‘love’ has been used 282 times, just a bit less than the word ‘ass’, which was used 295 times. You can see the word ‘dre’ has been used a lot, and it’s most likely Dr. Dre, who worked with Eminem. The word ‘man’ is used more than the word ‘girl’. The word ‘hate’ is used much less than ‘love’, only 116 times. Here are two more plots of word frequency.

Frequency of top 50 words
Frequency of top 100 words

Anyway, a simple bag of words probably doesn’t give a good representation of a particular song. For example, the word ‘love’ can appear in the sentence “I love you” but also in “I don’t love you”, which has the opposite meaning, yet here both are counted together. Before moving on to contextual analysis, I was thinking about doing another frequency analysis based on Russell’s model of mood. You basically divide the xy-plane into four quadrants, as you can see in the image below.

Model of mood
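
The four regions can be written down as a small function. This is a sketch under stated assumptions: the x-axis is taken to be valence (unpleasant to pleasant) and the y-axis arousal (calm to energetic), both centred on 0, and the four labels simply follow the word groups used in the code at the end of the post. The function name and the sample points are hypothetical.

```python
def mood_quadrant(valence, arousal):
    """Map a (valence, arousal) point to one of four mood regions.

    Assumes both axes are centred on 0; points on an axis are assigned
    to the non-negative side.
    """
    if valence >= 0 and arousal >= 0:
        return "happy"    # pleasant, high energy
    if valence < 0 and arousal >= 0:
        return "angry"    # unpleasant, high energy
    if valence < 0 and arousal < 0:
        return "sad"      # unpleasant, low energy
    return "relaxed"      # pleasant, low energy


# Made-up points, only to show the mapping.
for point in [(0.8, 0.6), (-0.7, 0.9), (-0.5, -0.4), (0.6, -0.7)]:
    print(point, mood_quadrant(*point))
```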

I want to see where Eminem’s music in general falls in this emotional plane. There are more interesting analyses I can do later using word vectors and other newer NLP techniques. I’ll eventually look into other artists and other genres, and try to find whether there are different patterns in how words are chosen and in the kind of emotion certain songs may generate.
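
As a very small step toward context, the “I love you” versus “I don’t love you” problem can be illustrated by counting adjacent word pairs (bigrams) next to single words. This is a sketch, not part of the analysis above; the helper name `bigrams` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def bigrams(tokens):
    """Return the list of adjacent word pairs in tokens."""
    return list(zip(tokens, tokens[1:]))


lines = ["i love you", "i don't love you"]

unigram_counts = Counter()
bigram_counts = Counter()
for line in lines:
    tokens = line.split()  # naive whitespace tokenization
    unigram_counts.update(tokens)
    bigram_counts.update(bigrams(tokens))

print(unigram_counts["love"])
print(bigram_counts[("i", "love")], bigram_counts[("don't", "love")])
```

Both lines add to the single count for ‘love’, while the pair counts keep the plain and the negated use apart.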

Code for getting Lyrics:

import lxml
from lxml import html
import requests

import pickle

import numpy as np
import libsongtext
from libsongtext import lyricwiki
from libsongtext import utils

import pprint as pp

artist_name = 'eminem'

url = 'http://www.spin.com/2014/10/eminem-every-song-ranked/'
#f = urllib.urlopen(url)
f = requests.get(url)

html_page = f.content#f.read()
tree = html.fromstring(html_page)

song_name_xpath=tree.xpath('//div[@class="article-content clearfix"]//strong/a')

num = 1
lyrics_list = {}
lyrics_not_found_list = []
success_lyrics_cnt = 0
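for s in song_name_xpath:
    # song title, stripped of non-ASCII characters
    song = ''.join(s.text.encode('ascii', 'ignore').strip())

    print 'No. ' + str(num)
    num = num + 1
    print 'track : ' + song

    args = {}
    args['artist'] = artist_name
    args['title'] = song.strip()
    title = args['title']
    # skip titles we already have lyrics for
    if not lyrics_list.get(title):
        # the try/except is reconstructed; assumes a failed lookup raises
        try:
            t = lyricwiki.LyricWikiSong(args)
            lyrics = t.get_lyrics()
            print "Got Lyrics."
            lyrics_list[title] = lyrics
            success_lyrics_cnt += 1
        except Exception:
            print "Failed to get Lyrics."
            lyrics_not_found_list.append(title)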

print 'Successfully got ', success_lyrics_cnt, ' lyrics out of ', len(song_name_xpath), ' tracks'

def save_obj(obj, path, name):
    # pickle obj to <path><name>.pkl
    with open(path + name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(path, name):
    # load the object stored in <path><name>.pkl
    with open(path + name + '.pkl', 'rb') as f:
        return pickle.load(f)

save_obj(lyrics_list, '/Users/andy/Documents/projects/music/lyrics_database/eminem/', 'eminem_song_lyrics')

Code for word frequency analysis in Lyrics:

import pickle
import string

import nltk
from nltk import word_tokenize
from nltk import sent_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn  # used by synonyms() below

import enchant
from enchant.checker import SpellChecker

eng_dict = enchant.Dict("en_US")

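#import lyrics of eminem
# pickle file written by the lyrics-gathering script above
eminem_lyrics_pickle_file = '/Users/andy/Documents/projects/music/lyrics_database/eminem/eminem_song_lyrics.pkl'
with open(eminem_lyrics_pickle_file, 'rb') as f:
    lyrics_list = pickle.load(f)

lyrics= lyrics_list.values()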

# english words
#words = set(nltk.corpus.words.words())

porter = nltk.PorterStemmer()
wnl = nltk.WordNetLemmatizer()

def plot_freqdist_freq(fd,
                       max_num=None,
                       cumulative=False,
                       title='Frequency plot',
                       linewidth=2):
    """
    As of NLTK version 3.2.1, FreqDist.plot() plots the counts
    and has no kwarg for normalising to frequency.
    Work this around here.

    INPUT:
        - the FreqDist object
        - max_num: if specified, only plot up to this number of items
          (they are already sorted descending by the FreqDist)
        - cumulative: bool (defaults to False)
        - title: the title to give the plot
        - linewidth: the width of line to use (defaults to 2)
    OUTPUT: plot the freq and return None.
    """

    # normalise a copy so the original counts stay untouched
    tmp = fd.copy()
    norm = fd.N()
    for key in tmp.keys():
        tmp[key] = float(fd[key]) / norm

    if max_num:
        tmp.plot(max_num, cumulative=cumulative,
                 title=title, linewidth=linewidth)
    else:
        tmp.plot(cumulative=cumulative,
                 title=title, linewidth=linewidth)


stem_tokens = ['ed', 'ies', '\'s' , 'n\'t', '\'m', '--', '\'\'']
banned_words = ['ha', 'wa', 'ta', 'u', 'i', 'ai', 'na', 'ca', '...', '..', '\'em', '\'en', 'wan', '`', '``',
'oh', 're', '\'re', '\'ne', 'yea', 'yeah', 'ya', 'yah', '\'ve', '\'d', 'wo', 'oh', 'ooh',
'\'ll', 'yo', 'is\\u2026', 'ah', 'wit', 'would', '\\u2019']

#['i\'ma', 'y\'ll']

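def synonyms(word):
    # collect WordNet lemma names and similar-to synsets for word
    syns = []
    for synset in wn.synsets(word):
        sim_words = synset.similar_tos()
        sim_words += synset.lemma_names()
        for sim in sim_words:
            s = sim
            # Synset objects carry names like 'happy.a.01'; keep the word part
            if hasattr(s, '_name') :
                s = sim._name.split(".")[0]
            syns.append(s)

    syns = set(syns)
    return syns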

def stem(word):
    # drop filler tokens outright
    if word in banned_words:
        return False

    # strip the first matching suffix ('' when the token is only a suffix)
    for suffix in stem_tokens:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

lyrics_edited = []
chkr = SpellChecker("en_US")

edited_tokens = []
i = 1
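for s, l in lyrics_list.items():
    print i, ". Processing song: \"", s, "\""
    i += 1
    # find wrongly spelled words
    # (set_text / err_words lines are reconstructed from context)
    chkr.set_text(l)
    err_words = []
    for err in chkr:
        err_words.append(err.word)

    tokens = word_tokenize(l)
    l_txt = nltk.Text(tokens)

    for t in tokens:
        tn = t.lower()
        #tn = porter.stem(t)
        #tn = wnl.lemmatize(tn)

        tn = stem(tn)
        if tn and tn not in err_words and tn not in stopwords.words('english') and tn not in list(string.punctuation):
            # keep the cleaned token (reconstructed line)
            edited_tokens.append(tn)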

uniq_tokens = set(edited_tokens)

fdist = nltk.FreqDist(edited_tokens)

#Russell's Model of mood
# each block expands four seed words into a de-duplicated synonym list
mood_happy_words = ['Exhilarated', 'Excited', 'Happy', 'Pleasure']
mood_h = []
for ws in mood_happy_words:
    for w in synonyms(ws):
        mood_h.append(w)

mood_h = list(set(mood_h))

mood_angry_words = ['Anxious', 'Angry', 'Terrified', 'Disgusted']
mood_a = []
for ws in mood_angry_words:
    for w in synonyms(ws):
        mood_a.append(w)

mood_a = list(set(mood_a))

mood_sad_words = ['Sad', 'Despairing', 'Depressed', 'Bored']
mood_s = []
for ws in mood_sad_words:
    for w in synonyms(ws):
        mood_s.append(w)

mood_s = list(set(mood_s))

mood_relaxed_words = ['Relaxed', 'Serene', 'Tranquil', 'Calm']
mood_r = []
for ws in mood_relaxed_words:
    for w in synonyms(ws):
        mood_r.append(w)

mood_r = list(set(mood_r))
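
The script stops once the four mood word lists are built. One possible next step is to count how many lyric tokens land in each list. The following is a minimal Python 3 sketch of that idea, not the code used for this post: the small hand-written word sets stand in for `mood_h`, `mood_a`, `mood_s` and `mood_r`, and `mood_counts` and the sample tokens are hypothetical.

```python
from collections import Counter

# Hypothetical stand-ins for the WordNet-expanded lists built above.
mood_words = {
    "happy": {"happy", "excited", "glad"},
    "angry": {"angry", "anxious", "mad"},
    "sad": {"sad", "depressed", "bored"},
    "relaxed": {"relaxed", "calm", "serene"},
}


def mood_counts(tokens, mood_words):
    """Count how many tokens fall into each mood word set."""
    # Start every mood at 0 so moods with no hits still show up.
    counts = Counter({mood: 0 for mood in mood_words})
    for token in tokens:
        for mood, words in mood_words.items():
            if token in words:
                counts[mood] += 1
    return counts


# Made-up tokens, only to exercise the function.
tokens = ["angry", "mad", "calm", "like", "sad", "angry"]
counts = mood_counts(tokens, mood_words)
print(counts.most_common())
```

Dividing each count by the total number of tokens would make songs of different lengths comparable, in the same way `plot_freqdist_freq` normalises the word counts.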