Looking Inside Eminem’s Lyrics – part 1

I started analyzing Eminem's lyrics. My initial question was: what are the most common words Eminem has used in his lyrics? I collected the names of all 287 of his songs from this link, then gathered the lyrics with the Python songtext library, which pulls lyrics from lyricwiki.org. My code successfully fetched lyrics for 223 of those 287 songs. After gathering the lyrics, I had to process the text. Typical language-processing steps include chunking, lemmatization, stemming, and spell checking, and I used the NLTK library for all of these. I ended up skipping lemmatization, because it was chopping lots of interesting words out of their original forms. I also created a banned-words list, since Eminem uses a lot of filler words like 'na', 'wa', and 'oh' that carry little semantic meaning. Finally, I used NLTK's word-frequency method to count word occurrences. The top 20 words were:
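The filtering pipeline above can be sketched in miniature. This is a toy version in pure Python (no NLTK), with a made-up line of text and tiny banned-word and stopword lists standing in for the real ones:

```python
from collections import Counter
import string

# Toy stand-ins for the real data; the actual analysis used NLTK
# tokenization, NLTK's English stopwords, and a much longer banned list.
banned_words = {'na', 'wa', 'oh', 'yeah'}
stopwords = {'i', 'you', 'the', 'a', 'to', 'and', 'it', 'my'}

def top_words(lyrics, n=3):
    tokens = lyrics.lower().split()
    # strip punctuation, then drop banned words, stopwords, and empties
    cleaned = [t.strip(string.punctuation) for t in tokens]
    kept = [t for t in cleaned
            if t and t not in banned_words and t not in stopwords]
    return Counter(kept).most_common(n)

print(top_words("oh yeah I shake it like this, like that, like this"))
# -> [('like', 3), ('this', 2), ('shake', 1)]
```

The real pipeline does the same thing at scale, with NLTK's tokenizer and FreqDist in place of `split()` and `Counter`.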

[(u'like', 1375), (u'get', 1049), (u'got', 907), (u'shit', 740), (u"'cause", 729),
(u'know', 701), (u'back', 674), (u'fuck', 671), (u'eminem', 593), (u'go', 557),
(u'see', 514), (u'one', 497), (u'say', 476), (u'never', 430), (u'bitch', 428),
(u'man', 428), (u'let', 422), (u'time', 411), (u'come', 392), (u'think', 361)]

And yes, apparently Eminem curses a lot in his songs, as you can see in the plot below: "shit" (rank 4), "fuck" (rank 8), "bitch" (rank 15).

Frequency of top 20 words

The word 'love' appears 282 times, just a bit less than the word 'ass', which appears 295 times. The word 'dre' shows up a lot too, most likely referring to Dr. Dre, who has worked with Eminem extensively. The word 'man' is used more than the word 'girl', and 'hate' is used far less than 'love', only 116 times. Here are two more word-frequency plots.

Frequency of top 50 words
Frequency of top 100 words

Anyway, a simple bag of words probably doesn't give a good representation of a particular song. For example, the word 'love' can appear in the sentence "I love you" but also in "I don't love you", which has the completely opposite meaning, yet both occurrences are counted together here. Before attempting contextual analysis, I thought about doing another frequency analysis according to Russell's model of mood, which divides the xy-plane (valence vs. arousal) into four orthogonal regions, as you can see in the image below.

Model of mood
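As a toy sketch of the quadrant idea: score each word on valence (pleasant vs. unpleasant) and arousal (high vs. low energy), and the pair of signs picks one of the four regions. The axis values and the four-word lexicon below are illustrative stand-ins, not real data; an actual analysis would use a valence/arousal lexicon.

```python
# Hypothetical miniature valence/arousal lexicon: (valence, arousal) pairs.
lexicon = {
    'happy':   ( 0.8,  0.5),   # pleasant, energetic    -> happy quadrant
    'angry':   (-0.7,  0.8),   # unpleasant, energetic  -> angry quadrant
    'sad':     (-0.6, -0.5),   # unpleasant, low energy -> sad quadrant
    'relaxed': ( 0.7, -0.6),   # pleasant, low energy   -> relaxed quadrant
}

def quadrant(word):
    valence, arousal = lexicon[word]
    # the sign of each coordinate selects one of Russell's four regions
    if valence >= 0:
        return 'happy' if arousal >= 0 else 'relaxed'
    return 'angry' if arousal >= 0 else 'sad'

print(quadrant('angry'))    # -> angry
```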

I want to see where Eminem's music in general falls on this emotional plane. There is more interesting analysis to do later using word vectors and other newer NLP techniques. I'll eventually look into other artists and other genres, and try to find whether there are different patterns in how words are chosen and in the kinds of emotion certain songs may generate.

Code for getting Lyrics:

import lxml
from lxml import html
import requests

import pickle

import numpy as np
import libsongtext
from libsongtext import lyricwiki
from libsongtext import utils

import pprint as pp

artist_name = 'eminem'

url = 'http://www.spin.com/2014/10/eminem-every-song-ranked/'
f = requests.get(url)

html_page = f.content
tree = html.fromstring(html_page)

song_name_xpath = tree.xpath('//div[@class="article-content clearfix"]//strong/a')

num = 1
lyrics_list = {}
lyrics_not_found_list = []
success_lyrics_cnt = 0
for s in song_name_xpath:
    song = ''.join(s.text.encode('ascii', 'ignore').strip())

    print 'No. ' + str(num)
    num = num + 1
    print 'track : ' + song

    args = {}
    args['artist'] = artist_name
    args['title'] = song.strip()
    title = args['title']
    if not lyrics_list.get(title):
        try:
            t = lyricwiki.LyricWikiSong(args)
            lyrics = t.get_lyrics()
            print "Got Lyrics."
            lyrics_list[title] = lyrics
            success_lyrics_cnt += 1
        except Exception:
            print "Failed to get Lyrics."
            lyrics_not_found_list.append(title)

print 'Successfully got ', success_lyrics_cnt, ' lyrics out of ', len(song_name_xpath), ' tracks'

def save_obj(obj, path, name):
    with open(path + name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(path, name):
    with open(path + name + '.pkl', 'rb') as f:
        return pickle.load(f)

save_obj(lyrics_list, '/Users/andy/Documents/projects/music/lyrics_database/eminem/', 'eminem_song_lyrics')

Code for word frequency analysis in Lyrics:

import pickle
import string

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn

import enchant
from enchant.checker import SpellChecker

eng_dict = enchant.Dict("en_US")

# import lyrics of eminem
f = open(eminem_lyrics_pickle_file, 'rb')
lyrics_list = pickle.load(f)

lyrics = lyrics_list.values()

# english words
#words = set(nltk.corpus.words.words())

porter = nltk.PorterStemmer()
wnl = nltk.WordNetLemmatizer()

def plot_freqdist_freq(fd, max_num=None, cumulative=False,
                       title='Frequency plot', linewidth=2):
    """
    As of NLTK version 3.2.1, FreqDist.plot() plots the counts
    and has no kwarg for normalising to frequency.
    Work around this here.

    INPUT:
    - fd: the FreqDist object
    - max_num: if specified, only plot up to this number of items
      (they are already sorted descending by the FreqDist)
    - cumulative: bool (defaults to False)
    - title: the title to give the plot
    - linewidth: the width of line to use (defaults to 2)
    OUTPUT: plot the freq and return None.
    """
    tmp = fd.copy()
    norm = fd.N()
    for key in tmp.keys():
        tmp[key] = float(fd[key]) / norm

    if max_num:
        tmp.plot(max_num, cumulative=cumulative,
                 title=title, linewidth=linewidth)


stem_tokens = ['ed', 'ies', '\'s' , 'n\'t', '\'m', '--', '\'\'']
banned_words = ['ha', 'wa', 'ta', 'u', 'i', 'ai', 'na', 'ca', '...', '..', '\'em', '\'en', 'wan', '`', '``',
'oh', 're', '\'re', '\'ne', 'yea', 'yeah', 'ya', 'yah', '\'ve', '\'d', 'wo', 'oh', 'ooh',
'\'ll', 'yo', 'is\\u2026', 'ah', 'wit', 'would', '\\u2019']

#['i\'ma', 'y\'ll']

def synonyms(word):
    syns = []
    for syn in wn.synsets(word):
        sim_words = syn.similar_tos()
        sim_words += syn.lemma_names()
        for sim in sim_words:
            s = sim
            if hasattr(s, '_name'):
                s = sim._name.split(".")[0]
            syns.append(s)

    syns = set(syns)
    return syns

def stem(word):
    # drop banned filler words entirely
    if word in banned_words:
        return False
    # chop known suffixes/contraction endings
    for suffix in stem_tokens:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

lyrics_edited = []
chkr = SpellChecker("en_US")

edited_tokens = []
i = 1
for s, l in lyrics_list.items():
    print i, ". Processing song: \"", s, "\""
    i += 1
    # find wrongly spelled words
    chkr.set_text(l)
    err_words = [err.word for err in chkr]

    tokens = word_tokenize(l)
    l_txt = nltk.Text(tokens)

    for t in tokens:
        tn = t.lower()
        #tn = porter.stem(t)
        #tn = wnl.lemmatize(tn)

        tn = stem(tn)
        if tn and tn not in err_words and tn not in stopwords.words('english') and tn not in list(string.punctuation):
            edited_tokens.append(tn)

uniq_tokens = set(edited_tokens)

fdist = nltk.FreqDist(edited_tokens)

#Rusell's Model of mood
mood_happy_words = ['Exhilarated', 'Excited', 'Happy', 'Pleasure']
mood_h = []
for ws in mood_happy_words:
for w in synonyms(ws):

mood_h = list(set(mood_h))

mood_angry_words = ['Anxious', 'Angry', 'Terrified', 'Disgusted']
mood_a = []
for ws in mood_angry_words:
for w in synonyms(ws):

mood_a = list(set(mood_a))

mood_sad_words = ['Sad', 'Despairing', 'Depressed', 'Bored']
mood_s = []
for ws in mood_sad_words:
for w in synonyms(ws):

mood_s = list(set(mood_a))

mood_relaxed_words = ['Relaxed', 'Serene', 'Tranquil', 'Calm']
mood_r = []
for ws in mood_relaxed_words:
for w in synonyms(ws):

mood_r = list(set(mood_r))


Message Passing using OpenMPI: Production and Consumption via a Broker

I am working on several assignments, and through the problem-solving process I am learning OpenMPI. OpenMPI is a great library that gives you the ability to use multiple CPUs in parallel. Imagine you are given 100 unsorted random numbers. Normally you would write a sequential program implementing an O(n log n) sorting algorithm like quicksort or merge sort to put them in increasing or decreasing order. But what if you partition the 100 numbers into 10 buckets, sort each bucket with your sequential algorithm in parallel using 10 processes, and combine the 10 buckets at the end? This is just one example of how parallel processing can give a speedup.

Sometimes certain information needs to be synchronized across all running processes for further computation. Unlike threads, processes don't share memory, so you cannot simply use a global variable for storage. Moreover, you can never tell exactly which instruction a process is executing at a given moment. Processes communicate with each other by sending and receiving messages; that is why MPI stands for Message Passing Interface.

The first problem I have been working on is one where a producer produces an item and sends it to a broker; the broker receives the message, saves the item in its internal buffer, and finally sends an acknowledgement message back to the producer. However, the broker can insert a newly produced item into the buffer only if the buffer is not full; when it is full, the producers wait. The broker also waits for consumer requests to consume items. A consumer sends a request to the broker, and if the buffer is not empty, the broker sends an item back for consumption on receiving a valid request. While handling the consumer request, the broker also releases waiting producers by inserting their produced items once buffer space frees up.
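The bucket idea from the paragraph above can be sketched sequentially. In the MPI version each bucket would be sorted by a separate process; here the "parallel" phase is simulated with a plain loop, which keeps the sketch runnable on its own:

```python
import random

def bucket_sort(numbers, num_buckets=10):
    lo, hi = min(numbers), max(numbers)
    width = (hi - lo) / num_buckets or 1
    buckets = [[] for _ in range(num_buckets)]
    # partition: each number lands in the bucket covering its value range
    for x in numbers:
        i = min(int((x - lo) / width), num_buckets - 1)
        buckets[i].append(x)
    # in MPI, each of these sorts would run on its own process in parallel
    for b in buckets:
        b.sort()
    # combining is a simple concatenation, because the bucket ranges
    # are already ordered relative to each other
    return [x for b in buckets for x in b]

nums = [random.randrange(1000) for _ in range(100)]
assert bucket_sort(nums) == sorted(nums)
```

The combine step is cheap precisely because bucket i only holds values smaller than everything in bucket i+1; that is what makes this decomposition attractive for parallelism.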
This whole production/consumption scheme then needs to run for a given amount of time. There are several challenges in implementing this parallel program. Deadlock is a common issue in parallel programming: it occurs when two or more processes each wait on the other for access to the same resource, so neither can make progress. A related hazard is starvation. Imagine a professor discussing with two of his students, where the old professor can only remember one question at a time (his memory is like the communication buffer). The first student asks the professor a question, and until the professor answers it, the second student cannot ask anything. Now, if the first student is constantly asking questions, the second student may never get a chance to ask at all. (…)
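The buffer discipline the broker enforces (producers wait while the buffer is full, consumers wait while it is empty) is the classic bounded-buffer pattern. Here is a minimal shared-memory sketch of it using Python threads and a condition variable; it is a stand-in for intuition only, since the MPI version must do the same thing with explicit messages instead of shared state:

```python
import threading
from collections import deque

class BoundedBuffer:
    def __init__(self, size):
        self.size = size
        self.items = deque()
        self.cond = threading.Condition()

    def put(self, item):
        with self.cond:
            # producer waits while the buffer is full
            while len(self.items) == self.size:
                self.cond.wait()
            self.items.append(item)
            self.cond.notify_all()   # wake any waiting consumer

    def get(self):
        with self.cond:
            # consumer waits while the buffer is empty
            while not self.items:
                self.cond.wait()
            item = self.items.popleft()
            self.cond.notify_all()   # wake any waiting producer
            return item

buf = BoundedBuffer(4)
consumed = []
consumer = threading.Thread(
    target=lambda: consumed.extend(buf.get() for _ in range(10)))
consumer.start()
for i in range(10):       # producer blocks whenever the buffer is full
    buf.put(i)
consumer.join()
print(consumed)           # -> [0, 1, 2, ..., 9] (single consumer, FIFO order)
```

The `while` (not `if`) around each `wait()` is what prevents a woken thread from acting on a stale condition; the MPI broker gets the same safety from its explicit "extra checking" of the buffer state before inserting or extracting.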

// Question 1. Producer, Consumer, Broker in MPI

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include <mpi.h>

#define MAX_THREADS 256

// MSG type definition
#define WORK_MSG 1
#define ACK_MSG 2
#define REQ_WORK_MSG 3
#define CONS_MSG 4
#define TIME_OUT_MSG 5

// broker rank definition
#define BROKER_RANK 0

struct timeval tz;
struct timezone tx;

// size and rank of the communication world
int world_rank;
int world_size;

int nprod; // number of producers
int ncons; // number of consumers

int *buffer; // circular buffer for broker to insert into for producers or extract from consumers
int buffer_item_count;
int BUFFER_SIZE; // size of buffer
int head, tail;

int time_out = 0;
int time_out_recommendation = 0; // this is required to stop the consumers, this complication is because of the non-blocking receive of consumer
double start_time, end_time, total_time; // variable for calculating time

double MAX_TIME = 2; // maximum time the simulation will run

MPI_Status status;
MPI_Request request;

// Check whether buffer is free or not
int buffer_free() {
    if (buffer_item_count == BUFFER_SIZE) {
        return 0;
    }
    return 1;
}

// Check whether buffer is empty or not
int buffer_empty() {
    if (buffer_item_count > 0) {
        return 0;
    }
    return 1;
}

// Broker process body
void *broker(int world_rank, int* buffer) {
    start_time = MPI_Wtime();

    int number;
    int time_out_recommended = 0;

    int inserted = 0;
    int total_consumed = 0;

    // queue of waiting producers, indexed by rank
    int *prod_waiting;
    prod_waiting = malloc(sizeof(int) * world_size);
    memset(prod_waiting, 0, sizeof(int) * world_size); // no producer is waiting

    while (!time_out) {
        // if buffer free in broker, receive work msg from producer
        if (buffer_free()) {
            MPI_Recv(&number, 1, MPI_INT, MPI_ANY_SOURCE, WORK_MSG, MPI_COMM_WORLD, &status);
        }
        // if buffer is not empty in broker, receive work request msg from consumer
        else if (!buffer_empty()) {
            MPI_Recv(&number, 1, MPI_INT, MPI_ANY_SOURCE, REQ_WORK_MSG, MPI_COMM_WORLD, &status);
        }

        int source_rank = status.MPI_SOURCE;
        int msg_tag = status.MPI_TAG;

        // block for message handling starts
        // manage the work message from the producer
        if (msg_tag == WORK_MSG) {
            if (buffer_free()) { // extra checking for buffer_free
                // insert the received number from the producer into the buffer
                buffer[tail] = number;
                buffer_item_count++; // increase buffer item count
                inserted++;
                //printf("Broker-Prod. Inserted number %d(tag=%d) from producer = %d into buffer position %d\n", number, msg_tag, source_rank, tail);
                tail = (tail + 1) % BUFFER_SIZE;
                // send ack msg to the producer
                //printf("Broker-Prod. Sending ACK MSG for number %d to prod %d.\n", number, source_rank);
                MPI_Send(&number, 1, MPI_INT, source_rank, ACK_MSG, MPI_COMM_WORLD);
            }
            else {
                // if buffer is not free put the producer in the waiting producer queue
                prod_waiting[source_rank] = 1;
                //printf("Broker-Prod-Waiting. Buffer item count = %d\n", buffer_item_count);
                //printf("Broker-Prod-Waiting. prod %d waiting for ACK MSG.\n", source_rank);
            }
        }
        // manage the work request msg from the consumers
        else if (msg_tag == REQ_WORK_MSG) {
            if (!buffer_empty()) { // if buffer not empty, consume (extra checking)
                // extract the number for the consumer
                number = buffer[head];
                buffer_item_count--; // decrease buffer item count
                //printf("Broker-Cons. Extracted number %d(tag=%d) for consumer = %d from buffer position %d\n", number, msg_tag, source_rank, head);
                head = (head + 1) % BUFFER_SIZE;
                // send the extracted number to the consumer
                //printf("Broker-Cons. Sending extracted number %d to cons %d.\n", number, source_rank);
                MPI_Send(&number, 1, MPI_INT, source_rank, REQ_WORK_MSG, MPI_COMM_WORLD);
                // send the waiting producers ack msg
                int i;
                for (i = 0; i < world_size; i++) {
                    if (prod_waiting[i] == 1) {
                        buffer[tail] = number;
                        buffer_item_count++; // increase buffer item count
                        inserted++;
                        tail = (tail + 1) % BUFFER_SIZE;
                        // set the producer to be non-waiting
                        prod_waiting[i] = 0;
                        // send the ack msg to the producer
                        //printf("Broker-Cons-Waiting. Sending ACK MSG for number %d to prod %d\n", number, i);
                        MPI_Send(&number, 1, MPI_INT, i, ACK_MSG, MPI_COMM_WORLD);
                    }
                }
            }
        }
        // block for message handling ends

        // block for time out checking starts
        end_time = MPI_Wtime();
        total_time = end_time - start_time;
        if (total_time > MAX_TIME) {
            time_out = 1;
            //printf("Broker. Time out. \n");
        }
    }

    if (time_out == 1) {
        int i;
        number = -1;
        // notify all producers and consumers of the time out
        for (i = 1; i < world_size; i++) {
            MPI_Send(&number, 1, MPI_INT, i, TIME_OUT_MSG, MPI_COMM_WORLD);
        }
        // receive consumed count from all the consumers
        for (i = 2; i < world_size; i += 2) {
            int consumed_count = 0;
            MPI_Recv(&consumed_count, 1, MPI_INT, i, CONS_MSG, MPI_COMM_WORLD, &status);
            total_consumed += consumed_count;
        }

        printf("Broker. Total time = %lf s\n", end_time - start_time);
        printf("Broker. Total inserted = %d \n", inserted);
        printf("Broker. Total consumed = %d \n", total_consumed);
    }

    //printf ("Broker. Finished. \n");

    return NULL;
}

// producer process body
void *producer(int world_rank) {
    int number;

    srand(time(NULL) + world_rank);
    while (!time_out) {
        // producing random number
        number = rand() % 100 + 1;
        //printf("Prod. Produced random number %d by prod %d \n", number, world_rank);

        // send the produced number to the broker
        MPI_Send(&number, 1, MPI_INT, BROKER_RANK, WORK_MSG, MPI_COMM_WORLD);

        // blocking wait for acknowledgement message
        MPI_Recv(&number, 1, MPI_INT, BROKER_RANK, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

        int msg_tag = status.MPI_TAG;
        if (msg_tag == ACK_MSG) {
            // printf("Prod. Received ACK message for number %d by prod %d\n", number, world_rank);
        }
        else if (msg_tag == TIME_OUT_MSG) {
            //printf("Prod. Time out for prod %d. \n", world_rank);
            time_out = 1; // break out of loop
        }
    }

    //printf("Prod %d. Out of while loop.\n", world_rank);
    printf("Prod %d. Production finished.\n", world_rank);

    return NULL;
}

// consumer process body
void *consumer(int world_rank) {
    int number = 0;
    int consumed_count = 0;

    while (!time_out) {
        // Sending work request msg to broker
        MPI_Send(&number, 1, MPI_INT, BROKER_RANK, REQ_WORK_MSG, MPI_COMM_WORLD);
        //printf("Cons. Sent work req msg to broker by cons %d.\n", world_rank);

        int source_rank, msg_tag;

        // non-blocking receive for consuming the number
        MPI_Irecv(&number, 1, MPI_INT, BROKER_RANK, MPI_ANY_TAG, MPI_COMM_WORLD, &request);
        // testing until receive occurs
        while (1) {
            int flag = 0;
            MPI_Test(&request, &flag, &status);
            if (flag != 0) {
                source_rank = status.MPI_SOURCE;
                msg_tag = status.MPI_TAG;
                //printf("Cons. Consumed the work req msg (number %d) by cons %d\n", number, world_rank);
                break;
            }
        }

        if (source_rank == BROKER_RANK) {
            // if the request is done
            if (request == MPI_REQUEST_NULL) {
                //printf("The receive request is null now. \n");
                if (msg_tag == REQ_WORK_MSG) {
                    consumed_count++;
                    //printf("Cons. Consumed random number %d by cons = %d \n", number, world_rank);
                } else if (msg_tag == TIME_OUT_MSG) {
                    //printf("Cons. Time out for cons %d. \n", world_rank);
                    time_out = 1; // break out of loop
                }
            }
        }
    }

    // report the consumed count back to the broker
    MPI_Send(&consumed_count, 1, MPI_INT, BROKER_RANK, CONS_MSG, MPI_COMM_WORLD);

    //printf("Cons %d. Out of while loop.\n", world_rank);
    printf("Cons %d. Consumption finished.\n", world_rank);

    return NULL;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int w_s = sqrt((double)world_size);

    if (world_size % 2 != 0 || w_s * w_s != world_size || world_size < 4) {
        printf("Process number incorrect. \n");
        MPI_Finalize();
        return 0;
    }

    if (argc > 1) { // You can pass max time as the first argument
        MAX_TIME = atof(argv[1]);
    }

    if (world_rank == 0) { // broker processor
        nprod = world_size / 2;
        ncons = world_size / 2 - 1;
        BUFFER_SIZE = 2 * world_size / 2;
        buffer = (int *)malloc(sizeof(int) * BUFFER_SIZE);

        head = tail = 0;
        broker(world_rank, buffer);
    } else if (world_rank % 2 != 0) { // Producer processors, odd world rank
        producer(world_rank);
    } else { // Consumer processors, even world rank
        consumer(world_rank);
    }

    MPI_Finalize();
    return 0;
}

Producer Consumer Thread Programming

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>
#include "lifo_queue.h"
#include "rwlock.h"

#define MAX_THREADS 256

pthread_rwlock_t read_write_lock;

int num_threads;
int *buffer;

int ins[MAX_THREADS]; // insertions
int ext[MAX_THREADS]; // extractions
int total_insertions;
int total_extractions;
double ins_ext_per_sec;

Queue q;

/* other shared data structure */
struct task {
    int n;
};

int doneProducing(int ins) {
    if (ins >= 10 * QUEUE_SIZE / num_threads)
        return 1;

    return 0;
}

int doneConsuming(int ext) {
    if (ext >= 10 * QUEUE_SIZE / num_threads)
        return 1;

    return 0;
}

void create_task(int thread_no, int ins, struct task* t) {
    t->n = thread_no + ins * num_threads;
}

void process_task(struct task t, int thread_no, int ins) {
    if (ins % 100 == 0) {
        printf("Producer Thread no %d:\n", thread_no);
        printf("After %d insertions\n", ins);
    }
}

void *producer(void *producer_thread_data) {
    int thread_no, *ins_pointer;
    int inserted;
    int ins;

    struct task my_task;

    ins_pointer = (int *) producer_thread_data;
    thread_no = *ins_pointer;
    ins = 0;

    while (!doneProducing(ins)) {
        inserted = 0;
        create_task(thread_no, ins, &my_task);

        while (inserted == 0) {
            while (isFull(&q)) {
                ; // busy-wait until the queue has room
            }

            if (!isFull(&q)) {
                //printf("\nProducer thread no %d:\n", thread_no);
                insertIntoQueue(&q, my_task.n);
                inserted = 1;
                ins++;
            }
        }
    }
    *ins_pointer = ins;

    return NULL;
}

void *consumer(void *consumer_thread_data) {
    int thread_no, *ext_pointer;
    int extracted;
    int ext;

    struct task my_task;

    ext_pointer = (int *) consumer_thread_data;
    thread_no = *ext_pointer;

    ext = 0;

    while (!doneConsuming(ext)) {
        extracted = 0;
        while (extracted == 0) {
            while (isEmpty(&q)) {
                ; // busy-wait until the queue has an item
            }

            if (!isEmpty(&q)) {
                //printf("\nConsumer thread no %d:\n", thread_no);
                my_task.n = extractFromQueue(&q);
                extracted = 1;
                ext++;
            }
        }
    }
    *ext_pointer = ext;

    return NULL;
}

int main() {
    int i;
    struct timeval tz;
    struct timezone tx;

    double start_time, end_time;
    pthread_t p_threads_prod[MAX_THREADS];
    pthread_t p_threads_cons[MAX_THREADS];
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

    //pthread_rwlock_init(&read_write_lock, NULL);

    printf("Enter number of threads (use either of 2, 4, 8, 16, 32, 64, 128, 256): ");
    scanf("%d", &num_threads);

    int n_proc = num_threads;
    int n_prods = n_proc / 2;
    int n_cons = n_prods - 1;

    buffer = (int *) malloc(sizeof(int) * 2 * n_prods);

    gettimeofday(&tz, &tx);
    start_time = (double) tz.tv_sec + (double) tz.tv_usec / 1000000.0;

    init(&q); // initializing task_queue

    for (i = 0; i < n_prods; i++) {
        ins[i] = i;
        // create producer threads
        pthread_create(&p_threads_prod[i], &attr, producer, (void *) &ins[i]);
    }

    for (i = 0; i < n_cons; i++) {
        ext[i] = i;
        // create consumer threads
        pthread_create(&p_threads_cons[i], &attr, consumer, (void *) &ext[i]);
    }

    for (i = 0; i < n_prods; i++) {
        // join producer threads and add up all the insertions
        pthread_join(p_threads_prod[i], NULL);
        total_insertions += ins[i];
    }

    for (i = 0; i < n_cons; i++) {
        // join consumer threads and add up all the extractions
        pthread_join(p_threads_cons[i], NULL);
        total_extractions += ext[i];
    }

    gettimeofday(&tz, &tx);
    end_time = (double) tz.tv_sec + (double) tz.tv_usec / 1000000.0;

    ins_ext_per_sec = (total_insertions + total_extractions) / (end_time - start_time);
    printf("Producer/Consumer threads = %d\n", num_threads);
    printf("Total insertions = %d\n", total_insertions);
    printf("Total extractions = %d\n", total_extractions);
    printf("Total time = %lf s\n", end_time - start_time);
    printf("Throughput (# of insertions/extractions ) = %lf per sec\n", ins_ext_per_sec);

    return 0;
}