\frac{\left[\sum_{k=1}^{N}(y_k - \bar{y})\sin(2\pi f (t_k-\tau))\right]^2}{\sum_{k=1}^{N}\sin^2(2\pi f (t_k-\tau))}$

With evenly spaced data, two signals of different frequencies can produce identical sampled values, a phenomenon known as aliasing. That is why the classic periodogram is usually shown over the frequency range 0 to 0.5, as the rest is a mirror image. With the Lomb-Scargle periodogram, however, the aliasing effect can be significantly reduced.

## Maximum Likelihood Analysis

In the maximum likelihood method, we adjust the parameters of the model until we find the parameter values with which the model has the maximum probability (likelihood) of generating the data. To estimate the spectral power, we first select a false alarm probability and calculate the normalized periodogram. We identify the maximum peak and test it against the false alarm probability. If the maximum peak passes the false alarm test, we determine the amplitude and phase of the sinusoid representing the peak. Then we subtract that sinusoid from the data, which also removes the annoying sidelobes associated with the peak. After peak removal, the variance of the total record is reduced as well. With the residual data, we continue finding the next strongest peaks by the same procedure, stopping when a peak fails the false alarm test. The false alarm probability must be chosen carefully: if it is too low, we can miss significant peaks; if it is too high, we can mislabel noise as peaks.

## Maximum Entropy Method

It is assumed that the true power spectrum can be approximated by an equation that has a power series form. This method finds the spectrum that is closest to white noise (i.e. has the maximum randomness, or "entropy") while still having an autocorrelation function that agrees with the measured values, in the range for which there are measured values. It yields narrower spectral lines, and it is suitable for relatively smooth spectra.
With noisy input functions, choosing a very high model order can produce spurious peaks. This method should be used in conjunction with more conservative methods, like periodograms, to choose the correct model order and to avoid false peaks.

## Cross Spectrum and Coherency

If a climate proxy $a(t)$ is influenced or dominated by a driving force $b(t)$, we can use the cross spectrum to see if their amplitudes are similar. The cross spectrum is given by the product of the Fourier transforms: $C(f) = A(f) B^*(f)$, where $A$ is the Fourier transform of $a$ and $B^*$ is the complex conjugate of the Fourier transform of $b$. If we want to know whether two signals are in phase with each other, regardless of amplitude, we can take the cross spectrum, square its magnitude, and divide by the spectral powers of the individual signals: $c(f) = \frac{|C(f)|^2}{P_a(f) P_b(f)}$ Coherency measures only the phase relationship and is not sensitive to amplitude, which is a big drawback. Coherency is valuable if two time-varying signals stay in phase over a band of frequencies rather than at a single frequency. Therefore, a band of adjacent frequencies is used in an averaging process (denoted by angle brackets) to compute coherency: $coherency(f) = \gamma^2(f) = \frac{|\langle C(f) \rangle|^2}{\langle P_a(f) \rangle \langle P_b(f) \rangle}$

## Bispectra

In bispectra, the coherency relationship between several frequencies is used. A bispectrum shows a peak whenever (1) three frequencies $f_1$, $f_2$ and $f_3$ are present in the data such that $f_1 + f_2 = f_3$, and (2) the phase relationship between the three frequencies is coherent for at least a short averaging time for a band near these frequencies. If the nonlinear processes in a driving force (e.g. the eccentricity or inclination of the Earth's orbit) have coherent frequency triplets, then the response (i.e. climate) is likely to contain the same frequency triplets. For example, if $\delta ^{18}O$ is driven by eccentricity, we should be able to find the eccentricity triplet in it.
Thus, by comparing the bispectrum plot of climate proxy with the bispectrum plot of the driving forces, we can verify the influences of driving forces.
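As a rough numerical illustration of coherency, the sketch below uses SciPy's Welch-based `scipy.signal.coherence`, which averages over overlapping time segments rather than adjacent frequency bands, but illustrates the same idea. The signal names, frequency, and noise level are made up for the example:

```python
import numpy as np
from scipy.signal import coherence

rng = np.random.default_rng(2)
t = np.arange(2048)

# hypothetical driving force b(t) and a weaker, noisy, phase-shifted response a(t)
b = np.sin(2 * np.pi * 0.1 * t)
a = 0.5 * np.sin(2 * np.pi * 0.1 * t + 0.8) + 0.3 * rng.normal(size=t.size)

# magnitude-squared coherence, averaged over 50%-overlapping segments
f, Cxy = coherence(b, a, fs=1.0, nperseg=256)

# coherence stays near 1 at the shared frequency despite the amplitude difference
print(Cxy[np.argmin(np.abs(f - 0.1))])
```

Note that scaling `a` by any constant leaves the coherence unchanged, which is exactly the amplitude-insensitivity described above.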

## Monte Carlo Simulation of Background
Monte Carlo simulation is extremely useful for answering questions such as whether the data is properly tuned, whether the timescale is incorrect, whether some spectral power is leaking into adjacent frequencies, whether a peak has real structure, and for understanding structures near the base of a peak (a shoulder) in a spectral analysis. Generally the Monte Carlo simulation is run many times. For each run, a real signal (a sinusoidal wave) is generated, a random background signal is added, and the spectral power is calculated to look for shoulders. In this way, the frequency of shoulder occurrence can be measured and the effect of randomness assessed. It is important to create a background that behaves similarly to the background in the real data; a dissimilar background will lead to false conclusions. We also need to estimate the statistical significance of the peaks very carefully.
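As a minimal sketch of this idea (with a made-up sampling scheme, signal frequency, and noise level), one can repeatedly generate a sinusoid, add random background noise, and check how often the periodogram's highest peak lands at the true frequency; here `scipy.signal.lombscargle` stands in for the spectral estimator:

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(0)
n_sims = 200
f_true = 0.1                        # hypothetical signal frequency
freqs = np.linspace(0.01, 0.5, 500)
ang_freqs = 2 * np.pi * freqs       # lombscargle expects angular frequencies

peak_at_truth = 0
for _ in range(n_sims):
    # unevenly spaced sample times, as in real paleoclimate records
    t = np.sort(rng.uniform(0, 500, 200))
    # real signal plus random background noise
    y = np.sin(2 * np.pi * f_true * t) + rng.normal(0, 1.0, t.size)
    y -= y.mean()
    power = lombscargle(t, y, ang_freqs)
    if abs(freqs[np.argmax(power)] - f_true) < 0.01:
        peak_at_truth += 1

# fraction of runs in which the true peak dominated the spectrum
print(peak_at_truth / n_sims)
```

The same loop structure extends to shoulder hunting: instead of only checking the maximum, one would inspect the bins around the main peak in each simulated spectrum.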

(This article is a quick review of Fourier spectral analysis from the book "Ice Ages and Astronomical Causes: Data, Spectral Analysis and Mechanisms" by Richard A. Muller and Gordon J. MacDonald.)

## Statistics and Data Exploration: Quantiles, probability distribution, Box plot and Q-Q (Quantile-Quantile) plot


### Quantiles

What are quantiles in statistics?

If the data is sorted from small to large, quantiles are the cut points that divide the data/samples into equal-sized, adjacent subgroups. Every data sample has a maximum value, a minimum value, and a median value (the middle value after you sort the data). The middle value in the sorted data is the 50% quantile, because half of the data lies below that point and half above it. The 25% quantile is the cut point below which one quarter of the data lies. The IQR (inter-quartile range) covers the middle half of the data, containing the median: the values above the 25% cut point but below the 75% cut point.
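As a quick sketch in NumPy (the data sample here is made up):

```python
import numpy as np

# hypothetical sample; sorted it reads 3, 5, 7, 8, 12, 13, 14, 18, 21
data = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18])

q25, q50, q75 = np.percentile(data, [25, 50, 75])
iqr = q75 - q25

print(q50)   # 12.0 -- the median: half the values lie below, half above
print(q25)   # 7.0  -- a quarter of the values lie below this point
print(iqr)   # 7.0  -- the spread of the middle half of the data
```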

### Box Plot

A box plot can be a good representation of the quantiles. A box plot can take different shapes depending on the data. Here is an example:

### Example of Discrete/Continuous Probability Distributions

In the figure below, you can see different frequency distributions. The blue data sample has most of its data near the (0,1) interval; it is left skewed. Check how the blue box is shifted to the left. The green data sample is normally distributed, meaning most of the data points are centered around zero; it also looks balanced. We find the normal distribution very often in nature and in biological and social phenomena. The orange one shows an almost uniform distribution, where the data is spread across the range. And lastly, there is a right-skewed sample. These are all discrete data points with discrete probability distributions. There are also many well-known continuous probability distributions with continuous probability density functions (https://en.wikipedia.org/wiki/List_of_probability_distributions#Continuous_distributions).

###### (Image source: https://www.otexts.org/node/620)

Below we can see the quantiles for the normal distribution: the cut points that divide the continuous range of values into regions of equal probability. The area over an interval (on the x axis) under a continuous probability density function (like the normal density below) represents the probability of the data falling into that range. In this case, the IQR is the blue box; a data point has a 50% probability of falling in that interval.

### Q-Q plot

We can use a Q-Q plot to graphically compare two probability distributions. Q-Q plot stands for quantile vs. quantile plot. In Q-Q plotting, we basically compute probabilities from the data, assume a certain distribution (e.g. a normal, gamma or Poisson distribution), and compare the resulting quantiles with the theoretical ones. The steps used in Q-Q plotting are:

1. Sort the data points from small to large
2. For n data points, find n equally spaced points which serve as the probability using $\frac{k}{n+1}$ where $k=1, 2, ..., n$
3. Look at the data points, possibly plot them, and assume an underlying probability distribution. Using the probabilities from step 2, you can now calculate the quantiles. In the R language, for example, you can use quantile functions like qnorm, qgamma or qunif from the stats package.
4. Now plot the quantiles calculated in step 3 on the x axis and the sorted data points on the y axis. If the points stay close to the $y=x$ line, your assumed probability distribution was a good fit.
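The steps above can be sketched in Python with SciPy (using simulated data, and assuming a normal distribution in step 3):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# step 1: sort the data points from small to large
data = np.sort(rng.normal(loc=50, scale=10, size=200))

# step 2: n equally spaced probabilities k/(n+1), k = 1..n
n = data.size
probs = np.arange(1, n + 1) / (n + 1)

# step 3: theoretical quantiles under the assumed (normal) distribution
theo = stats.norm.ppf(probs, loc=data.mean(), scale=data.std())

# step 4: plot theo (x) against data (y); points near y = x support the assumption
corr = np.corrcoef(theo, data)[0, 1]
print(round(corr, 3))   # close to 1 for a good fit
```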

Below you can see one example, where the normal distribution is assumed for the ozone data. You can then see that the gamma distribution fits the ozone data better than the normal distribution. This is how you can check different probability distributions for your data using a simple Q-Q plot. There is a fantastic Q-Q plot tutorial from which I collected the above image; for further reading, please check https://www.r-bloggers.com/exploratory-data-analysis-quantile-quantile-plots-for-new-yorks-ozone-pollution-data/

## Chance

It’s 2018. The first book to try to formulate probability, or chance, was written by Abraham de Moivre exactly 300 years ago, in 1718.

If you roll two dice, can you get two numbers that sum to 14? No, because the maximum number on each die is 6. But can you get 11? Yes, that’s possible. How probable is it? Can we measure it? These are some basic probability questions that every science student learns in high school. But the underlying philosophy here is striking:

“Not everything is equally probable. Some things are more likely than others. Some arguments and ideas are better than others, based on established ideas or agreed-upon assumptions. How do we measure that?”
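The two-dice question can be settled by brute-force enumeration of the 36 equally likely outcomes, a quick Python check:

```python
from itertools import product

# all 36 rolls of two fair dice
outcomes = list(product(range(1, 7), repeat=2))

p_sum_14 = sum(a + b == 14 for a, b in outcomes) / len(outcomes)
p_sum_11 = sum(a + b == 11 for a, b in outcomes) / len(outcomes)

print(p_sum_14)  # 0.0 -- impossible
print(p_sum_11)  # only (5,6) and (6,5) work: 2/36, about 0.056
```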

What are your chances of waking up tomorrow? Is it 100%? What’s the chance that your relationship will last? Are these even measurable? What does it mean when you say “I hope”? Does that automatically translate into a highly probable future? Why don’t you win that million dollar lottery? Can you win some bucks in Vegas playing poker or blackjack? 😉 What’s the chance that you will win the game? It wasn’t easy for mathematicians and statisticians to formulate what we really mean by what’s probable and what’s expected in the world of uncertainty we live in and deal with. And in science, who doesn’t deal with some basic probability in their data?

I absolutely admire the talk by Dr. Ana at Lawrence Livermore National Lab on “Understanding the world through statistics,” which introduced me to this book.

“The best thing about being a statistician is that you get to play around everyone’s backyard.” – The great statistician John Tukey.

Maybe I can make a new phrase for computer programmers too..haha.

“The best thing about being a computer programmer is that you get to make toys for everyone.” LoL.

Basketball statistics and why Stephen Curry attempts more 3 point shots than 2 point shots:

## Looking Inside Eminem’s Lyrics – part 1

I started analyzing the lyrics of Eminem. My initial interest was in finding the most common words Eminem has used in his lyrics. I collected the names of all 287 of his songs from this link. Then I collected the lyrics using the python songtext library, which fetches lyrics from lyricwiki.org. With my python code, I managed to gather lyrics for 223 of those 287 songs. After gathering the lyrics, I had to process the text. Normally in language processing tasks we do chunking, lemmatization, stemming, and spelling error checks; I used the NLTK library for all of these. Actually, I had to skip lemmatization, as it was chopping up lots of interesting words in their existing forms. I also created a banned-words list, as Eminem uses a lot of words like ‘na’, ‘wa’ and ‘oh’ that don’t carry much semantic meaning. Then I used NLTK’s word frequency method to find the frequency of each word. The top 20 words were:

[(u’like’, 1375), (u’get’, 1049), (u’got’, 907), (u’shit’, 740), (u”’cause”, 729),

(u’know’, 701), (u’back’, 674), (u’fuck’, 671), (u’eminem’, 593), (u’go’, 557),

(u’see’, 514), (u’one’, 497), (u’say’, 476), (u’never’, 430), (u’bitch’, 428),

(u’man’, 428), (u’let’, 422), (u’time’, 411), (u’come’, 392), (u’think’, 361)]

And yeah, apparently Eminem has cursed a lot in his songs, as you can see in the plot below for the words “shit” (rank 4), “fuck” (rank 8) and “bitch” (rank 15).

The word ‘love’ has been used 282 times, just a bit less than the word ‘ass’, which was used 295 times. You can see the word ‘dre’ has been used a lot; it most likely refers to Dr. Dre, who worked with Eminem. The word ‘man’ is used more than the word ‘girl’. The word ‘hate’ is used less than the word ‘love’, only 116 times. Here are two more plots of word frequency.

Anyway, a simple bag of words probably doesn’t give a good representation of a particular song. For example, the word love can appear in the sentence “I love you” but also in “I don’t love you”, which has the completely opposite meaning, yet here both are counted together. Before any contextual analysis, I was thinking about doing another frequency analysis according to Russell’s model of mood. You basically divide the xy-plane into four orthogonal regions, as you can see in the image below.

I want to see where Eminem’s music in general falls in this emotional plane. There is more interesting analysis I can do later using word vectors and other newer NLP techniques. I’ll eventually look into other artists and other genres and try to find out whether there are different patterns in how words are chosen and in the kinds of emotion certain songs may generate.

Code for getting Lyrics:

import lxml
from lxml import html
import requests

import pickle

import numpy as np
import libsongtext
from libsongtext import lyricwiki
from libsongtext import utils

import pprint as pp

artist_name = 'eminem'

url = 'http://www.spin.com/2014/10/eminem-every-song-ranked/'
f = requests.get(url)

# parse the article page and pull out the song titles
tree = html.fromstring(f.content)

song_name_xpath = tree.xpath('//div[@class="article-content clearfix"]//strong/a')
song_names = []

num = 1
lyrics_list = {}
lyrics_not_found_list = []
success_lyrics_cnt = 0
for s in song_name_xpath:
    song = ''.join(s.text.encode('ascii', 'ignore').strip())
    song_names.append(song)

    print 'No. ' + str(num)
    num = num + 1
    print 'track : ' + song

    try:
        args = {}
        args['artist'] = artist_name
        args['title'] = song.strip()
        title = args['title']
        if not lyrics_list.get(title):
            t = lyricwiki.LyricWikiSong(args)
            lyrics = t.get_lyrics()
            print "Got Lyrics."
            lyrics_list[title] = lyrics
            success_lyrics_cnt += 1
    except Exception:
        lyrics_not_found_list.append(song)
        print "Failed to get Lyrics."

print 'Successfully got ', success_lyrics_cnt, ' lyrics out of ', len(song_name_xpath), ' tracks'

def save_obj(obj, path, name):
    with open(path + name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

save_obj(lyrics_list, '/Users/andy/Documents/projects/music/lyrics_database/eminem/', 'eminem_song_lyrics')



Code for word frequency analysis in Lyrics:

import pickle
import string

import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from nltk import sent_tokenize
from nltk.corpus import stopwords

import enchant
from enchant.checker import SpellChecker

eng_dict = enchant.Dict("en_US")

#import lyrics of eminem
eminem_lyrics_pickle_file = '/Users/andy/Documents/projects/music/lyrics_database/eminem/eminem_song_lyrics.pkl'
f = open(eminem_lyrics_pickle_file, 'rb')
lyrics_list = pickle.load(f)
f.close()

song_names = lyrics_list.keys()
lyrics = lyrics_list.values()

# english words
#words = set(nltk.corpus.words.words())

#stemmer
porter = nltk.PorterStemmer()
wnl = nltk.WordNetLemmatizer()

def plot_freqdist_freq(fd,
                       max_num=None,
                       cumulative=False,
                       title='Frequency plot',
                       linewidth=2):
    """
    As of NLTK version 3.2.1, FreqDist.plot() plots the counts
    and has no kwarg for normalising to frequency.
    Work around this here.

    INPUT:
    - fd: the FreqDist object
    - max_num: if specified, only plot up to this number of items
      (they are already sorted descending by the FreqDist)
    - cumulative: bool (defaults to False)
    - title: the title to give the plot
    - linewidth: the width of line to use (defaults to 2)
    OUTPUT: plot the freq and return None.
    """
    tmp = fd.copy()
    norm = fd.N()
    for key in tmp.keys():
        tmp[key] = float(fd[key]) / norm

    if max_num:
        tmp.plot(max_num, cumulative=cumulative,
                 title=title, linewidth=linewidth)
    else:
        tmp.plot(cumulative=cumulative,
                 title=title,
                 linewidth=linewidth)

    return

stem_tokens = ['ed', 'ies', '\'s' , 'n\'t', '\'m', '--', '\'\'']
banned_words = ['ha', 'wa', 'ta', 'u', 'i', 'ai', 'na', 'ca', '...', '..', '\'em', '\'en', 'wan', '', '',
'oh', 're', '\'re', '\'ne', 'yea', 'yeah', 'ya', 'yah', '\'ve', '\'d', 'wo', 'oh', 'ooh',
'\'ll', 'yo', 'is\\u2026', 'ah', 'wit', 'would', '\\u2019']

#['i\'ma', 'y\'ll']

from nltk.corpus import wordnet as wn  # needed by synonyms()

def synonyms(word):
    syns = []
    for word in wn.synsets(word):
        sim_words = word.similar_tos()
        sim_words += word.lemma_names()
        for sim in sim_words:
            s = sim
            if hasattr(s, '_name'):
                # keep only the lemma part of names like 'happy.a.01'
                s = sim._name.split(".")[0]
            syns.append(s)

    syns = set(syns)
    return syns

def stem(word):
    # drop banned words entirely, then strip a known suffix if present
    if word in banned_words:
        return False
    for suffix in stem_tokens:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

lyrics_edited = []
chkr = SpellChecker("en_US")

edited_tokens = []
i = 1
for s, l in lyrics_list.items():
    print i, ". Processing song: \"", s, "\""
    i += 1
    # find wrongly spelled words
    chkr.set_text(l)
    err_words = []
    for err in chkr:
        err_words.append(err.word)

    # tokenize
    tokens = word_tokenize(l)
    l_txt = nltk.Text(tokens)

    for t in tokens:
        tn = t.lower()
        #tn = porter.stem(t)
        #tn = wnl.lemmatize(tn)

        tn = stem(tn)
        if tn and tn not in err_words and tn not in stopwords.words('english') and tn not in list(string.punctuation):
            edited_tokens.append(tn)

uniq_tokens = set(edited_tokens)

fdist = nltk.FreqDist(edited_tokens)

#Russell's model of mood
mood_happy_words = ['Exhilarated', 'Excited', 'Happy', 'Pleasure']
mood_h = []
for ws in mood_happy_words:
    for w in synonyms(ws):
        mood_h.append(w)

mood_h = list(set(mood_h))

mood_angry_words = ['Anxious', 'Angry', 'Terrified', 'Disgusted']
mood_a = []
for ws in mood_angry_words:
    for w in synonyms(ws):
        mood_a.append(w)

mood_a = list(set(mood_a))

# sad-quadrant seed words (the original list was missing here; these are example words)
mood_sad_words = ['Sad', 'Depressed', 'Bored', 'Fatigued']
mood_s = []
for ws in mood_sad_words:
    for w in synonyms(ws):
        mood_s.append(w)

mood_s = list(set(mood_s))

mood_relaxed_words = ['Relaxed', 'Serene', 'Tranquil', 'Calm']
mood_r = []
for ws in mood_relaxed_words:
    for w in synonyms(ws):
        mood_r.append(w)

mood_r = list(set(mood_r))
