Statistics and Data Exploration: Quantiles, probability distribution, Box plot and Q-Q (Quantile-Quantile) plot

Quantiles

What are quantiles in statistics?

When the data is sorted from smallest to largest, quantiles are the cut points that divide the samples into equal-sized, adjacent subgroups. Every data sample has a maximum value, a minimum value, and a median value (the middle value of the sorted data). The median is the 50% quantile, because half of the data lies below that point and half above it. The 25% quantile is the cut point below which one quarter of the data lies. The IQR (interquartile range) is the range between the 25% and 75% quantiles: it contains the middle half of the data, including the median, lying above the lowest 25% of the data points and below the highest 25%.
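To make the definitions concrete, here is a minimal sketch that computes the quartiles and the IQR with Python's standard `statistics` module. The sample values are made up for illustration, and the `"inclusive"` method is only one of several common conventions for interpolating quantiles, so other tools may report slightly different cut points.

```python
import statistics

# Hypothetical sample, made up for illustration only.
data = [7, 1, 3, 9, 5, 11, 13, 15, 17, 19, 21]

# Quantiles are defined on the sorted data.
sorted_data = sorted(data)

# With n=4, statistics.quantiles returns the three cut points that split the
# data into four groups: the 25%, 50% and 75% quantiles.
# method="inclusive" treats the sample min and max as the 0% and 100% quantiles.
q1, median, q3 = statistics.quantiles(sorted_data, n=4, method="inclusive")

# The interquartile range spans the middle half of the data.
iqr = q3 - q1

print("min:", sorted_data[0], "max:", sorted_data[-1])
print("25% quantile:", q1)
print("50% quantile (median):", median)
print("75% quantile:", q3)
print("IQR:", iqr)
```

The three cut points plus the minimum and maximum are the values a box plot draws.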

Box Plot

A box plot is a good way to show the quantiles. It can take different shapes depending on the data. Here is an example:

Screen Shot 2018-04-16 at 10.05.34 AM

(image source: www.physics.csbsju.edu/stats/box2.html)

Example of Discrete/Continuous Probability Distribution

In the figure below, you can see different frequency distributions. The blue data sample has most of its data near the (0,1) interval; notice how the blue box is shifted to the left. With the mass on the left and the long tail stretching to the right, this shape is conventionally called right-skewed (positively skewed). The green data sample is normally distributed, meaning most of the data points are centered around zero and the shape looks balanced. We very often find the normal distribution in nature and in biological and social phenomena. The orange one shows an almost uniform distribution, where the data is spread across the range. And lastly, there is a sample skewed in the opposite direction. These are all samples of discrete data points, each with a discrete frequency distribution. There are also many well-known continuous probability distributions, each described by a continuous probability density function (https://en.wikipedia.org/wiki/List_of_probability_distributions#Continuous_distributions).

Screen Shot 2018-04-16 at 10.06.55 AM

  (Image source: https://www.otexts.org/node/620)

Below we can see the quantiles for the normal distribution: the cut points that divide the continuous range of values into intervals of equal probability, that is, equal area under the curve. The area under a continuous probability density function (like the normal density below) over an interval on the x axis represents the probability of the data falling into that interval. In this case, the IQR is the blue box; a data point falls into that interval with 50% probability.
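The claim that the IQR covers 50% of the probability can be checked numerically. The sketch below assumes a standard normal distribution (mean 0, standard deviation 1), which is a choice made here for illustration; it uses `NormalDist` from Python's standard `statistics` module, where `inv_cdf` gives a quantile and `cdf` gives the area under the density to the left of a point.

```python
from statistics import NormalDist

# Assumption for illustration: a standard normal distribution.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Quantiles are obtained from the inverse of the cumulative distribution function.
q1 = std_normal.inv_cdf(0.25)  # 25% quantile
q3 = std_normal.inv_cdf(0.75)  # 75% quantile

# Area under the density between q1 and q3 = probability of landing in the IQR.
prob_in_iqr = std_normal.cdf(q3) - std_normal.cdf(q1)

print(f"25% quantile: {q1:.4f}")
print(f"75% quantile: {q3:.4f}")
print(f"Probability inside the IQR: {prob_in_iqr:.4f}")
```

For the standard normal the two quartiles are symmetric around zero, at roughly -0.67 and 0.67.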

Screen Shot 2018-04-16 at 9.57.22 AM

Q-Q plot

We can use a Q-Q plot to graphically compare two probability distributions. Q-Q stands for quantile-quantile. In Q-Q plotting, we assume a certain distribution for the data (e.g. the normal, gamma or Poisson distribution), compute the theoretical quantiles of that distribution, and compare them with the quantiles of the data. The steps used in Q-Q plotting are:

  1. Sort the data points from smallest to largest.
  2. For n data points, compute n equally spaced probabilities \frac{k}{n+1}, where k=1, 2, ..., n.
  3. Look at the data points, possibly plot them, and assume an underlying probability distribution. Using the probabilities from step 2, calculate the quantiles of that distribution. In R, for example, you can use quantile functions such as qnorm, qgamma or qunif from the stats package.
  4. Plot the quantiles calculated in step 3 on the x axis against the sorted data points on the y axis. If the points stay close to the y=x line (with the distribution's parameters estimated from the data), the assumed probability distribution is a good fit for the data.
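The four steps can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch and not the code behind the plots in this post: the sample is synthetic (drawn from a normal distribution with a fixed seed), the normal distribution is the assumed model, and `NormalDist.inv_cdf` plays the role that qnorm plays in R. Drawing the plot itself is left out; the code only builds the (x, y) pairs that would be plotted.

```python
import random
from statistics import NormalDist, mean, stdev

# Synthetic sample for illustration only (not the ozone data).
random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(50)]

# Step 1: sort the data points from smallest to largest.
sorted_data = sorted(data)
n = len(sorted_data)

# Step 2: n equally spaced probabilities k / (n + 1) for k = 1..n.
probs = [k / (n + 1) for k in range(1, n + 1)]

# Step 3: assume a normal distribution, with mean and standard deviation
# estimated from the sample, and compute its quantiles at those probabilities.
assumed = NormalDist(mu=mean(sorted_data), sigma=stdev(sorted_data))
theoretical = [assumed.inv_cdf(p) for p in probs]

# Step 4: pair theoretical quantiles (x axis) with the sorted data (y axis).
qq_points = list(zip(theoretical, sorted_data))

# Crude numeric stand-in for eyeballing the plot:
# how far does any point fall from the y = x line?
max_gap = max(abs(y - x) for x, y in qq_points)

for x, y in qq_points[:5]:
    print(f"theoretical: {x:7.3f}   observed: {y:7.3f}")
print(f"largest distance from y = x: {max_gap:.3f}")
```

To test a different assumption, only step 3 changes: swap in the quantile function of the other distribution (for a gamma distribution this needs a library beyond the standard one).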

Below you can see one example, where the normal distribution is assumed for the ozone data.

Screen Shot 2018-04-16 at 10.35.57 AM

Here you can see that the gamma distribution fits the ozone data better than the normal distribution.

Screen Shot 2018-04-16 at 10.37.45 AM

This is how you can check different probability distributions against your data using a simple Q-Q plot. The Q-Q plot images above are collected from a fantastic tutorial. For further reading, please check https://www.r-bloggers.com/exploratory-data-analysis-quantile-quantile-plots-for-new-yorks-ozone-pollution-data/

 
