# Statistics and Data Exploration: Quantiles, probability distribution, Box plot and Q-Q (Quantile-Quantile) plot

### Quantiles

What are quantiles in statistics?

When the data are sorted from smallest to largest, quantiles are the cut points that divide the samples into adjacent subgroups of equal size. Every sample has a minimum, a maximum, and a median (the middle value of the sorted data). The median is the 50% quantile, because half of the data lie below it and half above. The 25% quantile is the cut point below which one quarter of the data lie. The inter-quartile range (IQR) is the interval between the 25% and 75% quantiles: it contains the median and the middle half of the data, that is, the values above the lowest 25% and below the highest 25%.
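As a quick illustration of these definitions, here is a minimal sketch using Python's standard `statistics` module. The sample values are made up, and the `method="inclusive"` choice is an assumption; other quantile conventions give slightly different cut points on small samples.

```python
import statistics

# Hypothetical sample, made up for illustration only.
data = [7, 1, 3, 9, 5, 11, 13, 15]

ordered = sorted(data)
median = statistics.median(ordered)

# With n=4, statistics.quantiles returns the three quartile cut points.
# method="inclusive" treats the sample as the whole population; the default
# ("exclusive") would give slightly different values here.
q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
iqr = q3 - q1

# Count how many points fall inside the inter-quartile range.
inside_iqr = [x for x in ordered if q1 <= x <= q3]

print(f"min={ordered[0]}, Q1={q1}, median={median}, Q3={q3}, max={ordered[-1]}")
print(f"IQR={iqr}, points inside IQR: {len(inside_iqr)} of {len(ordered)}")
```

For this sample the 50% quantile coincides with the median, and half of the points fall between the 25% and 75% quantiles.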

### Box Plot

A box plot is a good way to show the quantiles. It can take different shapes depending on the data. Here is an example:
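To make the link between the data and the shape of the box concrete, the sketch below computes the five numbers a basic box plot is drawn from (minimum, the three quartiles, maximum). The helper name `five_number_summary` and the sample are hypothetical, and the whiskers are assumed to run to the minimum and maximum, which is only one of several conventions.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sample of at least two values."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

# Hypothetical sample with a long tail toward large values.
sample = [1, 2, 2, 3, 3, 4, 5, 8, 13]
low, q1, med, q3, high = five_number_summary(sample)

# The box spans Q1..Q3 and the line inside it marks the median.
lower_half_of_box = med - q1
upper_half_of_box = q3 - med
lower_whisker = q1 - low
upper_whisker = high - q3

print(f"min={low}, Q1={q1}, median={med}, Q3={q3}, max={high}")
print(f"box halves: {lower_half_of_box} vs {upper_half_of_box}")
print(f"whiskers:   {lower_whisker} vs {upper_whisker}")
```

In this sample the upper half of the box and the upper whisker are both longer than their lower counterparts, which is how a long right tail shows up in a box plot.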

### Example of Discrete/Continuous Probability Distribution

In the figure below, you can see different frequency distributions. The blue sample has most of its data in the (0, 1) interval, so the blue box is shifted to the left. Note that skew is named after the direction of the tail: a sample whose mass sits on the left with a longer tail to the right is conventionally called right-skewed (positively skewed). The green sample is normally distributed: it is symmetric and bell-shaped, here with most of the data points centered around zero. Normal distributions appear very often in nature and in biological and social phenomena. The orange sample is almost uniformly distributed, with the data spread evenly across the range. The last sample is skewed in the opposite direction to the blue one. These are all discrete data points with discrete (empirical) frequency distributions. There are also many well-known continuous probability distributions, each with a continuous probability density function (https://en.wikipedia.org/wiki/List_of_probability_distributions#Continuous_distributions).
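The direction of skew can also be checked numerically. The sketch below is not taken from the figure: the three samples are made up, and `sample_skewness` is a hypothetical helper implementing the moment-based coefficient $m_3 / m_2^{3/2}$, one of several skewness estimators. A positive value indicates a longer tail to the right, a negative value a longer tail to the left.

```python
import statistics

def sample_skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical samples, made up for illustration.
tail_right = [0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 1.5, 3.0, 6.0]  # mass on the left
symmetric = [-2, -1, -1, 0, 0, 0, 1, 1, 2]
tail_left = [-x for x in tail_right]                         # mirror image

for name, sample in [("tail_right", tail_right),
                     ("symmetric", symmetric),
                     ("tail_left", tail_left)]:
    print(f"{name:10s} skewness={sample_skewness(sample):+.3f} "
          f"mean={statistics.fmean(sample):+.3f} "
          f"median={statistics.median(sample):+.3f}")
```

A simpler rule of thumb visible in the same output: when the tail is on the right, the mean is pulled above the median, and the reverse holds for a tail on the left.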

###### (Image source: https://www.otexts.org/node/620)

Below we can see the quantiles of the normal distribution: the cut points that divide the continuous range of values into regions of equal probability. The area under a continuous probability density function (like the normal density below) over an interval on the x axis is the probability that a data point falls into that interval. In this case, the IQR is the blue box; a data point falls into that interval with 50% probability.
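The same statement can be checked numerically. This is a minimal sketch for the standard normal distribution (mean 0, standard deviation 1, chosen here as an assumption) using `statistics.NormalDist` from Python's standard library: `inv_cdf` gives the quantile for a probability, and `cdf` gives the area under the density to the left of a point.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Quartiles: the points below which 25%, 50% and 75% of the probability lies.
q1 = std_normal.inv_cdf(0.25)
q2 = std_normal.inv_cdf(0.50)
q3 = std_normal.inv_cdf(0.75)

# Area under the density between Q1 and Q3, i.e. the probability of the IQR.
prob_in_iqr = std_normal.cdf(q3) - std_normal.cdf(q1)

print(f"Q1={q1:.4f}, median={q2:.4f}, Q3={q3:.4f}")
print(f"P(Q1 <= X <= Q3) = {prob_in_iqr:.4f}")
```

The quartiles come out symmetric around zero (about -0.6745 and 0.6745), and the area between them is one half.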

### Q-Q plot

We can use a Q-Q plot to graphically compare two probability distributions. Q-Q stands for Quantile vs. Quantile. In a Q-Q plot, we assume a certain distribution for the data (e.g. a normal, gamma or Poisson distribution), compute the theoretical quantiles of that distribution, and compare them with the quantiles of the data. The steps are:

1. Sort the data points from small to large.
2. For n data points, find n equally spaced points that serve as the probabilities, using $\frac{k}{n+1}$ where $k=1, 2, ..., n$.
3. Look at the data points, possibly plot them, and assume an underlying probability distribution. Using the probabilities from step 2, calculate the theoretical quantiles of that distribution. In R, for example, you can use quantile functions such as qnorm, qgamma or qunif from the stats package.
4. Plot the theoretical quantiles from step 3 on the x axis against the sorted data points on the y axis. If the points stay close to the $y=x$ line (with the distribution's parameters estimated from the data), the assumed probability distribution is a plausible fit.
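The four steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind the plots in this post: the helper `normal_qq_points` and the sample are hypothetical, the assumed distribution is a normal one with mean and standard deviation estimated from the sample, and `NormalDist.inv_cdf` plays the role that qnorm plays in R. The points are printed rather than plotted to keep the example free of third-party libraries.

```python
from statistics import NormalDist, fmean, stdev

def normal_qq_points(values):
    """Return (theoretical_quantile, observed_value) pairs for a normal Q-Q plot."""
    # Step 1: sort the data points from small to large.
    ordered = sorted(values)
    n = len(ordered)
    # Step 2: n equally spaced probabilities k / (n + 1), k = 1..n.
    probs = [k / (n + 1) for k in range(1, n + 1)]
    # Step 3: assume a normal distribution fitted to the sample and
    # compute its quantiles at those probabilities.
    fitted = NormalDist(mu=fmean(ordered), sigma=stdev(ordered))
    theoretical = [fitted.inv_cdf(p) for p in probs]
    # Step 4: pair x (theoretical quantile) with y (sorted data point).
    return list(zip(theoretical, ordered))

# Hypothetical sample, made up for illustration only.
sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.5, 5.6, 5.0, 4.9, 5.4]
points = normal_qq_points(sample)

print("theoretical  observed")
for theo, obs in points:
    print(f"{theo:11.3f}  {obs:8.3f}")
```

Passing the pairs to any plotting library as a scatter plot, together with the line $y=x$, gives the Q-Q plot. Swapping the fitted distribution for another one (for example a gamma distribution) only changes step 3.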

Below you can see one example, where the normal distribution is assumed for the ozone data.

Now you can see that the gamma distribution fits the ozone data better than the normal distribution.

This is how you can check different probability distributions against your data using a simple Q-Q plot. There is a fantastic Q-Q plot tutorial from which I collected the above image. For further reading, please check https://www.r-bloggers.com/exploratory-data-analysis-quantile-quantile-plots-for-new-yorks-ozone-pollution-data/