Describe your data using Python

This is the first of a series of articles that I will write to give a gentle introduction to statistics. In this article we will introduce some basic statistical concepts and learn how to use basic statistics to help you describe your data.

We will cover the following topics in this article:

The difference between a population and a sample
The difference between Descriptive and Inferential statistics
Different types of variables
Types of descriptive statistics
Normal or Gaussian distribution

The difference between a population and a sample:

A population is a large group of elements that share at least one common feature; it is the complete set of observations.
A sample is a finite subset of observations drawn from a population. We usually obtain a sample from the population in one of the following ways:
- Representative sampling - here the sample’s characteristics are similar to the population characteristics
  - A simple random sample is the most common approach to obtain a representative sample
  - A systematic random sample
  - A cluster random sample
  - A stratified random sample
- Convenience sampling - here we collect the sample from a section of the population that is easily available, so it may not represent the population well
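The sampling approaches listed above can be sketched with Python's standard `random` module. This is a minimal illustration on a made-up population of 1,000 students; the `grade` field, the sample size of 100, and all variable names are assumptions chosen for the example.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 students, each tagged with a made-up grade level
population = [
    {"id": i, "grade": "junior" if i % 4 == 0 else "senior"}
    for i in range(1000)
]
sample_size = 100

# Simple random sample: every element has the same chance of being picked
simple_sample = random.sample(population, k=sample_size)

# Systematic random sample: random starting point, then every step-th element
step = len(population) // sample_size
start = random.randrange(step)
systematic_sample = population[start::step]

# Stratified random sample: sample each stratum in proportion to its size
strata = {}
for person in population:
    strata.setdefault(person["grade"], []).append(person)
stratified_sample = []
for members in strata.values():
    n = round(len(members) * sample_size / len(population))
    stratified_sample.extend(random.sample(members, k=n))

# Convenience sample: simply take whatever is easiest to reach (here, the first 100)
convenience_sample = population[:sample_size]
```

In this toy population every fourth student is a junior, so the convenience sample happens to look balanced; with real data, the easily available section is often not typical of the whole.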

The difference between Descriptive and Inferential statistics:

Descriptive statistics - it is all about organizing, describing and summarizing data
- Exploratory data analysis (EDA)
  - measures of location - such as Mean, Median, Mode
  - measures of variability or dispersion - such as Variance, Standard deviation, Range, Interquartile range (IQR)
Inferential statistics - it is all about drawing conclusions about a population from the analysis of a random sample drawn from that population
- Exploratory modelling - how is x related to y?
- Predictive modelling - if you know x, can you predict y?

Different types of variables:

Quantitative
- Discrete: a variable whose value is obtained by counting. Example: the number of students in a class
- Continuous: a variable whose value is obtained by measuring. Example: the height of each student in a class
  - Interval: a scale of measurement where the data is ordered and the spacing between values is meaningful, but there is no true zero. Example: temperature in degrees Celsius
  - Ratio: a scale of measurement with all the properties of an interval scale plus a true zero, so ratios between values are meaningful. Example: height
Qualitative or Categorical
- Nominal: categories with no natural order. Example: gender - female or male
- Ordinal: categories with a natural order. Example: size - small, medium, or large
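The practical difference between nominal and ordinal data shows up in which operations make sense. The sketch below uses made-up survey answers (all names and values are hypothetical): nominal values can only be counted, while ordinal values can also be sorted by rank.

```python
from collections import Counter

# Hypothetical survey answers
genders = ["female", "male", "female", "female", "male", "female"]  # nominal
sizes = ["small", "large", "medium", "small", "large", "small"]     # ordinal

# Nominal: counting categories is meaningful, ordering them is not
gender_counts = Counter(genders)

# Ordinal: categories have a rank, so we can sort by an explicit order
size_order = {"small": 0, "medium": 1, "large": 2}
sorted_sizes = sorted(sizes, key=size_order.get)
# sorted_sizes is ['small', 'small', 'small', 'medium', 'large', 'large']
```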

Types of descriptive statistics:

Measures of location: mainly measures of central tendency
- Mean: sum of all values divided by the number of values
```
import seaborn as sns
tips = sns.load_dataset('tips')
tips.mean(numeric_only=True) # shows mean of all numeric variables; numeric_only skips the categorical columns
```
- Median: middle value in a given sequence of values ordered by rank
```
tips.median(numeric_only=True) # shows median of all numeric variables
```
- Mode: most frequent value in a set of values
```
tips.mode() # shows mode of all variables
```
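The three measures of location react very differently to extreme values. As a small worked example on made-up bill amounts (not the `tips` data), the standard `statistics` module shows how a single outlier pulls the mean far more than the median.

```python
import statistics

bills = [10, 12, 12, 14, 16]        # made-up bill amounts
bills_with_outlier = bills + [200]  # the same bills plus one unusually large one

mean_before = statistics.mean(bills)      # 64 / 5 = 12.8
median_before = statistics.median(bills)  # middle value: 12
mode_before = statistics.mode(bills)      # most frequent value: 12

mean_after = statistics.mean(bills_with_outlier)      # 264 / 6 = 44
median_after = statistics.median(bills_with_outlier)  # (12 + 14) / 2 = 13.0
```

The mean jumps from 12.8 to 44 while the median only moves from 12 to 13, which is why the median is often preferred for skewed data.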
Measures of variability, spread or dispersion
- Range: Maximum value - Minimum value
```
bill_range = tips.total_bill.max() - tips.total_bill.min() # range (named bill_range to avoid shadowing Python's built-in range)
```
- IQR (Interquartile range): 75th percentile - 25th percentile
```
tips.total_bill.quantile(.75) - tips.total_bill.quantile(.25) # IQR
```
- Variance: average of the squared deviations from the mean (pandas divides by n - 1 by default, which gives the sample variance)
```
tips.total_bill.var() # variance of total_bill variable
```
- Standard deviation: square root of the variance; it shows how spread out the data is around the mean, in the same units as the data
```
tips.total_bill.std() # standard deviation of total_bill variable
```
- Coefficient of variation (C.V.): standard deviation expressed as a percentage of the mean
```
def cv(x):
    # standard deviation as a percentage of the mean
    return x.std() / x.mean() * 100
cv(tips.total_bill)
```
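To see what these measures compute, here is a small worked example on eight made-up observations, using only the standard `statistics` module. The `method="inclusive"` argument is chosen on the assumption that it matches the linear interpolation pandas applies in `quantile` by default.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations, mean = 5

# Range: maximum minus minimum
value_range = max(values) - min(values)  # 9 - 2 = 7

# IQR: 75th percentile minus 25th percentile
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # 5.5 - 4.0 = 1.5

# Variance: squared deviations from the mean sum to 32 here
sample_variance = statistics.variance(values)       # 32 / (n - 1), like pandas .var()
population_variance = statistics.pvariance(values)  # 32 / n = 4

# Standard deviation: square root of the variance
sample_std = statistics.stdev(values)

# Coefficient of variation: standard deviation as a percentage of the mean
cv_percent = sample_std / statistics.mean(values) * 100
```

The sample and population variance differ only in the denominator; for large datasets the two are nearly identical.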
Measures of symmetry and peakedness: Skewness measures symmetry and Kurtosis measures peakedness

Normal or Gaussian distribution

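This is one of the most common statistical distributions. The curve of this distribution is shaped like a bell.
The shape of the bell depends on the mean and standard deviation of the data: the mean sets the center and the standard deviation sets the width.
The larger the standard deviation, the wider the distribution.
A tip to quickly assess normality is to see if the mean and median are nearly equal. This only checks for symmetry, so treat it as a first look rather than proof of normality.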
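A quick way to get a feel for these properties is to simulate normally distributed data. The sketch below uses the standard library's `random.gauss`; the mean of 50, standard deviation of 10 and the sample count are arbitrary assumptions. It checks the mean-versus-median tip and also the well-known property that roughly 68% of normal data falls within one standard deviation of the mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

mu, sigma = 50.0, 10.0  # hypothetical mean and standard deviation
data = [random.gauss(mu, sigma) for _ in range(10_000)]

mean = statistics.fmean(data)
median = statistics.median(data)
std = statistics.stdev(data)

# For normal data the mean and median should be nearly equal
mean_median_gap = abs(mean - median)

# Share of observations within one standard deviation of the mean
within_one_std = sum(abs(x - mean) <= std for x in data) / len(data)
```

Re-running this with a larger `sigma` gives a wider spread of values around the same center, which is the "wider bell" described above.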

Skewness and Kurtosis

Skewness measures the tendency of the data to be spread out more on one side of the mean than the other. The skewness value indicates:
- Negative value: the data is left skewed (longer tail on the left)
- Positive value: the data is right skewed (longer tail on the right)
- Value close to zero: the data is roughly symmetric, as it is when normally distributed
```
import scipy.stats as s
s.skew(tips.total_bill, bias=False)  #calculate sample skewness
```
Kurtosis measures the tendency of the data to be concentrated around the center or in the tails. By default scipy reports excess kurtosis, which is zero for a normal distribution, so the kurtosis value indicates:
- Platykurtic: Negative value indicates lower than normal peakedness
- Leptokurtic: Positive value indicates higher than normal peakedness
- Mesokurtic: Value close to zero, as expected for normally distributed data
```
import scipy.stats as s
s.kurtosis(tips.total_bill, bias=False)  #calculate sample kurtosis
```
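To show what sits behind these two numbers, here is a minimal sketch that computes skewness and excess kurtosis from their moment definitions using only the standard library. The function names and the two small datasets are made up for illustration. These are the uncorrected (biased) versions, so on small samples they will differ slightly from the `bias=False` results in the scipy calls above.

```python
import statistics


def central_moment(values, k):
    """Average of (x - mean) ** k over all values."""
    mean = statistics.fmean(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Biased skewness: third central moment divided by std ** 3."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Biased excess kurtosis: fourth central moment / variance ** 2, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]               # made-up, perfectly symmetric data
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]  # made-up data with a long right tail

skew_symmetric = skewness(symmetric)     # 0.0, no skew
skew_right = skewness(right_skewed)      # positive, right skewed
kurt_symmetric = excess_kurtosis(symmetric)  # about -1.3, platykurtic
```

Subtracting 3 is what turns plain kurtosis into excess kurtosis, so that a normal distribution scores zero.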

Comments welcome!

parashar.ca