How to Use One Way ANOVA in Python
One way ANOVA (Analysis of Variance) is a technique for hypothesis testing. It is used to test whether the means of different group is really different.
Okaaaaay….
But then for all of you that are not used with Statistics, there might be big question arise: “Why would we even need to test this hypothesis testing?”
Let me explain really quickly. In real world, most of the time we can only take samples of quantitative data. Say that I conduct a simple survey, I want to know the height of people in a town with population 10 million, and I could only ask 100 people.
Now, I take the first survey and got 100 people data, it shows that the average of people’s height in the town are 170cm for men and 165cm for women. But somehow I want to re-do the survey in next day
Then I re-do the survey (assumes that I don’t have the same person in with previous survey) and suprisingly it shows that the average of people’s height in the town are 165cm for men and 165cm for women. Okay, now I am confused.
Sooooo.. do men and women have same height in this town?
This is it! It’s time to use statistics that you all have learnt!
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import seaborn as sns
import numpy as np
/home/arie/miniconda2/envs/data_analysis_351/lib/python3.5/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
from pandas.core import datetools
# don't ask about this, it's just happens to be a good style!
plt.style.use('fivethirtyeight')
The Hypothesis
For this toy problem purpose, I have a hypothesis that
for each diets, people weight’s mean is same.
Load The Data
Here I am using the Diet Dataset (see here for more datasets) from University of Sheffield for this practice problem. From the description here, the gender is binary variable which contains 0 for Female and 1 for Male.
data = pd.read_csv('https://www.sheffield.ac.uk/polopoly_fs/1.570199!/file/stcp-Rdataset-Diet.csv')
Getting Sense of The Dataset
data.head()
Person | gender | Age | Height | pre.weight | Diet | weight6weeks | |
---|---|---|---|---|---|---|---|
0 | 25 | 41 | 171 | 60 | 2 | 60.0 | |
1 | 26 | 32 | 174 | 103 | 2 | 103.0 | |
2 | 1 | 0 | 22 | 159 | 58 | 1 | 54.2 |
3 | 2 | 0 | 46 | 192 | 60 | 1 | 54.0 |
4 | 3 | 0 | 55 | 170 | 64 | 1 | 63.3 |
print('This dataset contains {} rows'.format(data.size))
This dataset contains 546 rows
See If There is Any Missing Values
data.gender.unique()
array([' ', '0', '1'], dtype=object)
# show which person have missing value in gender
data[data.gender == ' ']
Person | gender | Age | Height | pre.weight | Diet | weight6weeks | |
---|---|---|---|---|---|---|---|
0 | 25 | 41 | 171 | 60 | 2 | 60.0 | |
1 | 26 | 32 | 174 | 103 | 2 | 103.0 |
print('Missing value percentage of all data: {:.2f}%'.format(data[data.gender == ' '].size / data.size * 100))
Missing value percentage of all data: 2.56%
Cool! we only have ~3% missing value, either we could ignore, delete, or classify it’s gender by using the closest Height mean.
Getting the Sense of the Height Distribution
f, ax = plt.subplots(figsize=(11,9))
plt.title('Weight Distributions Between Sample')
plt.ylabel('pdf')
sns.distplot(data.weight6weeks)
<matplotlib.axes._subplots.AxesSubplot at 0x7f744d471320>
f, ax = plt.subplots(figsize=(11,9))
sns.distplot(data[data.gender == '1'].weight6weeks, ax=ax, label='Male')
sns.distplot(data[data.gender == '0'].weight6weeks, ax=ax, label='Female')
plt.title('Weight Distribution for Each Gender')
plt.legend()
<matplotlib.legend.Legend at 0x7f744d5dee48>
def infer_gender(x):
if x == '1':
return 'Male'
if x == '0':
return 'Female'
return 'Other'
def show_distribution(df, gender, column, group):
f, ax = plt.subplots(figsize=(11,9))
plt.title('Weight Distribution for {} on each {}'.format(gender, column))
for group_member in group:
sns.distplot(df[df[column] == group_member].weight6weeks, label='{}'.format(group_member))
plt.legend()
plt.show()
unique_diet = data.Diet.unique()
unique_gender = data.gender.unique()
for gender in unique_gender:
if gender != ' ':
show_distribution(data[data.gender == gender], infer_gender(gender), 'Diet', unique_diet)
data.groupby('gender').agg(
[np.mean, np.median, np.count_nonzero, np.std]
).weight6weeks
mean | median | count_nonzero | std | |
---|---|---|---|---|
gender | ||||
81.500000 | 81.5 | 2.0 | 30.405592 | |
0 | 63.223256 | 62.4 | 43.0 | 6.150874 |
1 | 75.015152 | 73.9 | 33.0 | 4.629398 |
data.groupby(['gender', 'Diet']).agg(
[np.mean, np.median, np.count_nonzero, np.std]
).weight6weeks
mean | median | count_nonzero | std | ||
---|---|---|---|---|---|
gender | Diet | ||||
2 | 81.500000 | 81.50 | 2.0 | 30.405592 | |
0 | 1 | 64.878571 | 64.50 | 14.0 | 6.877296 |
2 | 62.178571 | 61.15 | 14.0 | 6.274635 | |
3 | 62.653333 | 61.80 | 15.0 | 5.370537 | |
1 | 1 | 76.150000 | 75.75 | 10.0 | 5.439414 |
2 | 73.163636 | 72.70 | 11.0 | 3.818448 | |
3 | 75.766667 | 76.35 | 12.0 | 4.434848 |
We can see difference in weight on females in diet, but interestingly, it does not seems to affect males. LOL
Next question is: will the population behave the same?
The 1 Way Anova
The 1 way anova’s null hypothesis is \(\mu_{weight_{diet1}} = \mu_{weight_{diet2}} = \mu_{weight_{diet3}}\)
and this tests tries to see if it is true or not true
let’s assume that we have initially determine our confidence level of 95%, which means that we will accept 5% error rate.
mod = ols('Height ~ Diet', data=data[data.gender=='0']).fit()
# do type 2 anova
aov_table = sm.stats.anova_lm(mod, typ=2)
print('ANOVA table for Female')
print('----------------------')
print(aov_table)
print()
mod = ols('Height ~ Diet', data=data[data.gender=='1']).fit()
# do type 2 anova
aov_table = sm.stats.anova_lm(mod, typ=2)
print('ANOVA table for Male')
print('----------------------')
print(aov_table)
ANOVA table for Female
----------------------
sum_sq df F PR(>F)
Diet 559.680764 1.0 7.17969 0.010566
Residual 3196.086677 41.0 NaN NaN
ANOVA table for Male
----------------------
sum_sq df F PR(>F)
Diet 67.801603 1.0 0.43841 0.512784
Residual 4794.259003 31.0 NaN NaN
There are two p-values(PR(>F)) that we can see here, male and female.
For male, we cannot accept the null hypothesis under 95% confident level, because the p-value is greater than our alpha (0.05 < 0.512784). So given these three type of diet, there are no difference in male weights.
For female, since the p-value PR(>F) is less than our error rate (0.05 > 0.010566), we could reject the null hypothesis. This means we are quite confident that there is a different in height for Female in diets.
Okay so we know the effect of diet in female, but we don’t know which diet is different from which. We have to do post-hoc analysis using Tukey HSD (Honest Significant Difference) Test.
from statsmodels.stats.multicomp import pairwise_tukeyhsd, MultiComparison
# Only use female data
df = data[data.gender=='0']
# compare the height between each diet, using 95% confidence interval
mc = MultiComparison(df['Height'], df['Diet'])
tukey_result = mc.tukeyhsd(alpha=0.05)
print(tukey_result)
print('Unique diet groups: {}'.format(mc.groupsunique))
Multiple Comparison of Means - Tukey HSD,FWER=0.05
==============================================
group1 group2 meandiff lower upper reject
----------------------------------------------
1 2 -3.5714 -11.7861 4.6432 False
1 3 -8.7714 -16.848 -0.6948 True
2 3 -5.2 -13.2766 2.8766 False
----------------------------------------------
Unique diet groups: [1 2 3]
We can only reject the null hypothesis between diet type 1 and diet type 3, means there is statistically significant difference in weight for diet 1 and diet 3.
References:
- https://www.marsja.se/four-ways-to-conduct-one-way-anovas-using-python/
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html
- http://hamelg.blogspot.com/2015/11/python-for-data-analysis-part-16_23.html
- http://www.statsmodels.org/dev/generated/statsmodels.stats.anova.anova_lm.html
- https://www.sheffield.ac.uk/mash/statistics2/anova
- https://stackoverflow.com/questions/16049552/what-statistics-module-for-python-supports-one-way-anova-with-post-hoc-tests-tu