In this video we'll introduce the high-level concepts of Bayesian hypothesis testing and walk through several commonly used methods with Python code examples.
Hypothesis testing is a common approach to statistical analysis. In statistics, a hypothesis is a premise, or a claim, that we want to test or investigate. One way to investigate it (we can't always go to a lab and run experiments) is to survey several hundred people and use those data as a representative sample of the population. So when we talk about hypothesis testing, or investigating a claim, we are talking about drawing a sample, gathering information, and then testing the hypothesis against it. Much like in science, where we have an idea or a problem and want to test that idea, in statistics we have to pick out the hypothesis from reading the problem and come up with a test that lets us investigate it. The goal of hypothesis testing, therefore, is to make a decision based on what the data suggest about the question we want to investigate. Back in the 1990s, deriving a posterior density for Bayesian inference was close to rocket science: it was frustrating to test claims with mathematics alone, without the efficient computational tools we enjoy today.
Now that probabilistic programming languages like PyMC3 and JAGS have started to blossom, Bayesian methods for estimation have started to prevail, and hypothesis testing is no exception.
There is no one-size-fits-all way of conducting Bayesian hypothesis testing. However, all testing methods leverage 1) the prior distribution, 2) the likelihood function, and 3) the posterior distribution derived from either Bayes' rule or MCMC simulation to make an informed decision. This means a statistical claim is tested under a combination of knowledge sources instead of depending on a single study alone, so Bayesian hypothesis testing draws on richer evidence and can be more convincing.
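To anchor these three ingredients, here is Bayes' rule in its standard form (the notation is mine, not from the transcript): for a parameter $\theta$ and observed data $D$,

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)},$$

where $p(\theta)$ is the prior, $p(D \mid \theta)$ is the likelihood, and $p(\theta \mid D)$ is the posterior that we either derive analytically or approximate with MCMC.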
Bayesian hypothesis testing methods fall into three main types, namely:
Traditional - Bayesian hypothesis testing methods that use the posterior distribution (e.g., the 95% highest posterior density interval) to determine whether the parameter we are testing shows a substantial difference from the null value - the value indicating that no statistical relationship is present for that parameter.
Proportional - proportional Bayesian hypothesis testing methods cover all cases where we want to quantitatively measure the strength of evidence in favor of one of two competing claims, each of which shapes a different likelihood function. This includes the Bayes Factor method that we'll discuss later in this tutorial.
Cost-related - finally, some Bayesian hypothesis testing methods are cost-related: we want to minimize the expected loss if we decide wrongly, choosing the action best supported by the posterior samples. That's the high-level view of Bayesian hypothesis testing; before we move to a concrete scenario, the toy sketch below illustrates the cost-related idea.
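As a toy illustration (this sketch and its loss values are made up for illustration and are not part of the course materials), we can approximate the expected loss of each action from posterior samples and pick the cheaper one:

# Toy sketch: decision-theoretic testing with posterior samples (illustrative numbers only)
import numpy as np
theta = np.random.normal(0.03, 0.02, size = 5000)   # stand-in for MCMC draws of an effect size
loss_act = np.where(theta <= 0, 10.0, 0.0)          # loss incurred by acting when the effect is actually <= 0
loss_wait = np.where(theta > 0, 1.0, 0.0)           # loss incurred by not acting when the effect is actually > 0
print(loss_act.mean(), loss_wait.mean())            # choose the action with the smaller expected loss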
Okay! Now we've learnt what a Bayesian hypothesis test is conceptually. We know that for each hypothesis test we either have enough evidence to support a claim or we do not. The null hypothesis is the default hypothesis that we test against with evidence. For example, if we are looking for evidence in data that gender inequality is happening in some part of the world, the null hypothesis represents the default claim that there is no difference between genders. The alternative hypothesis is the opposite of the null hypothesis; in this case, it claims that there is gender inequality supported by the data. Therefore, every hypothesis testing problem comes as a pair of hypotheses - the null hypothesis and the alternative hypothesis. For our example, we are testing the following pair:
Null hypothesis: We hypothesize that the salary distribution is equal between genders (Gender equality in salary distribution)
Alternative hypothesis: we want to find evidence to prove that the salary distribution is unequal between genders (Gender inequality in salary distribution)
For example, gender equality has sparked tremendous dialogue and campaigns around the world. It is the state of equal ease of access to resources and opportunities regardless of gender, including economic participation and decision-making. If you want to figure out whether female employees are paid less than male employees in a certain country or area, you are conducting a two-sample, one-sided hypothesis test. "Two-sample" means that we use two samples to test the hypothesis - the male employee sample and the female employee sample. Either your claim is bolstered by the data, or it is not supported by the data. If, instead, you want to show that female employees are paid either much more or much less than male employees in a certain country, you are conducting a two-sample, two-sided hypothesis test. Your claim is supported if female employees are paid either far less or far more than male employees; otherwise it does not have sufficient evidence, because the salaries might differ only slightly between genders and the difference is not substantial. The two versions can be written formally as below.
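Written formally (my notation, not from the transcript), with $\mu_F$ and $\mu_M$ denoting the mean female and male salaries:

One-sided: $H_0: \mu_F = \mu_M$ versus $H_1: \mu_F < \mu_M$.

Two-sided: $H_0: \mu_F = \mu_M$ versus $H_1: \mu_F \neq \mu_M$.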
No one can foresee the result of a hypothesis test, let alone the variability of the evidence, until we write the scripts that combine our prior thinking with the data samples - that is, until the Bayesian model is implemented.
Hopefully you enjoyed the video and learned something meaningful about Bayesian hypothesis testing. Let's start to code in the next video!
In this lecture, we'll dive into several research questions about gender equality in campus placement and test hypotheses about them. Essentially, we compare different groups of inputs and examine their posterior distributions: are they significantly different from each other? Do they overlap? Let's start coding!
# https://www.kaggle.com/duncankmckinnon/statistics-of-campus-placement-data
# So let's import several useful packages for data science, pandas, seaborn, numpy and matplotlib
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Campus recruitment data is often recorded for many universities because it is a strategy to engage, source
# and hire talented students and recent graduates for internship or entry-level positions.
# These recruitment might take place in career fairs and placement seasons.
# Now let's read the campus placement data from the assets folder
placement = pd.read_csv("assets/placement_data_full_class.csv")
# and take a look!
placement
# You see that we have rich data here. This data set consists of placement records of 215 students on
# a university campus in India, according to Kaggle.
# It includes secondary and higher secondary school percentages and specialization.
# It also includes degree specialization and type, work experience, and the salary offered to placed students.
sl_no | gender | ssc_p | ssc_b | hsc_p | hsc_b | hsc_s | degree_p | degree_t | workex | etest_p | specialisation | mba_p | status | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | M | 67.00 | Others | 91.00 | Others | Commerce | 58.00 | Sci&Tech | No | 55.0 | Mkt&HR | 58.80 | Placed | 270000.0 |
1 | 2 | M | 79.33 | Central | 78.33 | Others | Science | 77.48 | Sci&Tech | Yes | 86.5 | Mkt&Fin | 66.28 | Placed | 200000.0 |
2 | 3 | M | 65.00 | Central | 68.00 | Central | Arts | 64.00 | Comm&Mgmt | No | 75.0 | Mkt&Fin | 57.80 | Placed | 250000.0 |
3 | 4 | M | 56.00 | Central | 52.00 | Central | Science | 52.00 | Sci&Tech | No | 66.0 | Mkt&HR | 59.43 | Not Placed | NaN |
4 | 5 | M | 85.80 | Central | 73.60 | Central | Commerce | 73.30 | Comm&Mgmt | No | 96.8 | Mkt&Fin | 55.50 | Placed | 425000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
210 | 211 | M | 80.60 | Others | 82.00 | Others | Commerce | 77.60 | Comm&Mgmt | No | 91.0 | Mkt&Fin | 74.49 | Placed | 400000.0 |
211 | 212 | M | 58.00 | Others | 60.00 | Others | Science | 72.00 | Sci&Tech | No | 74.0 | Mkt&Fin | 53.62 | Placed | 275000.0 |
212 | 213 | M | 67.00 | Others | 67.00 | Others | Commerce | 73.00 | Comm&Mgmt | Yes | 59.0 | Mkt&Fin | 69.72 | Placed | 295000.0 |
213 | 214 | F | 74.00 | Others | 66.00 | Others | Commerce | 58.00 | Comm&Mgmt | No | 70.0 | Mkt&HR | 60.23 | Placed | 204000.0 |
214 | 215 | M | 62.00 | Central | 58.00 | Others | Science | 53.00 | Comm&Mgmt | No | 89.0 | Mkt&HR | 60.22 | Not Placed | NaN |
215 rows × 15 columns
Here our data has 215 records and 15 columns in total; let's look at a brief description of the variables.
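# A quick way to get that description (a small aside, not in the original notebook) is pandas' summary methods:
placement.describe()
# placement.info() also lists each column's dtype and its number of non-null values.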
# We can visually inspect whether gender affects salary by drawing a kernel density plot
sns.kdeplot(data = placement, x = 'salary', hue = 'gender')
# The graph shows that:
# 1. Most of the placed candidates received a decent package of around 3 lakhs PA
# 2. Not many candidates received a salary of more than 4 lakhs PA
<AxesSubplot:xlabel='salary', ylabel='Density'>
# We can draw 3 subplots in one figure to show the counts of placement status, gender and prior work experience -
# there are numerous ways to do this, but specifying the fig and ax objects works fine.
fig, ax = plt.subplots(1, 3, figsize = (15, 4))
sns.countplot(x = placement['status'], ax = ax[0])
sns.countplot(x = placement['gender'], ax = ax[1])
sns.countplot(x = placement['workex'], ax = ax[2])
# We find interesting patterns in these three visualizations:
# 1. More people are placed in a job than not placed
# 2. There are more males recorded in this data set than females
# 3. A higher proportion of people have no prior work experience than have some
<AxesSubplot:xlabel='workex', ylabel='count'>
# Although we can generate graphs for each factor we are interested in, status, gender and workex
# are still categorical, so we can't use them directly in a model unless we assign them values, such
# as 0 for female and 1 for male candidates. This can be achieved by label encoding:
# Since this is the first time we use a label encoder, let's import it from sklearn.
from sklearn.preprocessing import LabelEncoder
# It's a one-line encoder that transforms binary categorical variables into 0 and 1.
# We can apply the label encoder function for the category which has only two types of classes
# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
# We create a list of categorical variables that we're going to transform
object_cols = ['gender', 'workex', 'status']
# and write a for loop for col in all object columns,
for col in object_cols:
    # We call the label_encoder to transform each object column into 0-1 binary form
    placement[col] = label_encoder.fit_transform(placement[col])
# and now we see the data again.
placement.head()
# OK, now you see that the gender variable, the work experience variable, and the status variable
# are labeled zero and one - they have all become binary.
sl_no | gender | ssc_p | ssc_b | hsc_p | hsc_b | hsc_s | degree_p | degree_t | workex | etest_p | specialisation | mba_p | status | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 67.00 | Others | 91.00 | Others | Commerce | 58.00 | Sci&Tech | 0 | 55.0 | Mkt&HR | 58.80 | 1 | 270000.0 |
1 | 2 | 1 | 79.33 | Central | 78.33 | Others | Science | 77.48 | Sci&Tech | 1 | 86.5 | Mkt&Fin | 66.28 | 1 | 200000.0 |
2 | 3 | 1 | 65.00 | Central | 68.00 | Central | Arts | 64.00 | Comm&Mgmt | 0 | 75.0 | Mkt&Fin | 57.80 | 1 | 250000.0 |
3 | 4 | 1 | 56.00 | Central | 52.00 | Central | Science | 52.00 | Sci&Tech | 0 | 66.0 | Mkt&HR | 59.43 | 0 | NaN |
4 | 5 | 1 | 85.80 | Central | 73.60 | Central | Commerce | 73.30 | Comm&Mgmt | 0 | 96.8 | Mkt&Fin | 55.50 | 1 | 425000.0 |
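# A quick aside (not in the original notebook): LabelEncoder assigns integer codes in sorted order of
# the original labels, which is why 'F' maps to 0 and 'M' to 1. After the loop, classes_ holds the
# mapping for the last column fitted (here 'status'):
dict(enumerate(label_encoder.classes_))   # {0: 'Not Placed', 1: 'Placed'}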
In this tutorial, we'll analyze data about campus recruitment to answer two questions:
Does Gender affect salary?
Does Gender affect the probability of placement?
# Let's import the pymc3 and arviz packages so we can build and diagnose models.
import pymc3 as pm
import arviz as az
# Research question 1: Does gender affect salary?
# A good practice before starting to model is to subset the data properly and save the subsets
# into separate variables. We first store the salaries of male employees, with gender equal to 1.
salary_male = placement[placement['gender'] == 1]['salary']
# and store another variable, salary_female, with the salaries of female employees, gender equal to 0.
salary_female = placement[placement['gender'] == 0]['salary']
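# A note on missing values (an aside, not in the original notebook): candidates who were not placed
# have NaN salaries, so these two subsets still contain missing values. If we pass them to PyMC3 as-is,
# the sampler will impute them from the sampling distribution (you'll see an ImputationWarning below);
# the alternative is to drop them up front with .dropna(), as we will do in the Bayes Factor video.
salary_male.isna().sum(), salary_female.isna().sum()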
# We wrap the model in a context manager and give it a meaningful name, gender_salary_ht
# (gender-salary hypothesis test).
with pm.Model() as gender_salary_ht:
    # Priors: assume the same salary for both genders a priori, with a mean of 300,000 and a
    # standard deviation of 50,000 for both mu_male and mu_female.
    mu_male = pm.Normal('mu_male', mu = 300000, sigma = 50000)
    mu_female = pm.Normal('mu_female', mu = 300000, sigma = 50000)
    # It's worth dwelling on sigma - the standard deviation. It tells you how much spread to
    # expect in the salaries of placed individuals; its absolute minimum is 0, which would mean
    # everyone with a placement earns exactly the same salary.
    sigma = pm.HalfNormal('sigma', 20000)
    # Hypothesis: same salary level between genders
    salary_diff = pm.Deterministic('salary_diff', mu_male - mu_female)
    # Likelihood: since we're conducting hypothesis testing, we create both the male and the
    # female estimates so we can compare their difference, based on the data subset for each gender.
    y_male = pm.Normal('y_male', mu = mu_male, sigma = sigma, observed = salary_male)
    y_female = pm.Normal('y_female', mu = mu_female, sigma = sigma, observed = salary_female)
    trace = pm.sample(3000, chains = 3, tune = 500)
# Take a coffee break for 30 seconds
ImputationWarning: Data in y_male contains missing values and will be automatically imputed from the sampling distribution. ImputationWarning: Data in y_female contains missing values and will be automatically imputed from the sampling distribution. Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (3 chains in 4 jobs) NUTS: [y_female_missing, y_male_missing, sigma, mu_female, mu_male]
Sampling 3 chains for 500 tune and 3_000 draw iterations (1_500 + 9_000 draws total) took 16 seconds.
# The first visualization we can make to diagnose the simulated result is a traceplot.
pm.traceplot(trace, var_names=['mu_male', 'mu_female', 'sigma', 'salary_diff'])
# In the right panel we see fluctuations in these variables without any special pattern,
# indicating that the posterior samples mix well.
# Looking at the left panel, however, we do notice a salary difference of a couple of tens of
# thousands. That is, the point with the highest posterior density for the salary_diff variable
# is located at around 30,000. This means male employees tend to earn a higher salary than
# female employees, but we need to look at the 95% highest posterior density interval to
# determine whether male employees earn a significantly higher salary.
DeprecationWarning: The function `traceplot` from PyMC3 is just an alias for `plot_trace` from ArviZ. Please switch to `pymc3.plot_trace` or `arviz.plot_trace`.
array([[<AxesSubplot:title={'center':'mu_male'}>, <AxesSubplot:title={'center':'mu_male'}>], [<AxesSubplot:title={'center':'mu_female'}>, <AxesSubplot:title={'center':'mu_female'}>], [<AxesSubplot:title={'center':'sigma'}>, <AxesSubplot:title={'center':'sigma'}>], [<AxesSubplot:title={'center':'salary_diff'}>, <AxesSubplot:title={'center':'salary_diff'}>]], dtype=object)
# Now let's look at the posterior distribution together with some useful statistics,
# and make a 95% highest posterior density plot for each of the variables of interest - we
# specifically look at the salary_diff variable.
az.plot_posterior(trace, var_names=['mu_male', 'mu_female', 'sigma', 'salary_diff'], hdi_prob = 0.95)
# Cool! You can see that the mean salary difference shown in the 95% highest posterior density
# interval is as high as approximately 30,000. But the lower end of the 95% HDI still crosses into
# negative territory. In this case, although we're tempted to draw a conclusion about gender
# inequality, with our prior and our placement data we do not have sufficient evidence to conclude
# that the salary distribution is unequal between genders. Conceptually speaking, we can claim with
# 95% credibility that the salary distribution is unequal between genders only if the entire 95%
# HDI falls on the positive side (males earn a significantly higher salary than females) or on the
# negative side (females earn a significantly higher salary than males).
array([<AxesSubplot:title={'center':'mu_male'}>, <AxesSubplot:title={'center':'mu_female'}>, <AxesSubplot:title={'center':'sigma'}>, <AxesSubplot:title={'center':'salary_diff'}>], dtype=object)
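# As a numeric companion to the plot above (a quick sketch, not in the original notebook), we can pull
# the 95% HDI for the salary difference directly instead of reading it off the figure:
az.hdi(trace['salary_diff'], hdi_prob = 0.95)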
# When we learned Bayesian regression, we developed a reliability check based on three criteria:
# 1) convergence, 2) effective sample size, and 3) autocorrelation.
# Convergence: let's first check rhat for each variable using the az.rhat function. It is an
# alternative way of getting the rhat statistic, just like using the pm.summary() function.
az.rhat(trace, var_names=['mu_male', 'mu_female', 'sigma', 'salary_diff'])
# You can see that for all variables the rhat statistics are very close to 1.0 - the chains for the
# salary difference have converged well. From the convergence viewpoint, the results from the
# posterior samples are reliable: the estimated salary difference is not an artifact of poor mixing.
<xarray.Dataset>
Dimensions:      ()
Data variables:
    mu_male      float64 1.0
    mu_female    float64 1.0
    sigma        float64 1.0
    salary_diff  float64 1.0
array(1.00039848)
array(1.00003403)
array(1.00036303)
array(1.00026887)
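# If you prefer numbers over plots (a quick sketch, not in the original notebook), az.summary reports
# r_hat together with the bulk and tail effective sample sizes in a single table:
az.summary(trace, var_names=['mu_male', 'mu_female', 'sigma', 'salary_diff'])[['ess_bulk', 'ess_tail', 'r_hat']]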
# Next, we should check autocorrelation. We can use the plot_autocorr function to draw the
# autocorrelation from the first to the last of the 3000 posterior draws. Note that the maximum
# number of lags shown is 100 by default, and I did not change it, because in most cases the
# autocorrelation drops so sharply that it becomes negligible well before the 100th lag.
az.plot_autocorr(trace,
var_names=['mu_male', 'mu_female', 'sigma', 'salary_diff'],
combined = True)
# We can say that the posterior samples are reliable by looking at the dwindling autocorrelation
# in each plot. We see no pattern suggesting that later posterior samples depend on their previous
# values: by the end of the 100 lags, every plot shows the autocorrelation dropping to around 0.
array([<AxesSubplot:title={'center':'mu_male'}>, <AxesSubplot:title={'center':'mu_female'}>, <AxesSubplot:title={'center':'sigma'}>, <AxesSubplot:title={'center':'salary_diff'}>], dtype=object)
# Now let's assess the effective sample size in a forest plot. Since a chain with high
# autocorrelation has fewer independent samples than the total number of posterior samples, we can
# use the plot_forest function and set the ess option to True to check whether the effective sample
# size is above the threshold of 200.
az.plot_forest(trace, var_names=['salary_diff'], ess = True, hdi_prob = 0.95)
# You see that the effective sample size is high - around 10,000 - so we have more than enough
# effective samples to obtain a stable posterior distribution for the salary difference.
# Finally, we look at the highest posterior density plot shown in the left panel.
# Each of the three blue lines represents the 95% highest posterior density interval for one chain.
# You can see that two out of three lines lie entirely on the positive side (you might get different
# results if you rerun the model, due to randomness), but one line crosses 0.
# So we still rely on the HDI plot, and since we've seen negative values inside the 95% HPD interval,
# we do not have enough evidence to bolster the claim that gender inequality in salary has taken
# place in this job market. Great! Now let's further explore the topic of gender equality.
array([<AxesSubplot:title={'center':'95.0% HDI'}>, <AxesSubplot:title={'center':'ess'}>], dtype=object)
# Research question 2: Does gender affect placement?
# We've just examined the salary difference between genders. Now, is there a significant difference
# between genders in terms of the probability of getting a job placement?
# We first store the placement results of male candidates, with gender equal to 1.
status_male = placement[placement['gender'] == 1]['status']
# and store another variable, status_female, with the placement status of female candidates,
# gender equal to 0. In both cases the value 1 means the individual was placed in a job, and 0
# means the individual was not placed.
status_female = placement[placement['gender'] == 0]['status']
# Again, we create a PyMC3 model, this time called gender_placement_ht (gender-placement
# hypothesis test).
with pm.Model() as gender_placement_ht:
    # Priors: assume the same placement probability for both genders a priori, using a
    # Beta(30, 30) prior centered at 0.5 for both prob_male and prob_female.
    prob_male = pm.Beta('prob_male', alpha = 30, beta = 30)
    prob_female = pm.Beta('prob_female', alpha = 30, beta = 30)
    # Hypothesis: same placement probability between genders
    placement_diff = pm.Deterministic('placement_diff', prob_male - prob_female)
    # Likelihood: since we're conducting hypothesis testing, we create both the male and the
    # female estimates so we can compare their difference, based on the data subset for each gender.
    y_male = pm.Bernoulli('y_male', p = prob_male, observed = status_male)
    y_female = pm.Bernoulli('y_female', p = prob_female, observed = status_female)
    trace = pm.sample(3000, chains = 3, tune = 500)
# Take a coffee break for 30 seconds
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (3 chains in 4 jobs) NUTS: [prob_female, prob_male]
Sampling 3 chains for 500 tune and 3_000 draw iterations (1_500 + 9_000 draws total) took 111 seconds.
# Cool! The simulation is finished and we can diagnose the posterior samples in a traceplot.
# I saw the deprecation message about traceplot earlier, so for future visualizations I'll change
# the code to az.plot_trace; the visualization produced is exactly the same.
az.plot_trace(trace)
# We see fluctuations in the prob_male, prob_female and placement_diff variables without any
# special pattern in the right panel, indicating that the posterior samples mix reasonably well.
# Looking at the left panel, however, we do notice some difference in the job placement probability.
# That is, the point with the highest posterior density for the placement_diff variable is located
# at around 0.08. This means male candidates have about an 8% higher chance of getting a job offer
# in the region of analysis.
# Note that in many countries women are more likely to get hired, so this example does not let us
# conclude that "in this Bayesian class we learnt that men are more likely to get hired". The world
# is complex, and we have a substantial amount of data and many metrics available for comparing
# genders.
array([[<AxesSubplot:title={'center':'prob_male'}>, <AxesSubplot:title={'center':'prob_male'}>], [<AxesSubplot:title={'center':'prob_female'}>, <AxesSubplot:title={'center':'prob_female'}>], [<AxesSubplot:title={'center':'placement_diff'}>, <AxesSubplot:title={'center':'placement_diff'}>]], dtype=object)
# Now let's assess the posterior results, together with the effective sample size and the
# convergence diagnostic r_hat, in a forest plot.
az.plot_forest(trace, var_names=['placement_diff'], r_hat = True, ess = True, hdi_prob = 0.95)
# You see that the effective sample size is above 5,000 for the difference in placement probability
# between genders, so we have sufficient effective samples to obtain a stable posterior distribution.
# We also see that r_hat sits at 1, so the r_hat metric indicates that all chains converged well,
# allowing us to treat the estimated difference in job placement probability as resting on sound
# sampling. Finally, we look at the 95% highest posterior density interval. The posterior mean of
# all three chains hovers around 0.08, but the lower end crosses 0. So although the average
# difference in placement proportion (male minus female) is approximately 8 percentage points -
# female candidates have a lower placement proportion than male candidates - at the 95% credibility
# level we do not have sufficient evidence to establish that the difference in job placement
# probability is substantial. Again, we cannot say that the placement difference observed in the
# data indicates significant gender inequality; the posterior samples do not provide enough
# evidence to claim that gender inequality is significant in this area.
array([<AxesSubplot:title={'center':'95.0% HDI'}>, <AxesSubplot:title={'center':'ess'}>, <AxesSubplot:title={'center':'r_hat'}>], dtype=object)
# If we want to know exactly how strongly our posterior sample indicates a difference,
# we can make a posterior plot with the reference value set to 0.
az.plot_posterior(trace, var_names = ['placement_diff'], ref_val = 0)
# In this case, you see the quantity "92.8%" in orange text to the right of the reference value 0.
# It does not have to be exactly 92.8% (it's fine if you see a slightly different value); it means
# that 92.8% of the posterior mass for the placement difference lies above 0, i.e. we are 92.8%
# confident that placement opportunities differ between genders in that direction. Choosing the
# level of credibility matters, because a higher level demands stronger evidence before we declare
# a significant difference between groups in Bayesian hypothesis testing. In this case 92.8% is
# smaller than 95%, so we cannot support our claim with the placement data at the 95% level.
<AxesSubplot:title={'center':'placement_diff'}>
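# As a side note (a sketch, not part of the original notebook): az.plot_posterior also accepts a rope
# argument, a region of practical equivalence around the null value. The interval (-0.05, 0.05) below
# is an arbitrary choice for illustration; the plot then reports how much posterior mass falls inside it.
az.plot_posterior(trace, var_names = ['placement_diff'], ref_val = 0, rope = (-0.05, 0.05))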
So far we've seen how to conduct Bayesian hypothesis testing to compare male and female candidates in terms of job placement probability and average salary. These tests are comparative in nature, and the advantages of Bayesian estimation are two-fold.
The Bayesian methods show you not only a point estimate of how large a group difference (such as the average salary gap) is, but also its uncertainty. Knowing the uncertainty helps us interpret the plausible range of results along with how likely each value is.
The Bayesian methods give the posterior density of every possible parameter value given the data, so they help us form an updated belief not only about which values are possible but also about how strongly each is supported.
In this lecture, you've seen how to test a hypothesis using PyMC3. Bayesian hypothesis testing provides a structured way to quantify the reasoning we do in our daily lives. One last note: if you disagree with the prior I provided, that is totally acceptable - we can then observe the same data but reach different conclusions. In many situations the logic of Bayesian thinking mirrors how we think naturally: we update our initial belief by constantly incorporating new evidence. Make sure you now know how to specify a problem clearly, run an appropriate model, and customize the posterior plots - for example with the ref_val, hdi_prob and rope options, which let you mark the null-hypothesis quantity as a reference value, set the level of credibility, and define the region of practical equivalence that the quantity should fall outside before we confidently decide in favor of the alternative hypothesis. I hope you enjoyed the first Bayesian hypothesis testing example. I'll see you in the next video!
I would like to give special thanks to Dr. Dhimant Ganatara, a Professor at Jain University, for creating the Kaggle dataset that I used for this Bayesian analysis.
In this lecture, we'll discuss another type of hypothesis testing - a proportional type called the Bayes Factor. We will outline an experiment, obtain the Bayes Factor to test competing hypotheses, and use posterior distribution results from previous experiments to update the prior distributions for subsequent experiments.
In the group comparison video, we examined the campus recruitment data to find out whether gender inequality exists in workers' salaries and in the chance of job placement. We compared the average salaries and the job placement probability between male and female candidates, using the 95% highest posterior density interval as the evidence for our conclusion. Some Bayesian statisticians, however, prefer comparing competing claims directly: rather than comparing gender groups on salary, they test pairs of competing claims and figure out under which claim we learn more from the data.
This notion leads us to today's topic - the Bayes Factor. Introduced by Harold Jeffreys, the Bayes factor is a Bayesian method of hypothesis testing that expresses the relative odds of two or more competing hypotheses and is usually used to determine which model better fits the data (Jeffreys 1961). The Bayes Factor is a ratio that represents how much we've learnt about our competing hypotheses from the data.
# Let's now look at the first few rows of the data again.
placement.head(5)
# In the placement data, we've seen that some people have prior work experience and some do not.
# We're curious whether prior work experience is associated with a higher salary upon job placement.
sl_no | gender | ssc_p | ssc_b | hsc_p | hsc_b | hsc_s | degree_p | degree_t | workex | etest_p | specialisation | mba_p | status | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 67.00 | Others | 91.00 | Others | Commerce | 58.00 | Sci&Tech | 0 | 55.0 | Mkt&HR | 58.80 | 1 | 270000.0 |
1 | 2 | 1 | 79.33 | Central | 78.33 | Others | Science | 77.48 | Sci&Tech | 1 | 86.5 | Mkt&Fin | 66.28 | 1 | 200000.0 |
2 | 3 | 1 | 65.00 | Central | 68.00 | Central | Arts | 64.00 | Comm&Mgmt | 0 | 75.0 | Mkt&Fin | 57.80 | 1 | 250000.0 |
3 | 4 | 1 | 56.00 | Central | 52.00 | Central | Science | 52.00 | Sci&Tech | 0 | 66.0 | Mkt&HR | 59.43 | 0 | NaN |
4 | 5 | 1 | 85.80 | Central | 73.60 | Central | Commerce | 73.30 | Comm&Mgmt | 0 | 96.8 | Mkt&Fin | 55.50 | 1 | 425000.0 |
We might have mixed perspectives on this question. In the job market, many production-line companies offer higher wages to new hires than they pay more senior employees - a form of pay compression. Other companies, such as tech companies, tend to pay new hires slightly less than current employees in the same position. One reason for paying new hires a higher salary is that companies want to attract potential workers who will stay and work long term. If we want to compare different hypotheses (say, one that believes new hires receive a higher paycheck than senior employees and one that believes the opposite), we want to look at the ratio between the two hypotheses. We are interested in how much observing the placement data would change the first belief relative to the second - that is, whether prior work experience adds to one's salary or diminishes it.
This is a very typical example of how the Bayes Factor can compare the probability of competing hypotheses given the data we observed. We extend the notion of asking whether the data give evidence of a difference in salary distribution due to gender (an absolute question) to asking which prior belief, or which hypothesis, lets you gain more information from the data than the competing hypothesis.
The Bayes factor for comparing two models can be computed as the ratio of the marginal likelihoods of the data under the two competing hypotheses. So instead of computing the posterior distribution of a particular parameter via Bayes' rule, we compute the ratio of the two marginal likelihoods of the data. To calculate this ratio - the Bayes Factor - we need the prior distribution under each hypothesis and the posterior distribution sampled by the MCMC algorithms in PyMC3; the formula below makes this explicit. We'll then look at a chart that shows how to read the strength of evidence from the Bayes Factor.
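Written out (a standard definition, using notation that is not in the original transcript), for data $D$ and hypotheses $H_0$ and $H_1$ with parameters $\theta_0$ and $\theta_1$:

$$BF_{01} = \frac{p(D \mid H_0)}{p(D \mid H_1)} = \frac{\int p(D \mid \theta_0, H_0)\, p(\theta_0 \mid H_0)\, d\theta_0}{\int p(D \mid \theta_1, H_1)\, p(\theta_1 \mid H_1)\, d\theta_1},$$

where each marginal likelihood averages the likelihood over that hypothesis's prior.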
Now let's explore Bayes Factor with an example.
Bayes Factor Source: https://zj-zhang.github.io/2016/12/31/Bayes-Factor/
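As a rough reading guide (one common adaptation of Jeffreys' scale, summarized from memory rather than from the linked chart): a Bayes Factor between 1 and 3 counts as anecdotal evidence for the hypothesis in the numerator, 3 to 10 as moderate, 10 to 30 as strong, 30 to 100 as very strong, and above 100 as extreme. Values below 1 favor the hypothesis in the denominator on the mirrored scale (1/3 to 1 anecdotal, 1/10 to 1/3 moderate, and so on).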
# Research question: Does gender affect salary?
salary_male = placement[placement['gender'] == 1]['salary'].dropna()
salary_female = placement[placement['gender'] == 0]['salary'].dropna()
# We can assume that males naturally earn a higher salary to justify the first prior setting, and
# assume that there is no difference between the genders to justify the second prior setting.
priors = ((300000, 50000, 200000, 50000), (300000, 50000, 300000, 50000))
models = []
traces = []
# Let's create a for loop that, for each assumption, unpacks the prior mean and standard deviation
# of the male salary and the prior mean and standard deviation of the female salary.
for mean_male, sigma_male, mean_female, sigma_female in priors:
    # Again, for each prior setting, we wrap the model in a context manager and call it bf_model
    with pm.Model() as bf_model:
        # We take the model from the last video, where we compared the salary distributions of
        # male and female candidates, but this time we substitute the prior values with the
        # variables defined for the two assumptions. The first model encodes the prior belief that
        # males earn more (300,000) than females (200,000); the second model assumes both genders
        # earn the same salary (300,000). For both models, we set the prior standard deviation to
        # 50,000 because salaries can differ among the placed candidates.
        mu_male = pm.Normal('mu_male', mu = mean_male, sigma = sigma_male)
        mu_female = pm.Normal('mu_female', mu = mean_female, sigma = sigma_female)
        # Likelihood: we assume the same observation noise under both prior settings, because here
        # we are not interested in comparing how uncertain the salary estimates are. We only
        # compare the "gender-equality" prior against the "gender-inequality" prior to see which
        # prior setting brings more information to the model fit.
        y_male = pm.Normal('y_male', mu = mu_male, sigma = 10000, observed = salary_male)
        y_female = pm.Normal('y_female', mu = mu_female, sigma = 10000, observed = salary_female)
        # Finally, to obtain the Bayes Factor we use the sequential Monte Carlo (SMC) sampler
        # instead of the default NUTS sampler, so the result includes not only the posterior draws
        # but also an estimate of the marginal likelihood.
        trace = pm.sample_smc(3000)
        # Append the model and the trace object to the lists we defined at the beginning.
        models.append(bf_model)
        traces.append(trace)
# Note that the advantage of the sequential Monte Carlo sampler is that it can estimate the
# marginal likelihood even without a closed-form mathematical expression. The cost is that it is
# less efficient than NUTS.
Initializing SMC sampler... Sampling 4 chains in 4 jobs Stage: 0 Beta: 0.001 Stage: 1 Beta: 0.003 Stage: 2 Beta: 0.010 Stage: 3 Beta: 0.035 Stage: 4 Beta: 0.119 Stage: 5 Beta: 0.403 Stage: 6 Beta: 1.000 (similar stage logs repeated for each chain and for the second model)
# Cool! Now let's do some Bayes Factor inference. We can compute the Bayes Factor by taking the
# difference of the log marginal likelihoods of the two models and then exponentiating it with np.exp.
BF_smc = np.exp(traces[0].report.log_marginal_likelihood - traces[1].report.log_marginal_likelihood)
BF_smc
# We can see that the Bayes Factors are smaller than 1. I'm surprised by this result! Recalling the
# Bayes Factor table, it means there is anecdotal - i.e. weak - evidence that the prior with the same
# salary distribution for both genders outperforms the prior with different salary distributions.
# In other words, assuming that males earn more than females gives slightly worse marginal likelihood.
# But how different are the posteriors we get from these two models?
array([0.51811128, 0.52678248, 0.46620985, 0.51487289])
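# Why four numbers? SMC ran four chains, and trace.report.log_marginal_likelihood stores one estimate
# per chain, so BF_smc is an array with four values. A single summary figure (a simple sketch) is just
# their average:
BF_smc.mean()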
# Let's look at the summary for the model with unequal prior.
az.summary(traces[0]).round(2)
/Users/michiganboy/opt/anaconda3/lib/python3.8/site-packages/arviz/data/io_pymc3.py:96: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context. warnings.warn(
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
mu_male | 298911.29 | 1005.05 | 297122.55 | 300910.86 | 9.02 | 6.38 | 12429.0 | 11808.0 | 1.0 |
mu_female | 267221.49 | 1462.27 | 264475.04 | 269938.47 | 13.30 | 9.40 | 12092.0 | 11481.0 | 1.0 |
# and the summary for the model with equal prior.
az.summary(traces[1]).round(2)
# We might argue that the two models are actually very similar even though they have different priors;
# both models converge and their posterior estimates are reliable, with more than 10,000 effective samples.
/Users/michiganboy/opt/anaconda3/lib/python3.8/site-packages/arviz/data/io_pymc3.py:96: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context. warnings.warn(
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
mu_male | 298903.43 | 1007.02 | 297018.45 | 300782.17 | 9.20 | 6.51 | 11967.0 | 11733.0 | 1.0 |
mu_female | 267321.32 | 1433.62 | 264752.59 | 270118.29 | 13.06 | 9.23 | 12056.0 | 11930.0 | 1.0 |
# We can also use the following code to compare the posterior predictive distributions of male salary
# under the two prior settings given to the model. Let's run it directly.
_, ax = plt.subplots()
ppc_0 = pm.sample_posterior_predictive(traces[0], 100, models[0], size = (100, 20))
ppc_1 = pm.sample_posterior_predictive(traces[1], 100, models[1], size = (100, 20))
for m_0, m_1 in zip(ppc_0['y_male'].T, ppc_1['y_male'].T):
    az.plot_kde(np.mean(m_0, 0), ax = ax, plot_kwargs = {"color": "C0"})
    az.plot_kde(np.mean(m_1, 0), ax = ax, plot_kwargs = {"color": "C1"})
ax.plot([], label = "model_different_prior")
ax.plot([], label = "model_same_prior")
ax.legend()
ax.set(xlabel = "male salary prediction")
<matplotlib.legend.Legend at 0x7f8c7ceeee50>
# We can use the same code to compare the posterior predictive distributions of female salary
# under the two prior settings given to the model. Let's run it directly.
_, ax = plt.subplots()
ppc_0 = pm.sample_posterior_predictive(traces[0], 100, models[0], size = (100, 20))
ppc_1 = pm.sample_posterior_predictive(traces[1], 100, models[1], size = (100, 20))
for m_0, m_1 in zip(ppc_0['y_female'].T, ppc_1['y_female'].T):
    az.plot_kde(np.mean(m_0, 0), ax = ax, plot_kwargs = {"color": "C0"})
    az.plot_kde(np.mean(m_1, 0), ax = ax, plot_kwargs = {"color": "C1"})
ax.plot([], label = "model_different_prior")
ax.plot([], label = "model_same_prior")
ax.legend()
ax.set(xlabel = "female salary prediction")
[Text(0.5, 0, 'female salary prediction')]
Although both models appear quite consistent in the salary distribution for each gender, we can still observe that the posterior predictive distributions sit slightly closer to the actual data under the model with the same prior for both genders. This difference is what the Bayes factors capture, even though the evidence is only anecdotal. This example also shows that two models with different Bayes factors can still make quite similar predictions: when the number of observations is large, the data provide enough information to reduce the effect of the prior on the posterior distribution.
The Bayes factor plays a significant role in Bayesian hypothesis testing. Here we used it to compare two prior assumptions for the same model. Unlike standard hypothesis testing, it computes the relative plausibility of the models rather than directly measuring the size of a difference, so it is fine to remain uncertain about whether either of the two hypotheses must be true. Hopefully you've learnt both the techniques and the mindset of using the Bayes factor. See you in the next video!
We need to determine which statistical model best represents the situation among the different choices we come up with. In this lecture, we will look at several methods for finding the best model among different candidates. First, we start with information criteria.
Several information criteria are typically used in Bayesian statistics to compare different model specifications. In Bayesian inference, the most common ways of assessing the goodness of fit of an estimated statistical model are the Deviance Information Criterion (DIC) and the Widely Applicable Information Criterion (WAIC).
DIC may be compared across different models and even different methods, as long as the dependent variable does not change between models, making DIC a highly flexible model fit statistic.
DIC is valid only when the joint posterior distribution is approximately multivariate normal. Models with smaller DIC should be preferred. Since DIC increases with model complexity (pD or pV), simpler models are favored.
The Widely Applicable Information Criterion (WAIC) is an information criterion that is more fully Bayesian than DIC, and its result more closely resembles leave-one-out cross-validation.
When selecting the best model to represent the problem, we typically choose the one with the smallest information criterion value.
# Information criteria: AIC, BIC, DIC and WAIC
# https://docs.pymc.io/notebooks/model_comparison.html
# We use the az.loo function to check the LOO estimate for the placement model.
# (Note: `trace` here should be the trace sampled from gender_placement_ht; if you ran the Bayes
# Factor section in between, re-run that model first so `trace` is not one of the SMC traces.)
az.loo(trace, gender_placement_ht)
# You can see that the estimate for elpd_loo is -7.48, with a standard error of 0.58,
# and p_loo - the effective number of parameters estimated by LOO (leave-one-out cross-validation) -
# is 0.53. When comparing models on the deviance scale, lower values indicate a better fit; on the
# elpd scale reported here, a higher (less negative) elpd_loo indicates better expected out-of-sample fit.
Both DIC and WAIC relate to out-of-sample, or generalization, prediction. This is generally a good way to evaluate models, even when you care more about the parameters than about the predictions. The general idea is that if your model and parameters are a good description of the underlying phenomenon or process you are studying, they should be able to predict unobserved (but observable) future data.
If you get a warning, you have a couple of options (besides ignoring it): use another method, such as LOO instead of WAIC (or vice versa), use K-fold cross-validation, or change your model to one that is more robust. Of course, to compare your models you can also add posterior predictive checks (although these are an in-sample analysis) and background information. A little more about the warnings: they are based on empirical observations, and more work is needed here, but at this point they are the best diagnostics we have; tools that diagnose or at least visualize the problematic points would also help. Notice that when using DIC/BPIC you always get a tidy result without any warnings (even when the assumptions are not met), and that can lead to overconfidence!
DIC assumes the posterior is Gaussian; the further you move away from this assumption, the more misleading the DIC values will be. Hierarchical models, in particular, tend to have non-Gaussian posteriors. WAIC is also more Bayesian in the sense that it averages over the posterior distribution rather than relying on a point estimate.
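As a self-contained sketch of this workflow (a toy example added here, not part of the campus placement analysis), az.compare ranks models by their estimated out-of-sample predictive accuracy using WAIC or LOO:

# Toy comparison: the same data under a reasonable prior and an overly tight prior
import numpy as np
import pymc3 as pm
import arviz as az
y = np.random.normal(1.0, 1.0, size = 50)
with pm.Model() as m_wide:
    mu = pm.Normal('mu', 0, 10)
    pm.Normal('y', mu, 1.0, observed = y)
    idata_wide = pm.sample(1000, return_inferencedata = True)
with pm.Model() as m_tight:
    mu = pm.Normal('mu', 0, 0.1)
    pm.Normal('y', mu, 1.0, observed = y)
    idata_tight = pm.sample(1000, return_inferencedata = True)
# Rank 0 in the resulting table is the model with the best estimated predictive accuracy
az.compare({'wide_prior': idata_wide, 'tight_prior': idata_tight}, ic = 'waic')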
I hope you've gained a fundamental understanding of Bayesian hypothesis testing at this point. We've shown two Bayesian hypothesis testing paradigms: 1) traditional testing, where we examine the HPD interval, and 2) proportional testing, where we compare two competing hypotheses through the Bayes Factor. Using the Bayes Factor is justified in many group research scenarios - whenever a group has not reached consensus on the background knowledge, we may want to test how sensitive the conclusions are to the prior and how much information each prior lets us gain from the data. Congratulations! Let's continue exploring Bayesian estimation in the next video!