In this video we'll introduce the high-level concepts of Bayesian hypothesis testing and walk through several commonly used methods with Python code examples.
Hypothesis testing is a common approach to statistical analysis. In statistics, a hypothesis is a premise, or a claim, that we want to test or investigate. One way to investigate it (we can't always go to a lab and run experiments) is to survey several hundred people and use those data as a representative sample of the population. So when we talk about hypothesis testing, or investigating a claim, we are talking about drawing a sample, gathering information, and then testing the hypothesis against it. Much like in science, where we have an idea or a problem and want to test that idea, in statistics we have to pick out the hypothesis from reading the problem and come up with a test that lets us investigate it. The goal of hypothesis testing, therefore, is to make a decision based on what the data suggest about the question we want to investigate. Back in the 1990s, deriving a posterior density for Bayesian inference was close to rocket science: it was frustrating to test claims with mathematics alone, without the efficient computational tools we enjoy today.
Now that probabilistic programming languages like PyMC3 and JAGS have started to blossom, Bayesian methods for estimation have started to prevail, and hypothesis testing is no exception.
There is no one-size-fits-all way of conducting Bayesian hypothesis testing. However, all testing methods leverage 1) the prior distribution, 2) the likelihood function, and 3) the posterior distribution derived from either Bayes' rule or MCMC simulation to make an informed decision. This means a statistical claim is tested under a combination of knowledge sources instead of depending on a single study alone, so Bayesian hypothesis testing draws on richer evidence and can be more convincing.
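To anchor these three ingredients, here is Bayes' rule in its standard form (the notation is mine, not from the transcript): for a parameter $\theta$ and observed data $D$,

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)},$$

where $p(\theta)$ is the prior, $p(D \mid \theta)$ is the likelihood, and $p(\theta \mid D)$ is the posterior that we either derive analytically or approximate with MCMC.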
Bayesian hypothesis testing methods fall into three main types, namely:
Traditional - Bayesian hypothesis testing methods that use the posterior distribution (e.g., the 95% highest posterior density interval) to determine whether the parameter we are testing shows a substantial difference from the null value - the value indicating that no statistical relationship is present for that parameter.
Proportional - proportional Bayesian hypothesis testing methods cover all cases where we want to quantitatively measure the strength of evidence in favor of one of two competing claims, each of which shapes a different likelihood function. This includes the Bayes Factor method that we'll discuss later in this tutorial.
Cost-related - finally, some Bayesian hypothesis testing methods are cost-related: we want to minimize the expected loss if we decide wrongly, choosing the action best supported by the posterior samples. That's the high-level view of Bayesian hypothesis testing; before we move to a concrete scenario, the toy sketch below illustrates the cost-related idea.
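As a toy illustration (this sketch and its loss values are made up for illustration and are not part of the course materials), we can approximate the expected loss of each action from posterior samples and pick the cheaper one:

# Toy sketch: decision-theoretic testing with posterior samples (illustrative numbers only)
import numpy as np
theta = np.random.normal(0.03, 0.02, size = 5000)   # stand-in for MCMC draws of an effect size
loss_act = np.where(theta <= 0, 10.0, 0.0)          # loss incurred by acting when the effect is actually <= 0
loss_wait = np.where(theta > 0, 1.0, 0.0)           # loss incurred by not acting when the effect is actually > 0
print(loss_act.mean(), loss_wait.mean())            # choose the action with the smaller expected loss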
Okay! Now we've learnt what a Bayesian hypothesis test is conceptually. We know that for each hypothesis test we either have enough evidence to support a claim or we do not. The null hypothesis is the default hypothesis that we test against with evidence. For example, if we are looking for evidence in data that gender inequality is happening in some part of the world, the null hypothesis represents the default claim that there is no difference between genders. The alternative hypothesis is the opposite of the null hypothesis; in this case, it claims that there is gender inequality supported by the data. Therefore, every hypothesis testing problem comes as a pair of hypotheses - the null hypothesis and the alternative hypothesis. For our example, we are testing the following pair:
Null hypothesis: We hypothesize that the salary distribution is equal between genders (Gender equality in salary distribution)
Alternative hypothesis: we want to find evidence to prove that the salary distribution is unequal between genders (Gender inequality in salary distribution)
For example, gender equality has sparked tremendous dialogue and campaigns around the world. It is the state of equal ease of access to resources and opportunities regardless of gender, including economic participation and decision-making. If you want to figure out whether female employees are paid less than male employees in a certain country or area, you are conducting a two-sample, one-sided hypothesis test. "Two-sample" means that we use two samples to test the hypothesis - the male employee sample and the female employee sample. Either your claim is bolstered by the data, or it is not supported by the data. If, instead, you want to show that female employees are paid either much more or much less than male employees in a certain country, you are conducting a two-sample, two-sided hypothesis test. Your claim is supported if female employees are paid either far less or far more than male employees; otherwise it does not have sufficient evidence, because the salaries might differ only slightly between genders and the difference is not substantial. The two versions can be written formally as below.
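Written formally (my notation, not from the transcript), with $\mu_F$ and $\mu_M$ denoting the mean female and male salaries:

One-sided: $H_0: \mu_F = \mu_M$ versus $H_1: \mu_F < \mu_M$.

Two-sided: $H_0: \mu_F = \mu_M$ versus $H_1: \mu_F \neq \mu_M$.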
No one can foresee the result of a hypothesis test, let alone the variability of the evidence, until we write the scripts that combine our prior thinking with the data samples - that is, until the Bayesian model is implemented.
Hopefully you enjoyed the video and learned something meaningful about Bayesian hypothesis testing. Let's start to code in the next video!
In this lecture, we'll dive into several research questions about gender equality in campus placement and test hypotheses about them. Essentially, we compare different groups of inputs and examine their posterior distributions: are they significantly different from each other? Do they overlap? Let's start coding!
# https://www.kaggle.com/duncankmckinnon/statistics-of-campus-placement-data
# So let's import several useful packages for data science, pandas, seaborn, numpy and matplotlib
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Campus recruitment data is often recorded for many universities because it is a strategy to engage, source
# and hire talented students and recent graduates for internship or entry-level positions.
# These recruitment might take place in career fairs and placement seasons.
# Now let's read the campus placement data from the assets folder
placement = pd.read_csv("assets/placement_data_full_class.csv")
# and take a look!
placement
# You see that we have rich data here. This data set consists of placement records of 215 students on
# a university campus in India, according to Kaggle.
# It includes secondary and higher secondary school percentages and specialization.
# It also includes degree specialization and type, work experience, and the salary offered to placed students.
sl_no | gender | ssc_p | ssc_b | hsc_p | hsc_b | hsc_s | degree_p | degree_t | workex | etest_p | specialisation | mba_p | status | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | M | 67.00 | Others | 91.00 | Others | Commerce | 58.00 | Sci&Tech | No | 55.0 | Mkt&HR | 58.80 | Placed | 270000.0 |
1 | 2 | M | 79.33 | Central | 78.33 | Others | Science | 77.48 | Sci&Tech | Yes | 86.5 | Mkt&Fin | 66.28 | Placed | 200000.0 |
2 | 3 | M | 65.00 | Central | 68.00 | Central | Arts | 64.00 | Comm&Mgmt | No | 75.0 | Mkt&Fin | 57.80 | Placed | 250000.0 |
3 | 4 | M | 56.00 | Central | 52.00 | Central | Science | 52.00 | Sci&Tech | No | 66.0 | Mkt&HR | 59.43 | Not Placed | NaN |
4 | 5 | M | 85.80 | Central | 73.60 | Central | Commerce | 73.30 | Comm&Mgmt | No | 96.8 | Mkt&Fin | 55.50 | Placed | 425000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
210 | 211 | M | 80.60 | Others | 82.00 | Others | Commerce | 77.60 | Comm&Mgmt | No | 91.0 | Mkt&Fin | 74.49 | Placed | 400000.0 |
211 | 212 | M | 58.00 | Others | 60.00 | Others | Science | 72.00 | Sci&Tech | No | 74.0 | Mkt&Fin | 53.62 | Placed | 275000.0 |
212 | 213 | M | 67.00 | Others | 67.00 | Others | Commerce | 73.00 | Comm&Mgmt | Yes | 59.0 | Mkt&Fin | 69.72 | Placed | 295000.0 |
213 | 214 | F | 74.00 | Others | 66.00 | Others | Commerce | 58.00 | Comm&Mgmt | No | 70.0 | Mkt&HR | 60.23 | Placed | 204000.0 |
214 | 215 | M | 62.00 | Central | 58.00 | Others | Science | 53.00 | Comm&Mgmt | No | 89.0 | Mkt&HR | 60.22 | Not Placed | NaN |
215 rows × 15 columns
Here our data has 215 records and 15 columns in total; let's look at a brief description of the variables.
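# A quick way to get that description (a small aside, not in the original notebook) is pandas' summary methods:
placement.describe()
# placement.info() also lists each column's dtype and its number of non-null values.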
# We can visually inspect whether gender affects salary by drawing a kernel density plot
sns.kdeplot(data = placement, x = 'salary', hue = 'gender')
# The graph shows that:
# 1. Most of the placed candidates received a decent package of around 3 lakhs PA
# 2. Not many candidates received a salary of more than 4 lakhs PA
<AxesSubplot:xlabel='salary', ylabel='Density'>
# We can draw 3 subplots in one figure to show the counts of placement status, gender and prior work experience -
# there are numerous ways to do this, but specifying the fig and ax objects works fine.
fig, ax = plt.subplots(1, 3, figsize = (15, 4))
sns.countplot(x = placement['status'], ax = ax[0])
sns.countplot(x = placement['gender'], ax = ax[1])
sns.countplot(x = placement['workex'], ax = ax[2])
# We find interesting patterns in these three visualizations:
# 1. More people are placed in a job than not placed
# 2. There are more males recorded in this data set than females
# 3. A higher proportion of people have no prior work experience than have some
<AxesSubplot:xlabel='workex', ylabel='count'>
# Although we can generate graphs for each factor we are interested in, status, gender and workex
# are still categorical, so we can't use them directly in a model unless we assign them values, such
# as 0 for female and 1 for male candidates. This can be achieved by label encoding:
# Since this is the first time we use a label encoder, let's import it from sklearn.
from sklearn.preprocessing import LabelEncoder
# It's a one-line encoder that transforms binary categorical variables into 0 and 1.
# We can apply the label encoder function for the category which has only two types of classes
# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
# We create a list of categorical variables that we're going to transform
object_cols = ['gender', 'workex', 'status']
# and write a for loop for col in all object columns,
for col in object_cols:
    # We call the label_encoder to transform each object column into 0-1 binary form
    placement[col] = label_encoder.fit_transform(placement[col])
# and now we see the data again.
placement.head()
# OK, now you see that the gender variable, the work experience variable, and the status variable
# are labeled zero and one - they have all become binary.
sl_no | gender | ssc_p | ssc_b | hsc_p | hsc_b | hsc_s | degree_p | degree_t | workex | etest_p | specialisation | mba_p | status | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 67.00 | Others | 91.00 | Others | Commerce | 58.00 | Sci&Tech | 0 | 55.0 | Mkt&HR | 58.80 | 1 | 270000.0 |
1 | 2 | 1 | 79.33 | Central | 78.33 | Others | Science | 77.48 | Sci&Tech | 1 | 86.5 | Mkt&Fin | 66.28 | 1 | 200000.0 |
2 | 3 | 1 | 65.00 | Central | 68.00 | Central | Arts | 64.00 | Comm&Mgmt | 0 | 75.0 | Mkt&Fin | 57.80 | 1 | 250000.0 |
3 | 4 | 1 | 56.00 | Central | 52.00 | Central | Science | 52.00 | Sci&Tech | 0 | 66.0 | Mkt&HR | 59.43 | 0 | NaN |
4 | 5 | 1 | 85.80 | Central | 73.60 | Central | Commerce | 73.30 | Comm&Mgmt | 0 | 96.8 | Mkt&Fin | 55.50 | 1 | 425000.0 |
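# A quick aside (not in the original notebook): LabelEncoder assigns integer codes in sorted order of
# the original labels, which is why 'F' maps to 0 and 'M' to 1. After the loop, classes_ holds the
# mapping for the last column fitted (here 'status'):
dict(enumerate(label_encoder.classes_))   # {0: 'Not Placed', 1: 'Placed'}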
In this tutorial, we'll analyze data about campus recruitment to answer two questions:
Does Gender affect salary?
Does Gender affect the probability of placement?
# Let's import the pymc3 and arviz packages so we can build and diagnose models.
import pymc3 as pm
import arviz as az
# Research question 1: Does gender affect salary?
# A good practice before starting to model is to subset the data properly and save the subsets
# into separate variables. We first store the salaries of male employees, with gender equal to 1.
salary_male = placement[placement['gender'] == 1]['salary']
# and store another variable, salary_female, with the salaries of female employees, gender equal to 0.
salary_female = placement[placement['gender'] == 0]['salary']
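# A note on missing values (an aside, not in the original notebook): candidates who were not placed
# have NaN salaries, so these two subsets still contain missing values. If we pass them to PyMC3 as-is,
# the sampler will impute them from the sampling distribution (you'll see an ImputationWarning below);
# the alternative is to drop them up front with .dropna(), as we will do in the Bayes Factor video.
salary_male.isna().sum(), salary_female.isna().sum()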
# We wrap the model in a context manager and give it a meaningful name, gender_salary_ht
# (gender-salary hypothesis test).
with pm.Model() as gender_salary_ht:
    # Priors: assume the same salary for both genders a priori, with a mean of 300,000 and a
    # standard deviation of 50,000 for both mu_male and mu_female.
    mu_male = pm.Normal('mu_male', mu = 300000, sigma = 50000)
    mu_female = pm.Normal('mu_female', mu = 300000, sigma = 50000)
    # It's worth dwelling on sigma - the standard deviation. It tells you how much spread to
    # expect in the salaries of placed individuals; its absolute minimum is 0, which would mean
    # everyone with a placement earns exactly the same salary.
    sigma = pm.HalfNormal('sigma', 20000)
    # Hypothesis: same salary level between genders
    salary_diff = pm.Deterministic('salary_diff', mu_male - mu_female)
    # Likelihood: since we're conducting hypothesis testing, we create both the male and the
    # female estimates so we can compare their difference, based on the data subset for each gender.
    y_male = pm.Normal('y_male', mu = mu_male, sigma = sigma, observed = salary_male)
    y_female = pm.Normal('y_female', mu = mu_female, sigma = sigma, observed = salary_female)
    trace = pm.sample(3000, chains = 3, tune = 500)
# Take a coffee break for 30 seconds
ImputationWarning: Data in y_male contains missing values and will be automatically imputed from the sampling distribution. ImputationWarning: Data in y_female contains missing values and will be automatically imputed from the sampling distribution. Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (3 chains in 4 jobs) NUTS: [y_female_missing, y_male_missing, sigma, mu_female, mu_male]
Sampling 3 chains for 500 tune and 3_000 draw iterations (1_500 + 9_000 draws total) took 16 seconds.
# The first visualization we can make to diagnose the simulated result is a traceplot.
pm.traceplot(trace, var_names=['mu_male', 'mu_female', 'sigma', 'salary_diff'])
# In the right panel we see fluctuations in these variables without any special pattern,
# indicating that the posterior samples mix well.
# Looking at the left panel, however, we do notice a salary difference of a couple of tens of
# thousands. That is, the point with the highest posterior density for the salary_diff variable
# is located at around 30,000. This means male employees tend to earn a higher salary than
# female employees, but we need to look at the 95% highest posterior density interval to
# determine whether male employees earn a significantly higher salary.
DeprecationWarning: The function `traceplot` from PyMC3 is just an alias for `plot_trace` from ArviZ. Please switch to `pymc3.plot_trace` or `arviz.plot_trace`.
array([[<AxesSubplot:title={'center':'mu_male'}>, <AxesSubplot:title={'center':'mu_male'}>], [<AxesSubplot:title={'center':'mu_female'}>, <AxesSubplot:title={'center':'mu_female'}>], [<AxesSubplot:title={'center':'sigma'}>, <AxesSubplot:title={'center':'sigma'}>], [<AxesSubplot:title={'center':'salary_diff'}>, <AxesSubplot:title={'center':'salary_diff'}>]], dtype=object)
# Now let's look at the posterior distribution together with some useful statistics,
# and make a 95% highest posterior density plot for each of the variables of interest - we
# specifically look at the salary_diff variable.
az.plot_posterior(trace, var_names=['mu_male', 'mu_female', 'sigma', 'salary_diff'], hdi_prob = 0.95)
# Cool! You can see that the mean salary difference shown in the 95% highest posterior density
# interval is as high as approximately 30,000. But the lower end of the 95% HDI still crosses into
# negative territory. In this case, although we're tempted to draw a conclusion about gender
# inequality, with our prior and our placement data we do not have sufficient evidence to conclude
# that the salary distribution is unequal between genders. Conceptually speaking, we can claim with
# 95% credibility that the salary distribution is unequal between genders only if the entire 95%
# HDI falls on the positive side (males earn a significantly higher salary than females) or on the
# negative side (females earn a significantly higher salary than males).
array([<AxesSubplot:title={'center':'mu_male'}>, <AxesSubplot:title={'center':'mu_female'}>, <AxesSubplot:title={'center':'sigma'}>, <AxesSubplot:title={'center':'salary_diff'}>], dtype=object)
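# As a numeric companion to the plot above (a quick sketch, not in the original notebook), we can pull
# the 95% HDI for the salary difference directly instead of reading it off the figure:
az.hdi(trace['salary_diff'], hdi_prob = 0.95)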
# When we learned Bayesian regression, we developed a reliability check based on three criteria:
# 1) convergence, 2) effective sample size, and 3) autocorrelation.
# Convergence: let's first check rhat for each variable using the az.rhat function. It is an
# alternative way of getting the rhat statistic, just like using the pm.summary() function.
az.rhat(trace, var_names=['mu_male', 'mu_female', 'sigma', 'salary_diff'])
# You can see that for all variables the rhat statistics are very close to 1.0 - the chains for the
# salary difference have converged well. From the convergence viewpoint, the results from the
# posterior samples are reliable: the estimated salary difference is not an artifact of poor mixing.
<xarray.Dataset>
Dimensions:      ()
Data variables:
    mu_male      float64 1.0
    mu_female    float64 1.0
    sigma        float64 1.0
    salary_diff  float64 1.0
array(1.00039848)
array(1.00003403)
array(1.00036303)
array(1.00026887)
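# If you prefer numbers over plots (a quick sketch, not in the original notebook), az.summary reports
# r_hat together with the bulk and tail effective sample sizes in a single table:
az.summary(trace, var_names=['mu_male', 'mu_female', 'sigma', 'salary_diff'])[['ess_bulk', 'ess_tail', 'r_hat']]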
# Next, we should check autocorrelation. We can use the plot_autocorr function to draw the
# autocorrelation from the first to the last of the 3000 posterior draws. Note that the maximum
# number of lags shown is 100 by default, and I did not change it, because in most cases the
# autocorrelation drops so sharply that it becomes negligible well before the 100th lag.
az.plot_autocorr(trace,
var_names=['mu_male', 'mu_female', 'sigma', 'salary_diff'],
combined = True)
# We can say that the posterior samples are reliable by looking at the dwindling autocorrelation
# in each plot. We see no pattern suggesting that later posterior samples depend on their previous
# values: by the end of the 100 lags, every plot shows the autocorrelation dropping to around 0.
array([<AxesSubplot:title={'center':'mu_male'}>, <AxesSubplot:title={'center':'mu_female'}>, <AxesSubplot:title={'center':'sigma'}>, <AxesSubplot:title={'center':'salary_diff'}>], dtype=object)
# Now let's assess the effective sample size in a forest plot. Since a chain with high
# autocorrelation has fewer independent samples than the total number of posterior samples, we can
# use the plot_forest function and set the ess option to True to check whether the effective sample
# size is above the threshold of 200.
az.plot_forest(trace, var_names=['salary_diff'], ess = True, hdi_prob = 0.95)
# You see that the effective sample size is high - around 10,000 - so we have more than enough
# effective samples to obtain a stable posterior distribution for the salary difference.
# Finally, we look at the highest posterior density plot shown in the left panel.
# Each of the three blue lines represents the 95% highest posterior density interval for one chain.
# You can see that two out of three lines lie entirely on the positive side (you might get different
# results if you rerun the model, due to randomness), but one line crosses 0.
# So we still rely on the HDI plot, and since we've seen negative values inside the 95% HPD interval,
# we do not have enough evidence to bolster the claim that gender inequality in salary has taken
# place in this job market. Great! Now let's further explore the topic of gender equality.
array([<AxesSubplot:title={'center':'95.0% HDI'}>, <AxesSubplot:title={'center':'ess'}>], dtype=object)
# Research question 2: Does gender affect placement?
# We've just examined the salary difference between genders. Now, is there a significant difference
# between genders in terms of the probability of getting a job placement?
# We first store the placement results of male candidates, with gender equal to 1.
status_male = placement[placement['gender'] == 1]['status']
# and store another variable, status_female, with the placement status of female candidates,
# gender equal to 0. In both cases the value 1 means the individual was placed in a job, and 0
# means the individual was not placed.
status_female = placement[placement['gender'] == 0]['status']
# Again, we create a PyMC3 model, this time called gender_placement_ht (gender-placement
# hypothesis test).
with pm.Model() as gender_placement_ht:
    # Priors: assume the same placement probability for both genders a priori, using a
    # Beta(30, 30) prior centered at 0.5 for both prob_male and prob_female.
    prob_male = pm.Beta('prob_male', alpha = 30, beta = 30)
    prob_female = pm.Beta('prob_female', alpha = 30, beta = 30)
    # Hypothesis: same placement probability between genders
    placement_diff = pm.Deterministic('placement_diff', prob_male - prob_female)
    # Likelihood: since we're conducting hypothesis testing, we create both the male and the
    # female estimates so we can compare their difference, based on the data subset for each gender.
    y_male = pm.Bernoulli('y_male', p = prob_male, observed = status_male)
    y_female = pm.Bernoulli('y_female', p = prob_female, observed = status_female)
    trace = pm.sample(3000, chains = 3, tune = 500)
# Take a coffee break for 30 seconds
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (3 chains in 4 jobs) NUTS: [prob_female, prob_male]
Sampling 3 chains for 500 tune and 3_000 draw iterations (1_500 + 9_000 draws total) took 111 seconds.
# Cool! The simulation is finished and we can diagnose the posterior samples in a traceplot.
# I saw the deprecation message about traceplot earlier, so for future visualizations I'll change
# the code to az.plot_trace; the visualization produced is exactly the same.
az.plot_trace(trace)
# We see fluctuations in the prob_male, prob_female and placement_diff variables without any
# special pattern in the right panel, indicating that the posterior samples mix reasonably well.
# Looking at the left panel, however, we do notice some difference in the job placement probability.
# That is, the point with the highest posterior density for the placement_diff variable is located
# at around 0.08. This means male candidates have about an 8% higher chance of getting a job offer
# in the region of analysis.
# Note that in many countries women are more likely to get hired, so this example does not let us
# conclude that "in this Bayesian class we learnt that men are more likely to get hired". The world
# is complex, and we have a substantial amount of data and many metrics available for comparing
# genders.
array([[<AxesSubplot:title={'center':'prob_male'}>, <AxesSubplot:title={'center':'prob_male'}>], [<AxesSubplot:title={'center':'prob_female'}>, <AxesSubplot:title={'center':'prob_female'}>], [<AxesSubplot:title={'center':'placement_diff'}>, <AxesSubplot:title={'center':'placement_diff'}>]], dtype=object)
# Now let's assess the posterior results, together with the effective sample size and the
# convergence diagnostic r_hat, in a forest plot.
az.plot_forest(trace, var_names=['placement_diff'], r_hat = True, ess = True, hdi_prob = 0.95)
# You see that the effective sample size is above 5,000 for the difference in placement probability
# between genders, so we have sufficient effective samples to obtain a stable posterior distribution.
# We also see that r_hat sits at 1, so the r_hat metric indicates that all chains converged well,
# allowing us to treat the estimated difference in job placement probability as resting on sound
# sampling. Finally, we look at the 95% highest posterior density interval. The posterior mean of
# all three chains hovers around 0.08, but the lower end crosses 0. So although the average
# difference in placement proportion (male minus female) is approximately 8 percentage points -
# female candidates have a lower placement proportion than male candidates - at the 95% credibility
# level we do not have sufficient evidence to establish that the difference in job placement
# probability is substantial. Again, we cannot say that the placement difference observed in the
# data indicates significant gender inequality; the posterior samples do not provide enough
# evidence to claim that gender inequality is significant in this area.
array([<AxesSubplot:title={'center':'95.0% HDI'}>, <AxesSubplot:title={'center':'ess'}>, <AxesSubplot:title={'center':'r_hat'}>], dtype=object)
# If we want to know exactly how strongly our posterior sample indicates a difference,
# we can make a posterior plot with the reference value set to 0.
az.plot_posterior(trace, var_names = ['placement_diff'], ref_val = 0)
# In this case, you see the quantity "92.8%" in orange text to the right of the reference value 0.
# It does not have to be exactly 92.8% (it's fine if you see a slightly different value); it means
# that 92.8% of the posterior mass for the placement difference lies above 0, i.e. we are 92.8%
# confident that placement opportunities differ between genders in that direction. Choosing the
# level of credibility matters, because a higher level demands stronger evidence before we declare
# a significant difference between groups in Bayesian hypothesis testing. In this case 92.8% is
# smaller than 95%, so we cannot support our claim with the placement data at the 95% level.
<AxesSubplot:title={'center':'placement_diff'}>
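# As a side note (a sketch, not part of the original notebook): az.plot_posterior also accepts a rope
# argument, a region of practical equivalence around the null value. The interval (-0.05, 0.05) below
# is an arbitrary choice for illustration; the plot then reports how much posterior mass falls inside it.
az.plot_posterior(trace, var_names = ['placement_diff'], ref_val = 0, rope = (-0.05, 0.05))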
So far we've seen how to conduct Bayesian hypothesis testing to compare male and female candidates in terms of job placement probability and average salary. These tests are comparative in nature, and the advantages of Bayesian estimation are two-fold.
The Bayesian methods show you not only a point estimate of how large a group difference (such as the average salary gap) is, but also its uncertainty. Knowing the uncertainty helps us interpret the plausible range of results along with how likely each value is.
The Bayesian methods give the posterior density of every possible parameter value given the data, so they help us form an updated belief not only about which values are possible but also about how strongly each is supported.
In this lecture, you've seen how to test a hypothesis using PyMC3. Bayesian hypothesis testing provides a structured way to quantify the reasoning we do in our daily lives. One last note: if you disagree with the prior I provided, that is totally acceptable - we can then observe the same data but reach different conclusions. In many situations the logic of Bayesian thinking mirrors how we think naturally: we update our initial belief by constantly incorporating new evidence. Make sure you now know how to specify a problem clearly, run an appropriate model, and customize the posterior plots - for example with the ref_val, hdi_prob and rope options, which let you mark the null-hypothesis quantity as a reference value, set the level of credibility, and define the region of practical equivalence that the quantity should fall outside before we confidently decide in favor of the alternative hypothesis. I hope you enjoyed the first Bayesian hypothesis testing example. I'll see you in the next video!
I would like to give special thanks to Dr. Dhimant Ganatara, a Professor at Jain University, for creating the Kaggle dataset that I used for this Bayesian analysis.
In this lecture, we'll discuss another type of hypothesis testing - a proportional type called the Bayes Factor. We will outline an experiment, obtain the Bayes Factor to test competing hypotheses, and use posterior distribution results from previous experiments to update the prior distributions for subsequent experiments.
In the group comparison video, we examined the campus recruitment data to find out whether gender inequality exists in workers' salaries and in the chance of job placement. We compared the average salaries and the job placement probability between male and female candidates, using the 95% highest posterior density interval as the evidence for our conclusion. Some Bayesian statisticians, however, prefer comparing competing claims directly: rather than comparing gender groups on salary, they test pairs of competing claims and figure out under which claim we learn more from the data.
This notion leads us to today's topic - the Bayes Factor. Introduced by Harold Jeffreys, the Bayes factor is a Bayesian method of hypothesis testing that expresses the relative odds of two or more competing hypotheses and is usually used to determine which model better fits the data (Jeffreys 1961). The Bayes Factor is a ratio that represents how much we've learnt about our competing hypotheses from the data.
# Let's now look at the first few rows of the data again.
placement.head(5)
# In the placement data, we've seen that some people have prior work experience and some do not.
# We're curious whether prior work experience is associated with a higher salary upon job placement.
sl_no | gender | ssc_p | ssc_b | hsc_p | hsc_b | hsc_s | degree_p | degree_t | workex | etest_p | specialisation | mba_p | status | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 67.00 | Others | 91.00 | Others | Commerce | 58.00 | Sci&Tech | 0 | 55.0 | Mkt&HR | 58.80 | 1 | 270000.0 |
1 | 2 | 1 | 79.33 | Central | 78.33 | Others | Science | 77.48 | Sci&Tech | 1 | 86.5 | Mkt&Fin | 66.28 | 1 | 200000.0 |
2 | 3 | 1 | 65.00 | Central | 68.00 | Central | Arts | 64.00 | Comm&Mgmt | 0 | 75.0 | Mkt&Fin | 57.80 | 1 | 250000.0 |
3 | 4 | 1 | 56.00 | Central | 52.00 | Central | Science | 52.00 | Sci&Tech | 0 | 66.0 | Mkt&HR | 59.43 | 0 | NaN |
4 | 5 | 1 | 85.80 | Central | 73.60 | Central | Commerce | 73.30 | Comm&Mgmt | 0 | 96.8 | Mkt&Fin | 55.50 | 1 | 425000.0 |
We might have mixed perspectives on this question. In the job market, many production-line companies offer higher wages to new hires than they pay more senior employees - a form of pay compression. Other companies, such as tech companies, tend to pay new hires slightly less than current employees in the same position. One reason for paying new hires a higher salary is that companies want to attract potential workers who will stay and work long term. If we want to compare different hypotheses (say, one that believes new hires receive a higher paycheck than senior employees and one that believes the opposite), we want to look at the ratio between the two hypotheses. We are interested in how much observing the placement data would change the first belief relative to the second - that is, whether prior work experience adds to one's salary or diminishes it.
This is a very typical example of how the Bayes Factor can compare the probability of competing hypotheses given the data we observed. We extend the notion of asking whether the data give evidence of a difference in salary distribution due to gender (an absolute question) to asking which prior belief, or which hypothesis, lets you gain more information from the data than the competing hypothesis.
The Bayes factor for comparing two models can be computed as the ratio of the marginal likelihoods of the data under the two competing hypotheses. So instead of computing the posterior distribution of a particular parameter via Bayes' rule, we compute the ratio of the two marginal likelihoods of the data. To calculate this ratio - the Bayes Factor - we need the prior distribution under each hypothesis and the posterior distribution sampled by the MCMC algorithms in PyMC3; the formula below makes this explicit. We'll then look at a chart that shows how to read the strength of evidence from the Bayes Factor.
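Written out (a standard definition, using notation that is not in the original transcript), for data $D$ and hypotheses $H_0$ and $H_1$ with parameters $\theta_0$ and $\theta_1$:

$$BF_{01} = \frac{p(D \mid H_0)}{p(D \mid H_1)} = \frac{\int p(D \mid \theta_0, H_0)\, p(\theta_0 \mid H_0)\, d\theta_0}{\int p(D \mid \theta_1, H_1)\, p(\theta_1 \mid H_1)\, d\theta_1},$$

where each marginal likelihood averages the likelihood over that hypothesis's prior.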
Now let's explore Bayes Factor with an example.
Bayes Factor Source: https://zj-zhang.github.io/2016/12/31/Bayes-Factor/
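As a rough reading guide (one common adaptation of Jeffreys' scale, summarized from memory rather than from the linked chart): a Bayes Factor between 1 and 3 counts as anecdotal evidence for the hypothesis in the numerator, 3 to 10 as moderate, 10 to 30 as strong, 30 to 100 as very strong, and above 100 as extreme. Values below 1 favor the hypothesis in the denominator on the mirrored scale (1/3 to 1 anecdotal, 1/10 to 1/3 moderate, and so on).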
# Research question: Does gender affect salary?
salary_male = placement[placement['gender'] == 1]['salary'].dropna()
salary_female = placement[placement['gender'] == 0]['salary'].dropna()
# We can assume that males naturally earn a higher salary to justify the first prior setting, and
# assume that there is no difference between the genders to justify the second prior setting.
priors = ((300000, 50000, 200000, 50000), (300000, 50000, 300000, 50000))
models = []
traces = []
# Let's create a for loop that, for each assumption, unpacks the prior mean and standard deviation
# of the male salary and the prior mean and standard deviation of the female salary.
for mean_male, sigma_male, mean_female, sigma_female in priors:
    # Again, for each prior setting, we wrap the model in a context manager and call it bf_model
    with pm.Model() as bf_model:
        # We take the model from the last video, where we compared the salary distributions of
        # male and female candidates, but this time we substitute the prior values with the
        # variables defined for the two assumptions. The first model encodes the prior belief that
        # males earn more (300,000) than females (200,000); the second model assumes both genders
        # earn the same salary (300,000). For both models, we set the prior standard deviation to
        # 50,000 because salaries can differ among the placed candidates.
        mu_male = pm.Normal('mu_male', mu = mean_male, sigma = sigma_male)
        mu_female = pm.Normal('mu_female', mu = mean_female, sigma = sigma_female)
        # Likelihood: we assume the same observation noise under both prior settings, because here
        # we are not interested in comparing how uncertain the salary estimates are. We only
        # compare the "gender-equality" prior against the "gender-inequality" prior to see which
        # prior setting brings more information to the model fit.
        y_male = pm.Normal('y_male', mu = mu_male, sigma = 10000, observed = salary_male)
        y_female = pm.Normal('y_female', mu = mu_female, sigma = 10000, observed = salary_female)
        # Finally, to obtain the Bayes Factor we use the sequential Monte Carlo (SMC) sampler
        # instead of the default NUTS sampler, so the result includes not only the posterior draws
        # but also an estimate of the marginal likelihood.
        trace = pm.sample_smc(3000)
        # Append the model and the trace object to the lists we defined at the beginning.
        models.append(bf_model)
        traces.append(trace)
# Note that the advantage of the sequential Monte Carlo sampler is that it can estimate the
# marginal likelihood even without a closed-form mathematical expression. The cost is that it is
# less efficient than NUTS.
Initializing SMC sampler... Sampling 4 chains in 4 jobs Stage: 0 Beta: 0.001 Stage: 1 Beta: 0.003 Stage: 2 Beta: 0.010 Stage: 3 Beta: 0.035 Stage: 4 Beta: 0.119 Stage: 5 Beta: 0.403 Stage: 6 Beta: 1.000 (similar stage logs repeated for each chain and for the second model)
# Cool! Now let's do some Bayes Factor inference. We can compute the Bayes Factor by taking the
# difference of the log marginal likelihoods of the two models and then exponentiating it with np.exp.
BF_smc = np.exp(traces[0].report.log_marginal_likelihood - traces[1].report.log_marginal_likelihood)
BF_smc
# We can see that the Bayes Factors are smaller than 1. I'm surprised by this result! Recalling the
# Bayes Factor table, it means there is anecdotal - i.e. weak - evidence that the prior with the same
# salary distribution for both genders outperforms the prior with different salary distributions.
# In other words, assuming that males earn more than females gives slightly worse marginal likelihood.
# But how different are the posteriors we get from these two models?
array([0.51811128, 0.52678248, 0.46620985, 0.51487289])
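# Why four numbers? SMC ran four chains, and trace.report.log_marginal_likelihood stores one estimate
# per chain, so BF_smc is an array with four values. A single summary figure (a simple sketch) is just
# their average:
BF_smc.mean()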
# Let's look at the summary for the model with unequal prior.
az.summary(traces[0]).round(2)
/Users/michiganboy/opt/anaconda3/lib/python3.8/site-packages/arviz/data/io_pymc3.py:96: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context. warnings.warn(
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
mu_male | 298911.29 | 1005.05 | 297122.55 | 300910.86 | 9.02 | 6.38 | 12429.0 | 11808.0 | 1.0 |
mu_female | 267221.49 | 1462.27 | 264475.04 | 269938.47 | 13.30 | 9.40 | 12092.0 | 11481.0 | 1.0 |
# and the summary for the model with equal prior.
az.summary(traces[1]).round(2)
# We might argue that the two models are actually very similar even though they have different priors;
# both models converge and their posterior estimates are reliable, with more than 10,000 effective samples.
/Users/michiganboy/opt/anaconda3/lib/python3.8/site-packages/arviz/data/io_pymc3.py:96: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context. warnings.warn(
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
mu_male | 298903.43 | 1007.02 | 297018.45 | 300782.17 | 9.20 | 6.51 | 11967.0 | 11733.0 | 1.0 |
mu_female | 267321.32 | 1433.62 | 264752.59 | 270118.29 | 13.06 | 9.23 | 12056.0 | 11930.0 | 1.0 |
# We can also use the following code to compare the posterior predictive distributions of male salary
# under the two prior settings given to the model. Let's run it directly.
_, ax = plt.subplots()
ppc_0 = pm.sample_posterior_predictive(traces[0], 100, models[0], size = (100, 20))
ppc_1 = pm.sample_posterior_predictive(traces[1], 100, models[1], size = (100, 20))
for m_0, m_1 in zip(ppc_0['y_male'].T, ppc_1['y_male'].T):
    az.plot_kde(np.mean(m_0, 0), ax = ax, plot_kwargs = {"color": "C0"})
    az.plot_kde(np.mean(m_1, 0), ax = ax, plot_kwargs = {"color": "C1"})
ax.plot([], label = "model_different_prior")
ax.plot([], label = "model_same_prior")
ax.legend()
ax.set(xlabel = "male salary prediction")
<matplotlib.legend.Legend at 0x7f8c7ceeee50>
# We can use the same code to compare the posterior predictive distributions of female salary
# under the two prior settings given to the model. Let's run it directly.
_, ax = plt.subplots()
ppc_0 = pm.sample_posterior_predictive(traces[0], 100, models[0], size = (100, 20))
ppc_1 = pm.sample_posterior_predictive(traces[1], 100, models[1], size = (100, 20))
for m_0, m_1 in zip(ppc_0['y_female'].T, ppc_1['y_female'].T):
    az.plot_kde(np.mean(m_0, 0), ax = ax, plot_kwargs = {"color": "C0"})
    az.plot_kde(np.mean(m_1, 0), ax = ax, plot_kwargs = {"color": "C1"})
ax.plot([], label = "model_different_prior")
ax.plot([], label = "model_same_prior")
ax.legend()
ax.set(xlabel = "female salary prediction")
[Text(0.5, 0, 'female salary prediction')]
Although both models appear quite consistent in the salary distribution for each gender, we can still observe that the posterior predictive distributions sit slightly closer to the actual data under the model with the same prior for both genders. This difference is what the Bayes factors capture, even though the evidence is only anecdotal. This example also shows that two models with different Bayes factors can still make quite similar predictions: when the number of observations is large, the data provide enough information to reduce the effect of the prior on the posterior distribution.
The Bayes factor plays a significant role in Bayesian hypothesis testing. Here we used it to compare two prior assumptions for the same model. Unlike standard hypothesis testing, it computes the relative plausibility of the models rather than directly measuring the size of a difference, so it is fine to remain uncertain about whether either of the two hypotheses must be true. Hopefully you've learnt both the techniques and the mindset of using the Bayes factor. See you in the next video!
We need to determine which statistical model best represents the situation among the different choices we come up with. In this lecture, we will look at several methods for finding the best model among different candidates. First, we start with information criteria.
Several information criteria are typically used in Bayesian statistics to compare different model specifications. In Bayesian inference, the most common ways of assessing the goodness of fit of an estimated statistical model are the Deviance Information Criterion (DIC) and the Widely Applicable Information Criterion (WAIC).
DIC may be compared across different models and even different methods, as long as the dependent variable does not change between models, making DIC a highly flexible model fit statistic.
DIC is valid only when the joint posterior distribution is approximately multivariate normal. Models with smaller DIC should be preferred. Since DIC increases with model complexity (pD or pV), simpler models are favored.
The Widely Applicable Information Criterion (WAIC) is an information criterion that is more fully Bayesian than DIC, and its result more closely resembles leave-one-out cross-validation.
When selecting the best model to represent the problem, we typically choose the one with the smallest information criterion value.
# Information criteria: AIC, BIC, DIC and WAIC
# https://docs.pymc.io/notebooks/model_comparison.html
# We use the az.loo function to check the LOO estimate for the placement model.
# (Note: `trace` here should be the trace sampled from gender_placement_ht; if you ran the Bayes
# Factor section in between, re-run that model first so `trace` is not one of the SMC traces.)
az.loo(trace, gender_placement_ht)
# You can see that the estimate for elpd_loo is -7.48, with a standard error of 0.58,
# and p_loo - the effective number of parameters estimated by LOO (leave-one-out cross-validation) -
# is 0.53. When comparing models on the deviance scale, lower values indicate a better fit; on the
# elpd scale reported here, a higher (less negative) elpd_loo indicates better expected out-of-sample fit.
Both DIC and WAIC relate to out-of-sample, or generalization, prediction. This is generally a good way to evaluate models, even when you care more about the parameters than about the predictions. The general idea is that if your model and parameters are a good description of the underlying phenomenon or process you are studying, they should be able to predict unobserved (but observable) future data.
If you get a warning, you have a couple of options (besides ignoring it): use another method, such as LOO instead of WAIC (or vice versa), use K-fold cross-validation, or change your model to one that is more robust. Of course, to compare your models you can also add posterior predictive checks (although these are an in-sample analysis) and background information. A little more about the warnings: they are based on empirical observations, and more work is needed here, but at this point they are the best diagnostics we have; tools that diagnose or at least visualize the problematic points would also help. Notice that when using DIC/BPIC you always get a tidy result without any warnings (even when the assumptions are not met), and that can lead to overconfidence!
DIC assumes the posterior is Gaussian; the further you move away from this assumption, the more misleading the DIC values will be. Hierarchical models, in particular, tend to have non-Gaussian posteriors. WAIC is also more Bayesian in the sense that it averages over the posterior distribution rather than relying on a point estimate.
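As a self-contained sketch of this workflow (a toy example added here, not part of the campus placement analysis), az.compare ranks models by their estimated out-of-sample predictive accuracy using WAIC or LOO:

# Toy comparison: the same data under a reasonable prior and an overly tight prior
import numpy as np
import pymc3 as pm
import arviz as az
y = np.random.normal(1.0, 1.0, size = 50)
with pm.Model() as m_wide:
    mu = pm.Normal('mu', 0, 10)
    pm.Normal('y', mu, 1.0, observed = y)
    idata_wide = pm.sample(1000, return_inferencedata = True)
with pm.Model() as m_tight:
    mu = pm.Normal('mu', 0, 0.1)
    pm.Normal('y', mu, 1.0, observed = y)
    idata_tight = pm.sample(1000, return_inferencedata = True)
# Rank 0 in the resulting table is the model with the best estimated predictive accuracy
az.compare({'wide_prior': idata_wide, 'tight_prior': idata_tight}, ic = 'waic')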
I hope you've gained a fundamental understanding of Bayesian hypothesis testing at this point. We've shown two Bayesian hypothesis testing paradigms: 1) traditional testing, where we examine the HPD interval, and 2) proportional testing, where we compare two competing hypotheses through the Bayes Factor. Using the Bayes Factor is justified in many group research scenarios - whenever a group has not reached consensus on the background knowledge, we may want to test how sensitive the conclusions are to the prior and how much information each prior lets us gain from the data. Congratulations! Let's continue exploring Bayesian estimation in the next video!