Video 1: Bayesian Hypothesis Testing Concept

In this video we'll introduce the high-level concepts of Bayesian hypothesis testing and walk through several commonly used methods with Python code examples.

Question 1: What is Bayesian Hypothesis Testing?

Hypothesis testing is a commonly used approach in statistical analysis. In statistics, a hypothesis is a premise or claim that we want to test or investigate. Since we can't always go to a lab and run experiments, one way to investigate a claim is to survey several hundred people and use those data as a representative sample of the population. So when we talk about hypothesis testing, or investigating a claim, we are talking about collecting a sample, extracting information from it, and then testing the hypothesis. Much as in science we have an idea or a problem and want to test that idea, in statistics we have to identify the hypothesis from reading the problem and then design a test to investigate it. The goal of hypothesis testing, therefore, is to make a decision based on what the data suggest about the claim we want to investigate. Back in the 1990s, deriving a posterior density for Bayesian inference was close to rocket science: testing claims with mathematics alone, without the efficient computational tools we enjoy today, was a frustrating experience.

Now that probabilistic programming languages like PyMC3 and JAGS have started to blossom, Bayesian methods for estimation have started to prevail, and hypothesis testing is no exception.

Question 2: What are different types of Bayesian Hypothesis Testing methods?

There is no one-size-fits-all way of conducting Bayesian hypothesis testing. However, all testing methods leverage 1) the prior distribution, 2) the likelihood function, and 3) the posterior distribution, derived either from Bayes' rule or from MCMC simulation, to make an informed decision. This means statistical claims are tested against a combination of knowledge sources rather than a single study, so Bayesian hypothesis testing draws on more reliable evidence and should be more convincing.
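As a minimal sketch of how ingredients 1) through 3) fit together, here is a grid approximation of Bayes' rule for a coin's success probability. The data counts k and n are made up for illustration; in later videos the posterior would come from PyMC3's MCMC sampler instead.

```python
# Grid approximation of a posterior: prior x likelihood, then normalize.
grid = [i / 100 for i in range(1, 100)]        # candidate success probabilities
prior = [1.0] * len(grid)                      # 1) a flat prior over the grid
k, n = 6, 9                                    # hypothetical data: 6 successes in 9 trials
likelihood = [p**k * (1 - p) ** (n - k) for p in grid]   # 2) binomial likelihood (up to a constant)
unnorm = [pr * li for pr, li in zip(prior, likelihood)]
posterior = [u / sum(unnorm) for u in unnorm]  # 3) posterior via Bayes' rule
```

With a flat prior the posterior mode sits near the sample proportion k/n, which is what Bayes' rule should give here.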

Bayesian hypothesis testing methods fall into three main types, which we'll cover in this series: 1) group comparison using posterior (HPD) intervals, 2) comparing competing hypotheses with Bayes Factors, and 3) model comparison using information criteria.

Question 3: What are null hypothesis and alternative hypothesis?

Okay! Now we've learnt what a Bayesian hypothesis test is, conceptually. We know that for each hypothesis test we either have enough evidence to support a claim or we do not. The null hypothesis is the default hypothesis that we test against with evidence. For example, if we are looking for evidence in data that gender inequality is happening in some part of the world, the null hypothesis represents the default claim that there is no difference between genders. The alternative hypothesis is the opposite of the null hypothesis; in this case, it claims that there is gender inequality supported by the data. Therefore, every testing problem has a dual pair of hypotheses - the null hypothesis and the alternative hypothesis. Let's come back to our example. We are testing the dual hypotheses as follows:
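For the gender-pay example, the pair of hypotheses might be written as below; I'm introducing the symbols mu_F and mu_M for the mean salaries of female and male employees purely for illustration.

```latex
H_0:\ \mu_F = \mu_M \quad \text{(null: no difference in mean salary between genders)}
H_1:\ \mu_F \neq \mu_M \quad \text{(alternative: the mean salaries differ)}
```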

Question 4: What are different categories of hypotheses testing?

For example, gender equality has raised tremendous dialogue and campaigns around the world. It is the state of equal ease of access to resources and opportunities regardless of gender, including economic participation and decision-making. If you want to figure out whether female employees are paid less than male employees in a certain country or area, you are involved in a two-sample, one-sided hypothesis test. "Two-sample" means that we use two samples to test the hypothesis - the male employee sample and the female employee sample. Either your claim is bolstered by the data, or it is not supported. If, however, you want to determine whether female employees are paid either much more or much less than male employees in a certain country, then you are conducting a two-sample, two-sided hypothesis test. Here your claim is supported if female employees are paid much less or much more than male employees; otherwise the claim lacks sufficient evidence, because the salaries might differ only slightly between genders, and the difference is not substantial.
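To make the one-sided versus two-sided distinction concrete, here is a sketch using simulated posterior draws of the salary difference. The numbers (the mean, spread, and the "substantial gap" threshold) are made up for illustration; in practice the draws would come from a fitted PyMC3 model.

```python
import random

random.seed(0)
# Hypothetical posterior draws of (female mean salary - male mean salary),
# standing in for samples from a fitted model.
diff = [random.gauss(-1.5, 1.0) for _ in range(10_000)]

# One-sided question: are female employees paid less?
p_less = sum(d < 0 for d in diff) / len(diff)

# Two-sided question: is the gap substantial in either direction?
# (the threshold 1.0 is an arbitrary choice for this sketch)
p_substantial = sum(abs(d) > 1.0 for d in diff) / len(diff)
```

The one-sided probability only asks about one direction of the difference, while the two-sided version accumulates evidence from both tails.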

No one can foresee the result of a hypothesis test, let alone the variability of the evidence, until we write scripts that combine our prior beliefs with the data samples - that is, until the Bayesian model is implemented.

Hopefully you enjoyed the video and learned something meaningful about Bayesian hypothesis testing. Let's start to code in the next video!

Video 2: Bayesian Hypothesis Testing - Group Comparison

In this lecture, we'll dive into several research questions regarding gender equality in campus placement and test the corresponding hypotheses. We basically compare different groups of inputs and examine their posterior distributions: are they significantly different from each other? Do they overlap? Let's start coding!

Data Variable Description:

Here our data has 215 records and 15 columns in total; let's look at the variable description list.

Research Questions:

In this tutorial, we'll analyze a dataset about campus recruitment to answer -

Question 1: What posterior diagnostics are necessary to interpret the results of a Bayesian hypothesis test?

Question 3: Why is Bayesian Hypothesis Testing useful?

So far we've seen how to conduct Bayesian hypothesis tests comparing male and female candidates in terms of job placement probability and average salary. These tests are comparative in nature, and the advantages of Bayesian estimation are two-fold.

In this lecture, you've seen how to test a hypothesis using PyMC3. Bayesian hypothesis testing provides a structured way to quantify the logical processes we go through in daily life. One last note: if you disagree with the prior I provide, that is totally acceptable - we can then observe the same data but reach different conclusions. In many situations, the logic of Bayesian thinking is similar to how we think naturally - we update our initial belief by constantly incorporating new evidence. By now you should know how to specify problems clearly, fit an appropriate model, and customize the posterior plots: the ref_val option marks the null hypothesis quantity as a reference value, hdi_prob specifies the level of credibility, and rope defines the region of practical equivalence that the quantity should fall outside of before we confidently decide in favor of the alternative hypothesis. I hope you enjoyed the first Bayesian hypothesis testing example. I'll see you in the next video!
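As a sketch of what those plot options compute under the hood, here is a hand-rolled highest-density interval and ROPE check on simulated posterior draws. The draws and the ROPE limits are made up for illustration; in practice you would read these quantities off the posterior plot with the ref_val, hdi_prob and rope options.

```python
import random

def hdi(samples, prob=0.94):
    # Shortest interval containing `prob` of the posterior draws.
    s = sorted(samples)
    n_in = int(round(prob * len(s)))
    best = min(range(len(s) - n_in + 1), key=lambda i: s[i + n_in - 1] - s[i])
    return s[best], s[best + n_in - 1]

random.seed(1)
# Hypothetical posterior draws of a group difference (e.g. salary gap)
draws = [random.gauss(2.0, 0.5) for _ in range(5_000)]

lo, hi = hdi(draws, prob=0.94)   # what hdi_prob controls
rope = (-0.1, 0.1)               # a ROPE around the null value 0 (the ref_val)
# Decide in favor of the alternative if the HDI lies entirely outside the ROPE.
outside_rope = hi < rope[0] or lo > rope[1]
```

If the HDI overlapped the ROPE instead, we would withhold judgment rather than reject the null.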

Acknowledgement

I would like to give special thanks to Dr. Dhimant Ganatara, a Professor at Jain University, for creating the Kaggle dataset that I used for this Bayesian analysis.

Video 3: Bayes Factor

Question 1: What is the motivation to introduce Bayes Factor?

In this lecture, we'll discuss another type of hypothesis testing - a ratio-based approach called the Bayes Factor. We will outline an experiment, obtain a Bayes Factor to test competing hypotheses, and use posterior distribution results from previous experiments to update prior distributions for subsequent experiments.

In the group comparison video, we examined the campus recruitment data to find out whether gender inequality exists in terms of workers' salaries and the chance of job placement. We compared the average salaries and job placement probabilities between male and female candidates, using the 95% highest posterior density interval as evidence for our conclusion. Some people in Bayesian statistics, however, prefer comparing competing claims. Rather than comparing gender groups on salary, there is a need to test pairs of competing claims and figure out under which claim we can learn more from the data.

Question 2: What is Bayes Factor?

This notion leads us to today's topic - the Bayes Factor. Introduced by Harold Jeffreys, the Bayes factor is a Bayesian method of hypothesis testing that expresses the relative odds of two or more competing hypotheses, and it is usually used to determine which model better fits the data (Jeffreys 1961). The Bayes Factor is a ratio that represents how much we've learnt about our competing hypotheses from the data.

Research question: Does having prior working experience generally mean a candidate will receive higher income?

We might have mixed perspectives on this question. In our job market, many production-line companies offer higher wages to new hires than they pay their more senior employees - a form of pay compression. Other companies, such as tech companies, tend to pay new hires slightly less than current employees in the same position. One reason for paying new hires a higher salary is that companies want to attract potential workers who will stay and work long term. If we want to compare different hypotheses (say, one believes that new hires receive a higher paycheck than senior employees and one believes the opposite), we want to look at the ratio of the two hypotheses. We are interested in how much the observed placement data change the first belief versus the second belief regarding whether prior working experience raises or diminishes one's salary.

This is a very typical example of how the Bayes Factor compares the probabilities of competing hypotheses given the data we observed. We extend the notion from asking whether the data give evidence of a difference in salary distribution due to gender (an absolute question) to comparing which prior belief, or hypothesis, lets you gain more information from the data than its competitor.

Question 3: How do Bayesian analysis practitioners quantify the Bayes Factor to determine the strength of evidence?

The Bayes factor for comparing two models is the ratio of the marginal likelihoods of the data under the two competing hypotheses. So instead of computing the posterior distribution of a parameter using Bayes' rule, we compute the ratio of two marginal likelihoods of the data. To calculate this ratio - the Bayes Factor - we need the prior distribution under each hypothesis and the posterior distribution sampled by the MCMC algorithms in PyMC3. Now let's look at the graph showing how to quantify the strength of evidence with the Bayes Factor.
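As a small sketch of the marginal-likelihood ratio, here is a conjugate Beta-Binomial example where the marginal likelihood has a closed form, so no MCMC is needed. The data counts and the two priors are made up for illustration.

```python
from math import exp, lgamma

def log_beta(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(k, n, a, b):
    # log P(data | H) for k successes in n trials under a Beta(a, b) prior;
    # the binomial coefficient is omitted because it cancels in the ratio.
    return log_beta(a + k, b + n - k) - log_beta(a, b)

k, n = 7, 10   # hypothetical data: 7 "higher salary" outcomes in 10 candidates
# H1: a prior expecting high rates, Beta(8, 2); H2: a prior expecting low rates, Beta(2, 8)
bf = exp(log_marginal(k, n, 8, 2) - log_marginal(k, n, 2, 8))
# bf > 1 means the data favor H1 over H2; bf < 1 would favor H2
```

Working on the log scale avoids numerical underflow when n gets large, which is why the ratio is exponentiated only at the end.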

Now let's explore Bayes Factor with an example.

Bayes Factor Source: https://zj-zhang.github.io/2016/12/31/Bayes-Factor/

Although both models appear quite consistent with the salary distribution across genders, we can still observe that the actual data are more concentrated under the model that assigns the same prior to both genders. This difference is captured by the Bayes factor, although the evidence is only anecdotal. In this example, we can also see that it is possible for two models to have different Bayes factors yet make quite similar predictions. In fact, when the number of observations is high, the data provide enough information to reduce the prior's effect on the posterior distribution.

The Bayes factor plays a significant role in Bayesian hypothesis testing. We've seen how to compare two prior assumptions on the same model. Unlike standard hypothesis testing, though, the Bayes factor computes the relative probabilities of the models rather than directly measuring their differences, so it's fine to feel uncertain about whether one of the two hypotheses must be true. Hopefully you've learnt the techniques and the mindset of using the Bayes factor. See you in the next video!

Video 4: Model Comparison: Information Criteria

Question 1: Why do we need to compare models?

We need to determine which statistical model best represents the situation among the different choices we come up with. In this lecture, we will look at several methods for finding the best model among different candidates, starting with information criteria.

Question 2: What are common methods of comparing Bayesian models? Which method(s) are typically Bayesian?

Several information criteria are typically used in Bayesian statistics to compare different model specifications. In Bayesian inference, the most common methods for assessing the goodness of fit of an estimated statistical model are the Deviance Information Criterion (DIC) and the Widely Applicable Information Criterion (WAIC).

DIC may be compared across different models and even different methods, as long as the dependent variable does not change between models, making DIC the most flexible model-fit statistic.

DIC is valid only when the joint posterior distribution is approximately multivariate normal. Models with smaller DIC should be preferred; since DIC increases with model complexity (pD or pV), simpler models are favored.

The Widely Applicable Information Criterion (WAIC) is an information criterion that is more fully Bayesian than DIC. The result of WAIC more closely resembles leave-one-out cross-validation.

Question 3: How to quantify the difference among models and choose the best one?

When selecting the best model to represent the problem, we typically choose the one with the smallest information criterion value.
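To illustrate the selection rule, here is a hand-rolled WAIC computed from pointwise log-likelihood draws for two hypothetical models. In practice PyMC3 computes this for you; the observations and posterior draws below are simulated purely for the sketch.

```python
import numpy as np

def waic(log_lik):
    # log_lik: array of shape (n_draws, n_obs) of pointwise log-likelihoods.
    lppd = np.sum(np.log(np.mean(np.exp(log_lik), axis=0)))   # log pointwise predictive density
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))          # effective number of parameters
    return -2 * (lppd - p_waic)                               # deviance scale: smaller is better

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=50)        # hypothetical observations

def normal_pointwise_loglik(mu_draws, y, sigma=1.0):
    # Pointwise Gaussian log-likelihood for each posterior draw of mu.
    mu = np.asarray(mu_draws)[:, None]
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)

good = normal_pointwise_loglik(rng.normal(0.0, 0.1, 400), y)  # well-centred model
bad = normal_pointwise_loglik(rng.normal(3.0, 0.1, 400), y)   # badly mis-centred model
best = "good" if waic(good) < waic(bad) else "bad"            # prefer the smaller WAIC
```

The mis-centred model earns a much larger WAIC because its pointwise predictive density for the observed data is far lower, which is exactly the behavior the selection rule relies on.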

DIC, WAIC, and WBIC on hypothesis testing tasks

Both DIC and WAIC are related to out-of-sample (generalization) prediction. This is generally a good way to evaluate models, even when you care more about the parameters than about the predictions. The general idea is that if your model and parameters are a good description of the underlying phenomenon or process you are studying, they should be able to predict unobserved (but observable) future data.

If you get a warning, you have a couple of options besides ignoring it: use another method, such as LOO instead of WAIC (or vice versa); use k-fold cross-validation; or change your model to one that is more robust. To compare your models you can also add posterior predictive checks (although that is an in-sample analysis) and background information to the mix. A little more about the warnings: they are based on empirical observations, and while more work is needed in this area, at this point they are the best diagnostics we have. Notice that with DIC/BPIC you always get a clean result without any warnings, even when the assumptions are not met, and that can lead to overconfidence!

DIC assumes the posterior is Gaussian; the further you move away from this assumption, the more misleading the values of DIC will be. Hierarchical models, in particular, tend to have non-Gaussian posteriors. WAIC is also more Bayesian because it averages over the posterior distribution.

I hope you've gained a fundamental understanding of Bayesian hypothesis testing at this point. We've shown two Bayesian hypothesis testing paradigms: 1) traditional hypothesis testing by comparing HPD intervals, and 2) proportional testing, where we compare two competing hypotheses through the Bayes Factor. Using the Bayes Factor is justified in many group research scenarios - whenever a group has not reached consensus on the background knowledge, we may want to test its sensitivity and its potential to gain information from the data. Congratulations! Let's continue exploring Bayesian estimation in the next video!