Bayesian Estimation Supersedes the T-test (BEST)

Question 1: What kind of question does BEST solve?

If you work as a data scientist, you have ample opportunities to compare performance metrics across groups and to interpret how the differences should inform decisions. In this lecture, we are going to compare the batting averages of two MLB players using Bayesian Estimation Supersedes the T-test (BEST). BEST is a Bayesian model that can be used wherever you would classically use a two-sample t-test. It estimates the difference in means between two groups and yields a probability distribution over that difference.

Question 2: What is the procedure for implementing BEST, and what diagnostics does BEST offer?

Using BEST to compare groups has two main steps. First, we (1) choose proper prior distributions, which specify our beliefs about the two players' performance before observing the actual data, (2) choose a reasonable likelihood that reflects all relevant information about the data, and (3) find the posterior distribution, which is our updated belief. These procedures can all be done in one model using the probabilistic programming package PyMC3.

Second, using the posterior distribution, we can compare the two MLB players. We analyze the differences by drawing the posterior distribution plot and the forest plot, and by looking at the posterior distributions of the difference of means, the difference of standard deviations and, finally, the effect size. This Bayesian version of the t-test estimates not only how different the two groups are, but also how uncertain those differences are.
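To make the three comparison measures concrete, here is a minimal sketch of how they can be computed from posterior draws. It assumes the effect size definition commonly used with BEST, the difference of means divided by the square root of the average of the two variances. The function names and the toy draws are hypothetical and are not taken from the lecture's notebook.

```python
import math

def effect_size(mu1, mu2, sd1, sd2):
    """Difference of means scaled by the root of the averaged variances.

    Assumed definition: (mu1 - mu2) / sqrt((sd1**2 + sd2**2) / 2).
    """
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (mu1 - mu2) / pooled_sd

def derived_quantities(draws):
    """Compute the three measures for each (mu1, mu2, sd1, sd2) draw."""
    return [
        {
            "diff_means": mu1 - mu2,
            "diff_sds": sd1 - sd2,
            "effect_size": effect_size(mu1, mu2, sd1, sd2),
        }
        for mu1, mu2, sd1, sd2 in draws
    ]

# Toy posterior draws (made-up numbers, for illustration only).
toy_draws = [(0.30, 0.28, 0.46, 0.45), (0.31, 0.27, 0.46, 0.44)]
toy_result = derived_quantities(toy_draws)
for row in toy_result:
    print(row)
```

Because each measure is computed per draw, we get a whole distribution for every measure rather than a single number.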

Question 3: Research Question

There are many interesting questions we could ask about one of the most exciting games, American baseball. Many people are intrigued by what determines who plays better as a Major League Baseball (MLB) hitter. I listed a few frequently asked questions.

Keep these in mind and let's do some coding!

Refined research question:

What we just printed out is the batting average (BA). It takes a player's total hits (Bat Outcome == 1) and divides them by the total number of at-bats. Successful hits for Yuli and Vlad account for slightly less than 30 percent of their at-bats. An average player in the current MLB hits 27%. Now that our data contains batting performance for two players, let's refine our research question. Given the batting performance of Yuli Gurriel and Vladimir Guerrero Jr., whose batting average is higher? Has either player hit a significantly higher proportion of balls in the last few seasons?
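As a quick illustration of the batting average calculation, here is a minimal sketch. The outcome list is made up for the example (1 means a hit, 0 means no hit) and is not the players' actual data.

```python
def batting_average(outcomes):
    """Share of at-bats that ended in a hit (outcome == 1)."""
    if not outcomes:
        raise ValueError("need at least one at-bat")
    return sum(outcomes) / len(outcomes)

# Hypothetical at-bat outcomes: 1 = hit, 0 = no hit.
example_outcomes = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
example_ba = batting_average(example_outcomes)
print(f"{example_ba:.3f}")  # prints 0.300 (3 hits in 10 at-bats)
```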

Now let's estimate the difference in performance between Yuli and Vlad at bat. This time we also time the modeling process to look at the modeling efficiency.

Posterior Diagnosis

It took about 41 seconds to complete the model sampling process. The trace object has now been computed by the NUTS sampler, and it provides 12,000 combinations of parameter values (3,000 iterations times 4 chains). Each combination is representative of credible parameter values that simultaneously accommodate the observed data and the prior distribution.

Looking at the group differences below, we can conclude that there are meaningful differences between the two groups for all three measures. For these comparisons, it is useful to use zero as a reference value (ref_val); providing this reference value yields the cumulative posterior probability on either side of the value. For the difference of means, at least 97% of the posterior probability is greater than zero, which suggests the group means are credibly different. The effect size and the difference in standard deviations are similarly positive.
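The idea behind the reference value can be sketched in a few lines: count the share of posterior draws that fall above it. The draws below are simulated from a normal distribution purely as a stand-in, so the mean and spread used here are assumptions and not the output of the lecture's model.

```python
import random

def prob_above(samples, ref_val=0.0):
    """Fraction of posterior draws strictly greater than the reference value."""
    return sum(1 for s in samples if s > ref_val) / len(samples)

# Simulated stand-in for 12,000 posterior draws of the difference of means.
# The location and scale are assumed for illustration only.
rng = random.Random(0)
diff_draws = [rng.gauss(0.02, 0.01) for _ in range(12000)]

p_above = prob_above(diff_draws, ref_val=0.0)
print(f"P(diff > 0) = {p_above:.3f}, P(diff <= 0) = {1 - p_above:.3f}")
```

The two printed numbers correspond to the two percentages that the posterior plot shows on either side of the reference line.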

When forestplot is called on a trace with more than one chain, it also plots the potential scale reduction parameter, which is used to reveal evidence for lack of convergence; values near one, as we have here, suggest that the model has converged.
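For intuition, here is a minimal sketch of the classic potential scale reduction calculation, which compares the variance between chains with the variance within chains. This is the textbook Gelman-Rubin form and an assumption on our part; the plotting library may use a refined variant, so treat the code as illustrative. The simulated chains are made up.

```python
import random
from statistics import mean, variance

def potential_scale_reduction(chains):
    """Classic R-hat for a list of equal-length chains of one parameter."""
    n = len(chains[0])                                # draws per chain
    chain_means = [mean(chain) for chain in chains]
    within = mean([variance(chain) for chain in chains])   # W
    between = n * variance(chain_means)                    # B
    pooled = (n - 1) / n * within + between / n            # pooled variance estimate
    return (pooled / within) ** 0.5

rng = random.Random(1)
# Four chains exploring the same distribution: R-hat should be close to 1.
mixed = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# Four chains stuck around different values: R-hat should be well above 1.
stuck = [[rng.gauss(2 * k, 1) for _ in range(1000)] for k in range(4)]

rhat_mixed = potential_scale_reduction(mixed)
rhat_stuck = potential_scale_reduction(stuck)
print(f"mixed chains: {rhat_mixed:.3f}, stuck chains: {rhat_stuck:.3f}")
```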

Alright! In this lecture we've discussed how to use Python to visualize credible intervals in a Bayesian t-test, so that we can effectively communicate the posterior distribution, our new belief, to the audience. We emphasized the importance of visualizing the highest density interval, because it lets us boil down the complexity of communicating the parameters in BEST. It gives a clear graphical representation of the posterior interval for the difference of means of two comparable groups, the difference of standard deviations, and the effect size, which tells us statistically how strong the difference between the two groups is. The highest density interval is used so that, in our case study, we capture 95% of the posterior density with the shortest interval, which best represents the central tendency of the posterior distribution. Bye for now!


Does that necessarily mean Vlad will perform much better because Yuli missed all 5 hits at the beginning? Probably not. Fortunately, the dataset actually contains more than 1000 hits from each player, so I would say it is quite a representative sample for comparing the two players' performance.

For those who are familiar with MLB history, most of the time batting averages over a season hover somewhere between 0.220 and 0.360, with just a few extreme exceptions on either side. So it is quite possible for a player to get a few strikeouts in a row, or a few consecutive hits.

All three of these intervals, one from 27.8% to 34.7%, another from 0% to 34.1% and the last one from 28.4% to 100%, are 95% credible intervals. In other words, each of them covers a range of values that carries 95% of the posterior plausibility. But they're all different. So the question is, which interval should we choose to communicate our belief about the posterior distribution? I've worked with many statisticians across different universities, and the conclusion is that it depends on the goal of the problem being solved.

Here is a common rule of thumb. In a previous lecture we showed that as more data comes in and you have more evidence, the posterior variance gradually decreases, so a narrower posterior distribution is an indication of stronger belief. In other words, it indicates we're more certain that the true parameter (e.g. batting average) falls into a specific range, instead of saying it's plausible everywhere, as in random guessing. Using similar logic, we generally choose the shortest credible interval that contains, for example, 95% of the plausibility, to communicate the strongest belief we can given the posterior distribution.
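The rule of thumb can be sketched as a small search: sort the posterior draws, slide a window that holds 95% of them, and keep the narrowest window. The Beta posterior below is a made-up stand-in (its parameters are assumptions, not the lecture's model), and the two one-sided intervals are included only to mirror the comparison above.

```python
import math
import random

def hdi(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    window = math.ceil(mass * n)          # number of draws the interval must hold
    if window < 2 or window > n:
        raise ValueError("not enough samples for the requested mass")
    # Try every run of `window` consecutive sorted draws; keep the narrowest.
    start = min(range(n - window + 1),
                key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[start], ordered[start + window - 1]

# Hypothetical posterior for a batting average, for illustration only.
rng = random.Random(42)
posterior_draws = [rng.betavariate(31, 69) for _ in range(20000)]

ordered_draws = sorted(posterior_draws)
window = math.ceil(0.95 * len(ordered_draws))
left_interval = (ordered_draws[0], ordered_draws[window - 1])   # lowest 95%
right_interval = (ordered_draws[-window], ordered_draws[-1])    # highest 95%
hdi_low, hdi_high = hdi(posterior_draws)

for name, (lo, hi) in [("left", left_interval), ("right", right_interval),
                       ("HDI", (hdi_low, hdi_high))]:
    print(f"{name:>5}: {lo:.3f} to {hi:.3f} (width {hi - lo:.3f})")
```

All three intervals hold 95% of the draws, but the highest density interval is never wider than the other two, which is why it is the one we usually report.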