Video 1: Bayesian Regression Concept

Question 1: What are data scientists' main objectives in using Bayesian regression?

In this module, we will learn a cool concept called Bayesian regression. Regression is central to the practice of statistics. It is part of the core knowledge expected of any statistician and data scientist, because most of us are curious about explaining relationships. We want to understand, for example, whether spending more time studying leads to a better exam result or better mastery of knowledge. We also want to know how strongly the time spent studying affects learning outcomes. In short, we want to find a relationship between factors and an outcome.

Question 2: What are the major components of Bayesian regression models?

I'll give you a quick snapshot of the Bayesian regression framework, including some of the assumptions we are going to need, and then we'll talk about how to attack the regression problem using the Bayesian framework. So what are the components of Bayesian regression models? Every time we look at a dataset, we get multiple features in it. Let's consider the following research question: What makes a popular TED Talk? There are several factors we might imagine contributing to the popularity of TED Talks: the number of positive ratings received, the length of the talk, who the speaker is, the title of the talk, or even the number of speakers invited to the talk. These factors are called independent variables, or predictive features. Our goal, obviously, is to find out their relationship to the popularity of TED Talks, so the popularity of a TED Talk (say, the number of views it receives) is defined as the dependent variable of the regression model.

Question 3: Why do we want to use the Bayesian methodologies in regression analysis?

The biggest motivation for using Bayesian methodologies in regression analysis is that they are really flexible. In Bayesian regression, probabilities can be based on degrees of belief. Our prior is a guess at the model parameters over the independent variables based on expert knowledge. Maybe you know something about the problem. Maybe you have some previous information that we are not going to collect now, but that you do want to use in the model. Above all, we want to update our beliefs with new information (like a dataset) in a principled manner, and whatever model we choose, we want to solve problems of different natures, including ones that other statistical methods might find difficult to approach.

Following Bayes' rule from course 1, the graph shows that instead of depending only on the data, Bayesian methods combine all prior information with the likelihood (a specification of the data) to obtain posterior diagnostics in Bayesian regression. Posterior diagnostics contain not only point estimates such as the posterior mean and posterior variance, but also 1) how correlated each posterior sample is with the previous value, 2) whether each simulation of posterior samples is similar in distribution, and 3) the probability density corresponding to different posterior estimates. So the Bayesian model offers richer information compared to other statistical paradigms, and because of this you can gain insights from the analysis that we otherwise wouldn't have.

The probabilistic programming package PyMC3 offers various functionalities for building Bayesian regression models, and the language PyMC3 uses is much easier to work with than most packages that support Bayesian analysis. Similar to updating probability distributions in week one, PyMC3 uses Markov chain Monte Carlo (MCMC) simulations to return the posterior distribution for each predictive feature in a Bayesian regression model. It tells us not only the posterior distribution for each predictive feature, but also whether the independent simulation chains of posterior samples are similar to each other. Cool!

So I hope you enjoyed this very short introduction to Bayesian regression. I'm going to move on to coding. Just one last point of emphasis! In this specialization, I'll definitely not propose ideas and research questions that dive into spurious relationships. For instance, I won't teach modeling to find the relationship between average chocolate consumption by country and the number of Nobel laureates in that country. As data scientists, we need to be critical when reading articles from various sources, and when proposing problems we need to make sure the relationships we're going to explain to our audience are logical and reasonable. We should never deceive ourselves or other people with spurious relationships that do not benefit our understanding of the world. Cool! Now that we have discussed some general ideas about regression, I'll see you in the live coding demos!

Video 2: Regression Sampling Process

Now that we have discussed some general ideas about Bayesian regression, let's begin to learn how to build linear regression models, starting with a look at the sampling process of regression models.

Question 1: What assumptions should the data meet in order to perform Bayesian regression?

First of all, to estimate a regression model, we should specify a probability distribution for the data (say y1, y2, ..., yn are the outcome data), such as normal, Poisson, or binomial. The Bayesian approach requires us to specify prior distributions not only for the parameters, but also for a homoscedastic error term, meaning a constant variability across outcomes.

In this case we are doing a linear regression. We have two main assumptions as we find the best values of the regression coefficients to adapt the model to predict TED Talk views. The Bayesian linear regression model of TED Talk views has an expression of the following form:
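A generic form consistent with the description below, written for $k$ predictors, is

$$y_i \sim \mathcal{N}(\mu_i,\ \sigma^2), \qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}, \qquad \sigma > 0,$$

where $y_i$ is the number of views of talk $i$ and $x_{i1}, \ldots, x_{ik}$ are its predictor values.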

To make things clear, the dependent variable, TED Talk views, is assumed to follow a normal distribution. The statement enclosed in parentheses indicates that we predict the mean value of TED Talk views as a linear combination of regression coefficients times predictors, plus the intercept, whereas the standard deviation is a positive number that represents how much variability, or fluctuation, to expect in TED Talk views.

Question 2: How to implement Bayesian regression using the PyMC3 package?

So, our DataFrame contains a total of 2550 rows and 17 variables, where each row contains the data record of a particular TED Talk. It's a rich dataset. We have 17 variables covering the following features:

Cool! It seems like a long list of columns. Do we need all these features to understand what factors contribute to the popularity of TED Talks? Not really! So let's first do a little visual inspection of the attributes we are interested in.

Research Question:

Up till now we've pre-processed the data. Let's simplify our research question a bit for tutorial purposes. Can we predict the popularity of a TED Talk (i.e., its number of views) if we know the length of the video, how many languages it is available in, and the number of comments following the talk? Which of these factors significantly contribute to the popularity of TED Talks?

During the sampling process you can take a coffee break. Okay! Now you've created your first Bayesian linear regression model, and the model took around three minutes to sample the posterior distribution numerically. The sampler reports sampling 3 chains for 2,000 tune and 4,000 draw iterations. Think of it as recruiting three independent groups of TED Talks, each with 2,000 + 4,000 talks, where only the last 4,000 draws are returned as samples. Each chain represents one simulation, a different world of TED Talk samples, and our model returns three simulated results in total.
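As a rough sketch of how that model can be specified, assuming the pre-processed DataFrame is called df and uses the hypothetical column names duration, languages, comments, and views:

```python
import pymc3 as pm

with pm.Model() as ted_model:
    # Weakly informative priors for the intercept and the three slopes
    intercept = pm.Normal("intercept", mu=0, sigma=10)
    b_duration = pm.Normal("b_duration", mu=0, sigma=10)
    b_languages = pm.Normal("b_languages", mu=0, sigma=10)
    b_comments = pm.Normal("b_comments", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=10)

    # Mean views as a linear combination of the predictors
    mu = (intercept
          + b_duration * df["duration"].values
          + b_languages * df["languages"].values
          + b_comments * df["comments"].values)

    views = pm.Normal("views", mu=mu, sigma=sigma, observed=df["views"].values)

    # 3 chains, 2,000 tuning steps, 4,000 kept draws - matching the video
    trace = pm.sample(draws=4000, tune=2000, chains=3)
```

The prior scales here are placeholders; with raw view counts in the millions, you would rescale the data or widen the priors accordingly.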

Let's draw another regression plot that can shed light on how well a linear relationship describes the predicted views versus the actual views from the data.

In this video, we've demonstrated how to run a Bayesian linear regression using PyMC3. After seeing a few negative results among the predicted outcomes, you might consider that one big improvement to the regression model in this tutorial would be ensuring the predicted video views are all positive, because no video can have negative views. We are still uncertain about how well this model performs, so let's move on to the next video, where we'll conduct some posterior diagnostics.

Video 3: Regression: Posterior Statistics

Anytime we build Bayesian models, we need to derive a posterior distribution. It contains all the information we're interested in about our parameters, given the data and a model, and deriving it is chiefly the purpose of training on data and using PyMC3's sampling algorithms.

Question 1: Why are posterior statistics important for Bayesian analysis?

Coming back to the definition, the posterior distribution represents our updated belief after taking in new evidence. In other words, it combines our entire prior knowledge, represented as the prior distribution, with new evidence, specified through the likelihood of the observed data. Remember that the overall purpose of conducting Bayesian analysis is accumulating knowledge and updating beliefs. Each time new data is available, the posterior statistics characterize our updated belief, so the posterior can be reused in the next round of modeling if we think of the iterative, automatic modeling process called chaining Bayes' rules.

Nowadays we're interested in continuous deployment and integration; if you're familiar with these techniques in big data, this typically resembles leveraging the posterior distribution as the characteristics or configuration of the next deployment. Simply put, posterior statistics help us interpret experimental results by providing rich inference, and they help us move on to the next experiment.

Question 2: What are some posterior statistics that are necessary to gain actionable insights?

Now we've obtained a trace object. This is the output generated from MCMC sampling using pm.sample(), and its data type is MultiTrace, a special data type storing the posterior values as well as their trajectories.

The first step after sampling the Bayesian regression model is to inspect the posterior probability distributions for each individual parameter in the model given the trace object.
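With ArviZ (which PyMC3 uses for plotting), that inspection can be as simple as:

```python
import arviz as az

# One posterior density per parameter, annotated with its mean and HDI
az.plot_posterior(trace)
```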

Based on the graph, we can make posterior-based interpretations as follows:

Long story short, these pair plots are great visualizations to identify trends that help explain the interaction between multiple variables.
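Assuming the same trace object, a pair plot can be produced along these lines:

```python
import arviz as az

# Pairwise scatter plots of parameters to reveal posterior correlations
az.plot_pair(trace, kind="scatter")
```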

In this video, we've introduced how to use posterior plots and pair plots to examine the posterior distribution using the posterior samples from the trace object. Most importantly, I hope you have a better understanding of how to leverage posterior statistics to vividly interpret modeling results for a variety of audiences, because communication and knowledge delivery are still the ultimate goals for a data scientist.

As you've seen in the last density plot, posterior inference not only helps us understand how strongly each TED Talk feature interacts with views every time new data comes in, but also helps others learn new knowledge and make predictions. As practice, I recommend you try plugging in different scenarios (e.g., a 20-minute video that has attracted 1,000 comments) and see how sensitive the predicted views are to each video condition.

Video 4: Regression: Traceplot

In this video, we'll zoom in on the traceplot. The traceplot is a main diagnostic for MCMC simulation.

A major question frequently asked about regression is: are the data we chose sufficient to precisely identify a complex model involving multiple influences on the response, the TED Talk views? Are the model's estimates robust to changes in the prior specification or to the influence of particular data observations (e.g., the most-viewed or least-viewed videos)?

Question 1: What are the major components of the traceplot?

There are two main visualizations given in the traceplot. On the left, we get a KDE plot that looks smooth, aside from several curves drawn with different line types. It shows the posterior density given by the posterior samples, i.e., the posterior distribution for each predictor, including the intercept.

The right-hand side shows the trajectory of the Monte Carlo simulation for each chain. In normal cases, it should look somewhat like white noise.
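One call produces both panels, assuming the trace object from the earlier sampling run:

```python
import arviz as az

# Left: posterior density per parameter; right: sampled values per iteration
az.plot_trace(trace)
```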

Question 2: How to infer convergence/divergence through traceplot?

On the right panel, the x-axis shows the sample index within each trajectory and the y-axis shows the posterior sample values. We should not see any recognizable pattern, because we want good agreement between the three chains, which indicates that the sampling process went smoothly. Nor do we want to see a curve trending up or down, because we want each chain to converge to and meander around a single value. In this case, the video duration effect meanders around 0.02 and looks quite similar to white noise, so the estimate is stable. If we pick a longer video, we are confident in saying we'll gain about 20,000 viewers per one-minute increase in video length; we won't get weird regression results like 60,000, 70,000, or even 100,000 just because we drew more samples.

On the left panel, we see that all five variables have similar traces across the curves; no single trajectory strays from the others. This shows that our model convergence is in good shape: although different simulations come out of the sampling process, the predictions remain steady. We don't need to worry about poorly mixed chains, and we can trust the posterior distribution we found.

Question 3: On what occasions will the posterior distribution of parameters look stable?

Although Bayesian models are more robust than most statistical methods, there may still be cases where outliers and influential observations (i.e., abnormal data points that deviate from the main body of the sample) diminish the quality of the posterior diagnostics. By robust, we mean that a model's output predictions are consistently accurate, including cases where one or more input data points are drastically changed due to unforeseen circumstances such as measurement errors.

If the traceplot shows no special pattern, there's nothing we need to do to make our posterior look stable. But if there is one, I'd recommend you take the centered form of each independent variable and rerun the model. Research has found that correlations between regression parameters may be reduced if we transform each independent variable by centering it, i.e., subtracting its mean.
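A minimal sketch of that centering transformation in pandas, with the same hypothetical column names as before:

```python
# Subtract each predictor's mean so the chains sample less correlated parameters
for col in ["duration", "languages", "comments"]:
    df[col + "_centered"] = df[col] - df[col].mean()
```

After creating the centered columns, rebuild and resample the model using them in place of the originals.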

Long story short, we've seen that a traceplot in a regression analysis provides a visual way to inspect sampling behavior and assess mixing across chains and convergence. It's critical that we're able to tell whether we need to write a new model or whether the current one is good to keep. Now, let's move on to the next diagnostic.

Video 5: Regression: Effective Sample Size

In this video we'll continue exploring another important regression diagnostic - effective sample size.

Question 1: How does effective sample size help us understand the quality of the Bayesian model?

Ideally, the effective sample size should be close to the actual sample size. One advantage of the NUTS sampler, from most of my experience with MCMC sampling, is that the effective sample size of NUTS is usually much higher than that of most other algorithms, such as Metropolis (the default for discrete parameters, e.g., in binomial models). The effective sample size determines the precision we can claim for the posterior estimates.
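For example, a standard result relates the Monte Carlo standard error (MCSE) of a posterior mean to the effective sample size:

$$\text{MCSE} \approx \frac{\text{posterior standard deviation}}{\sqrt{\text{ESS}}},$$

so a parameter with a posterior standard deviation of 0.2 and an ESS of 400 has its posterior mean pinned down to within roughly $0.2/\sqrt{400} = 0.01$.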

An effective sample size that is too low means the model quality is poor. There's a well-known warning message that typically points out the problem of low effective sample size during modeling.

Question 2: Why should the model contain a decent proportion of effective samples?

The effective sample size is essential for ensuring that the model estimates, including posterior statistics like the mode, mean, and standard deviation, are stable. Otherwise, if a certain parameter has a low effective sample size, to my knowledge the model estimate for that parameter may be unstable. Unstable model estimates can show up in various forms: most of the time, the posterior shows a long tail, or long tails on both sides, when a low effective sample size occurs. Sometimes the scenario is even more serious, where the posterior distribution may look flat as the effective sample size approaches zero, indicating that our updated belief does not make us more knowledgeable despite the incoming data. In these cases, the posterior variance will be much greater than usual.

Question 3: How to visualize effective sample size in PyMC3?

Cool! As an extra piece of knowledge, PyMC3 will return a warning message if the effective sample size of any variable drops below 200, so when we see that message we might want to examine the model specification and re-run the model to increase the effective samples. As data scientists, we want the effective sample size to be close to the actual number of posterior samples, because we want estimates that are reliable and consistent.
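Assuming the trace object from before, ArviZ can both compute and visualize the effective sample size:

```python
import arviz as az

print(az.ess(trace))   # effective sample size for each parameter
az.plot_ess(trace)     # visualize ESS, e.g. across posterior quantiles
```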

In this video, we've discussed the importance of checking the effective sample size for all regression coefficients. At the end of the day, it's important to ensure that every parameter in a Bayesian regression model returns a reasonable number of effective samples so that the posterior estimates are reliable to interpret. Let's now move on to the next diagnostic!

Video 6: Regression: Highest Posterior Density

Question 1: What is the highest posterior density used for?

In this video, we will dig deeper into a commonly used device for summarizing the characteristics of posterior distributions - the highest posterior density method. First, in this example, the posterior statistics show the strength of the intercept's and the three predictors' effects on TED Talk views. A major Bayesian approach to summarizing posterior statistics is interpreting the credible interval of the posterior.

The figure compares the 90% HDI and another credible interval that has 90% mass.

According to Hyndman (1996), a highest posterior density interval is defined as the shortest interval on a posterior density for a given probability level, such as 90%. A highest posterior density interval starts from the posterior mode (where the posterior density is highest) and extends such that the posterior density inside the interval is always at least as large as the density outside it, until its coverage reaches the specified probability, such as 90%. ArviZ has many functions to help us summarize the posterior; for example, az.plot_posterior can be used to generate a plot with the mean and HPD of a distribution. In PyMC3 and ArviZ, the reason the developers provide the hdi_prob option is to let us customize the level of uncertainty when using the highest density interval, since different problems have different credible interval requirements. Let's look at some code.
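A minimal example, reusing the earlier trace and requesting a 90% interval:

```python
import arviz as az

# hdi_prob controls how much posterior mass the highest density interval covers
az.plot_posterior(trace, hdi_prob=0.90)
```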

Question 2: What are the characteristics of the highest density interval?

We've discussed the reference value for testing whether regression effects differ from null effects. We now understand that when we want to approximate posterior distributions of variables using a finite number of samples, we can use the highest posterior density plot to check whether we have a reasonable distribution. It is, and will continue to be, a fascinating way to study uncertainty through the densest representation of the posterior distribution.

Video 7: Regression: Convergence and Autocorrelation

Another common type of issue, where the posterior samples derived from the MCMC algorithm might be problematic, is failing to converge or obtaining highly autocorrelated posterior samples. In this video, we'll look specifically into convergence and autocorrelation as crucial posterior diagnostics.

Question 1: In what ways do we check convergence and autocorrelation in posterior diagnostics?

At a relatively late stage of posterior diagnostics, we often want to validate whether we can wrap up our study with the current model, say the Bayesian regression model. In the last video, we checked the effective sample size: if the number of effective samples drops below the critical threshold of 200, we raise an alert and rerun the model in some form. We do that to make sure the effective samples for all parameters are at least 200 (at least in my course) so that the resulting posterior estimates are more reliable. So, assuming we pass the effective sample check, the next things we need to check are a convergence diagnostic and the autocorrelation plot.

In terms of convergence, remember the traceplots we made in the previous video: if the chains have converged well, each trajectory on the right panel should look similar to the others.
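Beyond visually comparing trajectories, a numeric convergence check is available; a sketch, reusing the earlier trace:

```python
import arviz as az

# r_hat compares within-chain and between-chain variance;
# values close to 1 for every parameter indicate good convergence
print(az.rhat(trace))
```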

Question 2: What are some requirements for autocorrelation in Bayesian regression models?

Now we will check autocorrelation. So what is autocorrelation? We have a trajectory of posterior samples for each simulation. Autocorrelation describes the extent to which the order of the samples contributes to their values. Here is an example based on the S&P 500 index in the stock market.

These sneaky increasing or decreasing patterns of the S&P 500 during the 2019-2021 time frame are what autocorrelation looks like: in this graph, the S&P 500 index is shown as a function of the order of samples - time is an increasing factor of the stock index.

Image from: https://nextlevel.finance/sp-500-all-time-high/
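To check for patterns like this in our posterior samples, ArviZ provides an autocorrelation plot (again assuming the earlier trace object):

```python
import arviz as az

# Autocorrelation of each chain as a function of lag; healthy chains
# drop to near zero after the first few lags
az.plot_autocorr(trace)
```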

Question 3: What causes autocorrelation in Bayesian regression posterior diagnostics?

I can't say exactly in which cases the posterior samples are guaranteed to show high autocorrelation. It's good to read recent books and papers and see if there are new findings about detecting autocorrelation. In a practical sense, we should be aware of adjusting the step size. If the step size we use is too large or too small, either can lead to high autocorrelation - with a small step size, you are likely to accept nearly all steps, which leads the chain to display random-walk behaviour; on the other hand, with a large step size, you are likely to reject necessary steps, which leads the chain to get stuck. In both cases, the autocorrelation will be high. Knowing this, we should specify at least 1,000 tuning steps in the PyMC3 sampling process, which is exactly the number of tuning samples PyMC3 assigns automatically.

Question 4: Why are convergence and autocorrelation important for checking the goodness of the model?

In this video, we've discussed the convergence metric and the autocorrelation plot, in order to finally determine whether we can publish the Bayesian regression model and disseminate the knowledge derived from the posterior. We found that if the r_hat statistic is close to 1, then all simulations relevant to that variable have converged well, so the posterior estimate is reliable. We also summarized that autocorrelation should be low in order for the posterior samples to carry as much information as possible. See you in the next video!

Video 8: Regression: Posterior Predictive Check

In this lecture, we will learn how to perform a posterior predictive check after sampling a regression model in PyMC3 and checking the reliability of the model (e.g. effective sample size, autocorrelation and convergence).

Question 1: What exactly does posterior predictive check mean?

According to Andrew Gelman, a posterior predictive check is a process of checking for systematic discrepancies between the actual samples and simulated posterior samples. The process starts by 1) generating replicated data from the posterior samples of our fitted model (in this case, the predicted TED Talk views), and then 2) comparing these replicated data to the observed data used for modeling (the actual TED Talk views in the TED dataset). A posterior predictive check compares the distribution of the predicted TED Talk views (this distribution of predicted samples is conditioned on the actual TED Talk views; we call it the posterior predictive distribution) with the distribution of the actual TED Talk views. In other words, it checks how well the posterior predictive distribution of TED Talk views approximates the distribution of the actual data samples.

Question 2: Why is posterior predictive check important in Bayesian analysis?

Now let's think about it a little more deeply. Why should we compare the samples drawn from the posterior distribution with the actual data samples?

The answer is straightforward -

In course 1, I developed the concept of chaining Bayes' rule, in which I articulated that Bayesian inference allows informative priors to serve as useful knowledge. After a posterior predictive check, we're able to determine whether it's reasonable to accumulate the posterior results from the current Bayesian model as prior knowledge for the next iteration of Bayesian analysis.

Let's start to code!
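A rough sketch of that check in PyMC3, reusing the hypothetical ted_model and trace from the sketch in Video 2:

```python
import arviz as az
import pymc3 as pm

with ted_model:
    # Replicated TED Talk views drawn from the posterior predictive distribution
    ppc = pm.sample_posterior_predictive(trace)

# Overlay the replicated distributions on the observed views
az.plot_ppc(az.from_pymc3(trace, posterior_predictive=ppc, model=ted_model))
```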

There are two possible remedies that can rectify incorrect negative-value predictions in situations like ours.

We can

Question 3: How to show the relationship between predictors and outcome from posterior predictive check?

After a posterior predictive check of the TED Talk views against the actual data, it's always nice to make a plot tailored to our use case - checking the predicted effect of the predictors on TED Talk views. Here, it would be interesting to plot the predicted relationship between the duration of TED Talks, the number of languages available online, and the number of comments on one hand, and the TED Talk views on the other. This can be done by pushing the parameters through the model, given that we've already sampled posterior predictive samples.
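One way to sketch this is to combine the posterior samples of each coefficient directly (variable names follow the earlier hypothetical model):

```python
import numpy as np

# A hypothetical talk: 20 minutes long, 30 languages, 1,000 comments
duration, languages, comments = 20.0, 30.0, 1000.0

predicted_views = (trace["intercept"]
                   + trace["b_duration"] * duration
                   + trace["b_languages"] * languages
                   + trace["b_comments"] * comments)

# The posterior distribution of the expected views for this scenario
print(np.mean(predicted_views), np.percentile(predicted_views, [5, 95]))
```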

One limitation of conducting a full posterior predictive check for multiple variables is that most humans can comprehend 2-dimensional plots, but not plots of higher dimension. Still, you've seen that it's common for a Bayesian regression model to capture the actual data when they have reasonable values; those data points lie close to the posterior predictive interval shown in the graph. The model can't always predict data points in the tail, more than 3 standard deviations away from the average, because the model places the same weight on every data point. Statistically, a data point to which a model should give the highest weight to create the best prediction line is called an influential point (that's out of the scope of this course, though). We might cover some of these topics in a more advanced statistics course.

This is the end of the Bayesian regression example. I hope you've gained the techniques of sampling a Bayesian model, plotting regression results, conducting several reliability tests, and finally making posterior predictions. I'll see you in the next video!

Reference:

Hyndman, R. J. (1996). Computing and Graphing Highest Density Regions. The American Statistician, 50(2), 120-126.
