Welcome to the Markov Chain Monte Carlo module. In this video series, you'll learn the concepts behind Markov Chain Monte Carlo, and we'll follow up with code examples whenever possible so that we gain a better understanding of the mechanisms PyMC3 uses under the hood.
As you saw in the videos introducing PyMC3 and ArviZ, our goal in Bayesian analysis is to obtain reliable samples from the posterior distribution over all the parameters we've modeled. But most of the time we don't know the exact formula of the posterior distribution, and we have no direct way to calculate the posterior density or cumulative probabilities of these parameters. If we can't get these directly from the math, how can we determine the posterior distribution, and thus the credible intervals, for interpretation? We need Markov Chain Monte Carlo simulation, which is a kind of approximation method. Markov Chain Monte Carlo is what PyMC3 typically uses to fit Bayesian statistical models.
Suppose we want to understand how much time Coursera learners spend watching videos, taking quizzes, and writing code assignments. Using an approximation method, we sample a few thousand learners and record their values. These values are representative of the underlying distribution, even if we can't identify the exact distribution. In most cases we'll sample video-watching times or quiz times for many users to approximate the true mean and standard deviation of the underlying distribution. The Central Limit Theorem tells us that with a large sample (say, 3,000 learners), the distribution of the sample mean will look quite similar to a normal distribution, no matter what the underlying distribution looks like. In other words, if we can collect a large sample of data from course takers, we can infer the mean and standard deviation of the underlying distribution from those samples, and then approximate many useful characteristics of our posterior samples - so we can deliver insightful knowledge about course-taking behaviours supported by evidence-based studies. So when we can't compute the posterior distribution directly - when a closed-form solution is not available - we delegate the task to approximation methods.
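As a quick sanity check of that idea, here is a small NumPy sketch. The 30-minute average watching time and the exponential shape are made-up assumptions for illustration: even though individual watching times are heavily skewed, the averages of repeated 3,000-learner samples cluster tightly around the true mean, in a roughly normal shape.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" video-watching times in minutes: exponentially
# distributed (heavily skewed), with a true mean of 30 minutes.
true_mean = 30.0

# One sample of 3,000 learners already approximates the true mean well.
sample = rng.exponential(scale=true_mean, size=3000)
print(sample.mean())            # close to 30, despite the skewed data

# Repeat the survey 500 times: the Central Limit Theorem says these
# 500 sample means are approximately normally distributed around 30.
means = rng.exponential(scale=true_mean, size=(500, 3000)).mean(axis=1)
print(means.mean(), means.std())
```

Notice that we never needed to know the shape of the watching-time distribution to trust the behaviour of the sample mean.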
Markov Chain Monte Carlo (or MCMC) is a class of algorithms for drawing samples from a probability distribution; in Bayesian modeling, the empirical distribution of those samples can be used to estimate the underlying posterior distribution of parameters given a set of observations. So it's useful for Bayesian statistics as another way of finding the posterior distribution when we can't apply the mathematical formula of Bayes' rule directly. MCMC is arguably one of the ten most influential algorithms of the 20th century. A number of early pioneers laid significant groundwork for MCMC algorithms:
John von Neumann was a vanguard who helped establish Monte Carlo methods in the 1940s.
Ulam and Metropolis wrote the first paper, in 1949, outlining how to use Markov chains for Monte Carlo approximation.
Metropolis, the Rosenbluths, and the Tellers wrote a 1953 paper in which they applied Markov Chain Monte Carlo to approximate distributions for a problem in chemical physics.
This approach is now commonly used when we don't know the true posterior distribution, or when a closed-form posterior solution is not feasible, thanks to the growth in computing power that has enabled comprehensive treatments of MCMC since the 1990s.
Markov Chain Monte Carlo is very useful for Bayesian modeling as a result of the law of large numbers. This law essentially says that if you take the average of the Monte Carlo samples, your Monte Carlo estimate will converge almost surely to the posterior mean. Practically speaking, this means the posterior mean can be estimated by taking the average of Monte Carlo samples drawn from the posterior distribution.
In a similar fashion, the true posterior variance can be estimated by taking the sample variance of the Monte Carlo samples from the posterior. Likewise, any posterior quantile can be estimated by taking the corresponding quantile of the Monte Carlo samples. In short, MCMC methods can approximate various statistics of complex distributions by sampling. Finally, note that Monte Carlo estimates tend to become more accurate as the Monte Carlo sample size increases.
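To make those three estimates concrete, here is a minimal NumPy sketch. We stand in for "MCMC draws from an unknown posterior" with draws from a known Gamma distribution - an assumption purely for illustration, so the answers can be checked: its true mean is 6 and its true variance is 18.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are Monte Carlo samples from a posterior we cannot
# write down.  We cheat and use Gamma(shape=2, scale=3), whose true
# mean is 2*3 = 6 and true variance is 2*3^2 = 18, to check ourselves.
draws = rng.gamma(shape=2.0, scale=3.0, size=50_000)

post_mean = draws.mean()             # estimates the posterior mean (~6)
post_var = draws.var(ddof=1)         # estimates the posterior variance (~18)
post_q95 = np.quantile(draws, 0.95)  # estimates the 95th posterior percentile

print(post_mean, post_var, post_q95)
```

Rerunning with a larger `size` makes all three estimates tighter, which is exactly the "more samples, more accuracy" point above.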
In this course, we have an opportunity to implement three MCMC algorithms using PyMC3: 1) the Metropolis (or Metropolis-Hastings) algorithm, 2) the Hamiltonian Monte Carlo algorithm, which is based on Hamiltonian mechanics, and 3) the No-U-Turn Sampler (NUTS) algorithm. I have seen quite a few discussions about MCMC and some of its implementations, specifically the Metropolis-Hastings algorithm and the NUTS sampler in the PyMC3 library.
Markov Chain
Fundamentally, a Markov chain can be described as a random walk through a space. It initializes at some starting point, and at each time step we decide to move somewhere else or stay on the same spot, depending only on where we are currently located. Put differently, the next step depends only on our current location, not on where we've been before. So in PyMC3, each posterior sample shown in the traceplot depends on the previous sample, and this process continues until the last posterior sample is drawn.
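A minimal NumPy sketch of such a random walk - note that the update rule reads only the current position, never the history:

```python
import numpy as np

rng = np.random.default_rng(1)

# A simple Markov chain: a random walk on the integers.
position = 0
path = [position]
for _ in range(1000):
    # The next state depends ONLY on the current position plus noise,
    # never on any earlier point in the path (the Markov property).
    position = position + rng.choice([-1, 0, 1])
    path.append(position)

print(path[:10])
```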
Metropolis-Hastings
The Metropolis-Hastings algorithm is the earliest known algorithm that uses Markov Chain Monte Carlo to explore a probability distribution without computing its normalizing constant. The whole idea is to start at a random point and propose moves using a proposal distribution.
It's a beautiful algorithm for producing samples: given a random starting point and a proposal distribution, an acceptance rule determines whether the next sample moves to the proposed location or stays at the current one. It's flexible because it extends to sampling multi-dimensional distributions - which is why this algorithm is widely used in PyMC3 and is the default algorithm for modeling binary data and discrete variables.
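Here is a from-scratch sketch of Metropolis-Hastings (a toy illustration, not PyMC3's implementation) targeting an unnormalized standard normal density. Only density ratios appear in the acceptance rule, so the normalizing constant is never needed:

```python
import numpy as np

rng = np.random.default_rng(2)

def unnorm_density(x):
    # Unnormalized target, proportional to a standard normal.
    # The normalizing constant 1/sqrt(2*pi) is never required.
    return np.exp(-0.5 * x**2)

def metropolis_hastings(n_samples=20_000, step_size=1.0):
    x = 5.0                          # an arbitrary (bad) starting point
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + rng.normal(scale=step_size)  # symmetric proposal
        # Accept with probability min(1, p(proposal) / p(current)).
        if rng.uniform() < unnorm_density(proposal) / unnorm_density(x):
            x = proposal             # move to the proposed location...
        samples[i] = x               # ...otherwise stay where we are
    return samples

samples = metropolis_hastings()
kept = samples[2000:]                # discard burn-in near the bad start
print(kept.mean(), kept.std())       # approximately 0 and 1
```

Discarding the early "burn-in" samples matters here because the chain starts far from the high-density region of the target.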
Hamiltonian Monte Carlo
Hamiltonian Monte Carlo methods use differential equations to estimate the posterior distribution of parameters. These differential equations define the kinetic and potential energy of a physical system and guide the updates of the MCMC samples - which helps find candidate solutions for approximating the posterior distribution.
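To give a flavor of those mechanics, here is a from-scratch, one-dimensional sketch (again, not PyMC3's implementation) of HMC on a standard normal target, using the common leapfrog scheme to integrate Hamilton's differential equations:

```python
import numpy as np

rng = np.random.default_rng(3)

# Potential energy U(q) = -log(target density) = 0.5*q^2 for a
# standard normal target; its gradient drives the dynamics.
def grad_U(q):
    return q

def hmc_step(q, step_size=0.2, n_leapfrog=20):
    p = rng.normal()                 # fresh momentum (kinetic energy)
    q_new, p_new = q, p
    # Leapfrog integration of Hamilton's differential equations.
    p_new -= 0.5 * step_size * grad_U(q_new)
    for _ in range(n_leapfrog - 1):
        q_new += step_size * p_new
        p_new -= step_size * grad_U(q_new)
    q_new += step_size * p_new
    p_new -= 0.5 * step_size * grad_U(q_new)
    # Metropolis correction on the total energy (the Hamiltonian).
    h_current = 0.5 * q**2 + 0.5 * p**2
    h_proposed = 0.5 * q_new**2 + 0.5 * p_new**2
    if rng.uniform() < np.exp(h_current - h_proposed):
        return q_new                 # accept the simulated trajectory
    return q                         # reject: keep the current sample

q = 0.0
samples = np.empty(5000)
for i in range(5000):
    q = hmc_step(q)
    samples[i] = q

print(samples.mean(), samples.std())  # approximately 0 and 1
```

The simulated physics lets each step travel far across the distribution while keeping a high acceptance rate, which is the main advantage over random-walk proposals.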
No-U-Turn Sampler
The No-U-Turn Sampler (NUTS) is an extension of Hamiltonian Monte Carlo methods. It was created by Matthew D. Hoffman and Andrew Gelman in 2011. It adapts dual averaging - a method for automatically tuning the step size between samples - to produce samples that approximate the underlying posterior distribution without the need for human intervention in tuning the model. Because of this advantage, it is now the default sampling algorithm in PyMC3 for continuous distributions (i.e. when we specify a continuous distribution for the likelihood function that models the observed data).
The gory details of how these MCMC algorithms work are beyond the scope of this course. If you want to study more use cases and learn how to write an MCMC sampling algorithm from scratch, I'd recommend reading an introductory post on the Towards Data Science platform. I've included the link to the tutorial at the end of this video.
So remember the three basic elements of a Bayesian model: 1) a prior distribution summarizing how likely you believe the possible outcomes are, 2) a likelihood that describes the probability of the data observed and collected in the experiment, and 3) a posterior distribution representing your updated belief about the parameter values after the evidence is made known. Markov Chain Monte Carlo helps us generate samples that approximate the posterior distribution of all the variables we want the model to estimate. An MCMC model needs only a few things: first, the prior distributions of all variables; second, a distribution characterizing the likelihood that takes the observed data as input; third, the type of MCMC algorithm used to sample from the posterior distribution; and finally, the number of samples to draw in each simulation and the number of simulations (chains) to run.
Hopefully this gives you some understanding of the origin of MCMC, what MCMC does, and why MCMC is useful and widely used in PyMC3, the probabilistic programming library we're exploring deeply in this course. One last reminder about using MCMC: there's no guarantee that MCMC can sample the posterior within a practical amount of computation time, so it's always good practice to diagnose the model assumptions, the priors, and the likelihood, and to check that the model is correctly specified before execution. Thanks for listening!