Now let's go out and use PyMC3 to look at an example about online advertisement. Online advertisement has ever expanding popularity because of rising internet penetration throughout the world. We can think of PyMC3 here as trying to combine data and all prior belief in a blackbox (i.e. we can't tell exactly the trajectory of each iteration) and we want to see the trajectory and infer knowledge from the distribution of posterior samples. In this example, our goal is really to find a distribution that best represents the cost that an individual online advertiser needs to spend on advertisement providers.
I'm going to make the problem a bit easier for teaching, so our research question is this: Can we utilize the data about whether a user clicked into an advertisement or not and the advertisement provider to predict the cost that a company needs to offer online advertisement to reach 10,000 users? We’ll see if we can find such a cost distribution.
# Let's start with bringing in the pymc3 package
import pymc3 as pm
# and a couple of regular data science packages for our analysis.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
# We actually revisit the click-through rate example from week 3 in course 1. The data for this example
# is stored in our assets folder as online_ads.csv, and contains click-through results of couple hundreds
# of advertisement impressions.
# Now we load our dataframe from the csv file using the pd.read_csv function. This is just a handy way of
# loading csv datasets.
online_ads = pd.read_csv('assets/online_ads.csv')
# Now let's look at the header of this dataset.
online_ads.head()
advertisement | provider | clicked | cost | |
---|---|---|---|---|
0 | MongoDB | 1 | 0.3 | |
1 | MongoDB | 0 | 0.3 | |
2 | MongoDB | 1 | 0.3 | |
3 | MongoDB | 0 | 0.3 | |
4 | MongoDB | 1 | 0.3 |
So our dataset records the name of the advertisement, the provider who launches the advertisement, whether a user visited the advertisement, and the cost of an advertisement impression. I recorded 1 in the clicked column if the user indeed clicked to the advertisement after impression, and 0 if not.
# Let's do a little visual inspection of the proportion of successful advertisement clicks that we are interested in.
# We could simply draw a histogram using the sns.histplot function and customize it using the bins parameter
# and set it as 2.
sns.histplot(x = 'clicked', data = online_ads, bins=2)
# Cool! This gives us an impression that less than a half of users actually clicked on a specific link
# among all users who viewed a page, email, or advertisement.
<AxesSubplot:xlabel='clicked', ylabel='Count'>
# Now let's subset the clicked column grouped by each provider, so create click_google
click_google = online_ads[online_ads['provider'] == 'Google']['clicked']
# and click_yahoo be pandas.Series as observed data.
click_yahoo = online_ads[online_ads['provider'] == 'Yahoo']['clicked']
# Cool! These series have all the records of whether or not a user clicked the advertisement link,
# so that's exactly the input of our model.
There are two main reasons why I choose to use PyMC3 to do the Bayesian modeling process. First, specifying prior distribution and the likelihood in PyMC3 is convenient because the syntax name is good to understand. Second, the model language in PyMC3 it's typically normal in Python language, or we call it Pythonic.
# We start off by explicitly defining a context manager unlike what we did before, so we had a pm.Model()
# It's wrapped by the with statement, and after the as keyword, we can define the model name, here we call
# it cost_distribution.
with pm.Model() as cost_distribution:
# Now we already have a context manager wrapped by a with statement, we still need the prior
# and likelihood in order to send it to a blackbox to update the distribution.
# Actually I've been curious about online advertisements for more than 10 years.
# I would've put a flat prior because I've been observing and clicking on some advertisements but
# not all and there's lots of data that is prior actually could come from.
# Let's say on average I am interested in online advertisements and clicked on the link around 25 times
# per 100 impressions. So we create a prior distribution for probability theta using the pm.Beta
# function as Beta distribution, with the first parameter being 25 and the second parameter being
# 100-25 - Beta(25, 100-25)
theta_google = pm.Beta('theta_google', alpha = 25, beta = 75)
theta_yahoo = pm.Beta('theta_yahoo', alpha = 25, beta = 75)
# That's the prior information. With that said, there's no way I'm going to hit say 95 out of 100
# there's no way I'm going to hit zero since it's barely true according to my past experience.
# That's kind of the idea: we put whatever we know about the problem before we collect the data for simulation.
# How about the likelihood? That's where our data comes into play! So what we're coming up here is a data
# generating mechanism where we come down to pick an appropriate distribution for the data. This is kind
# of an art that you can experiment probabilistic programming as seeing which distribution should be used
# in which cases.
# In this case, we measure click through rate. It's binary data. You get n impressions and you hit on
# X website links, so you can model that with a binomial distribution. We use the pm.Bernoulli function
# here to model each individual, binary click-through indicator by passing the prior probability theta
# and the two arrays of observed click-through data called y1 and y2.
google = pm.Bernoulli('google',p=theta_google, observed=click_google)
yahoo = pm.Bernoulli('yahoo',p=theta_yahoo, observed=click_yahoo)
# In PyMC3, the only difference that distinguishes this from prior is that I've got observed, the
# observed flag that includes the data. It says essentially these are fixed, I've observed these data
# and don't change them.
#mu = pm.normal(observed = online_ads['clicked'])
# The next step is, how do you get posterior distributions? This was the main obstacle, the hard bit
# for Bayesian analysis about 20 years ago when computational techniques were less developed.
# So Andrew Gelman and one of his graduate students came up with an automated sort of method called
# the No U-Turn Sampler in 2014 which tries to prevent "u-turn" coming back and forth in the
# sampling trajectory. You don't actually need to know all that when you do PyMC3. All we do is you
# call sample, we use the pm.sample function, and determine how many samples we want to get, here let's
# say 2000, and how many tuning samples we want, here let's say 1000, and pass with 3 separate chains.
trace = pm.sample(2000, tune=1000, chains = 3)
<ipython-input-5-ae05c8d36eb7>:48: FutureWarning: In v4.0, pm.sample will return an `arviz.InferenceData` object instead of a `MultiTrace` by default. You can pass return_inferencedata=True or return_inferencedata=False to be safe and silence this warning. trace = pm.sample(2000, tune=1000, chains = 3) Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Sequential sampling (3 chains in 1 job) NUTS: [theta_yahoo, theta_google]
Sampling 3 chains for 1_000 tune and 2_000 draw iterations (3_000 + 6_000 draws total) took 10 seconds.
# As you can see, PyMC3 determines what algorithm to use for sampling and in this case, the NUTS which
# stands for the No U-Turn Sampler for the click-through rate of Google and Yahoo advertisements.
# And you get a few thousands of samples in less than 10 seconds.
# So what can we do with these models? There are a lot! For instance, we could randomly pull out
# 10 simulated values from each provider in the model samples
import random
theta_google = random.choices(trace['theta_google'], k = 10)
theta_yahoo = random.choices(trace['theta_yahoo'], k = 10)
# make them into a pandas dataframe
compare = pd.DataFrame({'Google': theta_google, 'Yahoo': theta_yahoo})
# and we finally compare the results using a t-test from the stats package. Let's take a look.
stats.ttest_ind(compare['Google'], compare['Yahoo'])
# As you can see, we could contend with confidence that there is a huge difference between the number of people
# reacting to the advertisement that is sent by different providers.
Ttest_indResult(statistic=11.592194911841805, pvalue=8.786833323633555e-10)
Cool! Before we go further to use the models, let's talk a bit more about how they work. First of all, why do I choose 3 chains in the pm.sample function? Here I generated 3 chains so that 3 different simulations for the click-through rate are created, each is a study about people engaging in Google and Yahoo advertisements. If we adjust the chains to 10 instead of 3, then we actually increase the number of experiments. In PyMC3, each simulation can either be convergent (i.e. the sample trajectory oscillates near the destiny value) or divergent (i.e. the sample trajectory diverts from a range of points). Because of that, I'd recommend you to set a larger number of chains when the problem is complex or you're not very confident about your past knowledge and the data collected, but set a smaller number of chains if the problem is easier and you've good knowledge prior to the simulation.
So how about increasing the sample size for each simulation chain, say 20,000? I've used PyMC3 for a couple of years and most of the time if you increase the sample size of the simulation chain, you are more likely to see chains that eventually converge to some points, but the graph you'll see becomes more jaggy as you make the trajectory longer.
That's it! I hope this will help you gain some knowledge from practical experiences in using PyMC3 model sampling. As you might feel marvelous about is the point that the motivation behind PyMC3 is that this high-level language for specifying the Bayesian models has almost the same number of lines of code as you do the math. There's very little extra stuff going on here. Let's come back for the PyMC3 posterior inference in the next video!
In this lecture, we'll move on to discuss the posterior inference in PyMC3 using exactly the online advertisement data for case study. Once we get that posterior distribution you get a bunch of stuff for free. In other words, as you get the samples, you're able to get means and standard deviations of the posterior distribution, credible interval and everything else that you need, which is fantastic. Now let's pause here a bit. As an introduction, we're going to look at several short questions regarding what posterior diagnostics to use for analysis.
The main plot we are going to analyze in this video is called the traceplot.
# The main plot we are going to analyze is called the traceplot. This is kind of a plot to
# represent the entire sampling process and the sample quantities. We can create one by passing through
# the trace object to the pm.traceplot function.
pm.traceplot(trace)
# Cool! So let's look at the right hand side. These are kind of jagged looking plots here. Notice that
# this goes from 0 to 2000, so the x-axis is the samples, each iteration of samples from sample 1 to
# sample 2000. If we zoomed into the early points here, you might see that the chains do not fluctuate
# much since the beginning and of course the chains eventually converge, at around 0.3 for theta_google
# and 0.2 for theta_yahoo.
# And if we were to take all those samples and plot distributions, we'll get the density plots that show
# in the left hand side. You can see that both distributions look quite symmetric in that the first density
# plot which represents that the posterior click through rate for Google advertisements hovers mostly around
# 0.3 and the distribution tapers off for both sides. In other words, for every 10 users who are sent a Google
# advertisement, it's likely to see around three people clicking on the advertisement.
# In the second row we can see another symmetric distribution that the click-through rate for Yahoo
# advertisements hover mostly around 0.2, so for every 10 users who are sent a Yahoo advertisement, it is
# likely to see around two people clicking on the advertisement. It is possible to observe more or
# less people
# Notice that there are three overlapping density plots for each graph because we did three independent
# runs over 2000 samples, but you can see that the shape of each looks generally the same. This is nice
# because it means that the model is in a good shape, which in a technical term convergence.
<ipython-input-7-68e4709d61c7>:5: DeprecationWarning: The function `traceplot` from PyMC3 is just an alias for `plot_trace` from ArviZ. Please switch to `pymc3.plot_trace` or `arviz.plot_trace`. pm.traceplot(trace) /root/venv/lib/python3.9/site-packages/arviz/data/io_pymc3.py:96: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context. warnings.warn(
array([[<AxesSubplot:title={'center':'theta_google'}>, <AxesSubplot:title={'center':'theta_google'}>], [<AxesSubplot:title={'center':'theta_yahoo'}>, <AxesSubplot:title={'center':'theta_yahoo'}>]], dtype=object)
# Let's include a tablet form statistical summary to make our posterior diagnostics become more elaborated.
# This can be achieved by using the pm.summary() function.
# We can pass through a list of statistical functions to the summary function but I would like
# to keep it simple for now.
# Now let's us round all the posterior statistics to two digits
round(pm.summary(trace), 2)
# Cool! Here we get a pandas dataframe where each column contains the result of calculating the
# posterior statistics. The first few columns we can see the posterior mean and standard deviation,
# and the highest posterior density interval for click through rate.
# Looking at the summary we know that the shortest interval possible to cover 95% of the
# posterior click-through rate is [0.24, 0.35] for Google and [0.15, 0.24] for Yahoo.
# For each 100 users who are given an advertisement, now we are confident to say that between 24 and 35
# people will reach the advertisement if that advertisement is sent by Google.
# However, only between 15 and 24 users will click the advertisement if that is sent by Yahoo.
# Towards the end we can understand if the model is in a good shape by looking at the effective sample size
# in ess columns and convergence factors. Let's keep it simple for now and watch how to interpret posterior
# statistics in specialized tutorials moving to week 2.
/root/venv/lib/python3.9/site-packages/arviz/data/io_pymc3.py:96: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context. warnings.warn(
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
theta_google | 0.30 | 0.03 | 0.25 | 0.36 | 0.0 | 0.0 | 5381.0 | 3984.0 | 1.0 |
theta_yahoo | 0.19 | 0.03 | 0.14 | 0.24 | 0.0 | 0.0 | 5599.0 | 4103.0 | 1.0 |
That's it for now! You can see that the power of Bayesian estimation is that now we have a full distribution for sampling from the click-through rate of user viewing advertisements. Not only can we visualize the distribution, but also we can quantify how uncertain the posterior of click-through rate is, and most importantly, assess if our model is in a good shape. It's fortunate that our model is in good shape according to our observation in traceplot, so we don't actually need to make changes to the model prior and likelihood. But sometimes when the model becomes more complex, as you'll encounter in week 2 and week 3, changing the model using different prior as a starting point might be needed. But the main point I wanted to get across is that we can use this cool library PyMC3 to get a lot of useful information verbally and visually out of a dataset, which we don't just get a simple point estimate.
The Bayesian workflow is more than just specifying the model and running inference. Certainly those are the most notable components but just specifying models and generating samples do not constitute everything that could be useful in a full Bayesian workflow.
So ArviZ fits in for the other things - it is a Python package for visualization in Bayesian workflow. It fits in for evaluating the prior distributions, it helps us evaluate sample diagnostics and it helps summarize the posterior distribution or model fit.
In PyMC3 we've looked at the traceplot (well Arviz also provides that), but we definitely need more visualizations to help characterize some uncertainty for whatever problems that we are working on.
So the whole point of the Arviz Python package is to help organize data to make complex calculations much much easier for everybody. It helps us look inside the shape of the prior distributions and explore posterior results by performing more diagnostics on it after simulating the model in PyMC3.
# In this case, we are loading the diagnostic data from the 3 chains of 2000 samples of online advertising
# click-through rate. Let's import the Arviz library.
import arviz as az
# The first visualization which is useful for diagnosis is called the plot_posterior.
# We can create a highest posterior density plot for each provider regarding the click
# through rate by passing through the trace object, and setting the posterior probability to 0.95.
az.plot_posterior(trace , hdi_prob = 0.95)
# Here we get a distribution of the posterior click through rate right here for Google on the
# left panel, and Yahoo on the right panel. Both graphs are almost symmetric and we'll see the click-through
# rate of both providers are narrow intervals.
# Particularly, the most probable 95% falls between 0.24 and 0.36 for Google, and between 0.14 and 0.24
# for Yahoo, so no more than 40% of ads will receive a visit after watching. It barely happens.
/root/venv/lib/python3.9/site-packages/arviz/data/io_pymc3.py:96: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context. warnings.warn(
array([<AxesSubplot:title={'center':'theta_google'}>, <AxesSubplot:title={'center':'theta_yahoo'}>], dtype=object)
Here we conclude the short Arviz tutorial! Most of the tricks of using Arviz are nothing more than importing the package and applying one of the various Arviz functions that are available in the Arviz gallery (https://arviz-devs.github.io/arviz/examples/index.html). I encourage you to dive into some Arviz plots and reading these may help you choose the best visualization to interpret results to various target audiences.
Now we've explored the PyMC3 and the Arviz packages in Python. Both packages are important in this course because they support Bayesian modeling from planning the analysis, to executing models and interpreting posterior results. Now let's come back to the beginning. Our research question is Can we utilize the data about whether a user clicked into an advertisement or not and the advertisement provider to predict the cost that a company needs to offer online advertisement to reach 10,000 users? Let's rewrite a bit about the model and this time we aim to find the cost distribution for each provider.
# From the dataset we know that the cost for online advertisement is 0.3 per impression.
online_ads['cost']
cost = 0.3
# We again start off by defining a context manager using the pm.Model() argument and call the model
# cost_distribution.
with pm.Model() as cost_distribution:
# We'll use the same prior distribution for Google and Yahoo advertising providers regarding the
# click through rate
theta_google = pm.Beta('theta_google', alpha = 25, beta = 75)
theta_yahoo = pm.Beta('theta_yahoo', alpha = 25, beta = 75)
# Awesome! The next step is calculating the cost distribution for each provider. Because the number
# of impressions is not exactly the amount of users the advertisement reaches, you need to divide the
# number of users reached (here is 10,000) by the click through rate in order to find out how many
# impressions are needed to reach that amount of users. With that said, the lower the click
# through rate is, the more impressions are needed in order to reach 10,000 users. So the
# cost distribution is then calculated by multiplying the cost of each advertisement impression
# times 10,000 (which is the target user outreach) divided by the click-through-rate we have.
# That is, theta_google for Google, and theta_yahoo for Yahoo.
cost_google = pm.Deterministic('cost_google', cost * 10000 / theta_google)
cost_yahoo = pm.Deterministic('cost_yahoo', cost * 10000 / theta_yahoo)
# How about the likelihood? In this case, we don't need to change anything and still measure click
# through rate using the binary data. Remember that you get an impression and you hit on
# a specific website links with the probability given by theta_google, and theta_yahoo.
# So we use the pm.Bernoulli function here to model each individual, binary click-through indicator
# by passing the prior probability and the two arrays of observed click-through data called
# click_google and click_yahoo.
google = pm.Bernoulli('google',p=theta_google, observed=click_google)
yahoo = pm.Bernoulli('yahoo',p=theta_yahoo, observed=click_yahoo)
# The next step is, how do you get posterior distributions? Because the first model we implemented was
# totally in good shape, in which all the trajectories converge to a certain number, we can use the same
# pm.sample argument. But for a tutorial purpose, let's customize a bit for the sampling condition.
# This time let's determine that the number of samples we want to get for each trajectory is 5,000, so we change
# the first argument to 5,000. Since we haven't talked about the tune option yet, including 1000 tune samples
# means that the model will first draw a thousand samples and discard them from analysis, before drawing
# another 5000 samples that will show in the posterior diagnostics. So this option helps us to discard
# the first n samples from the model that otherwise the model might fluctuate unpredictably at the
# beginning of the trajectory. Finally, let's still simulate 3 crews of users engaging in Google, and
# Yahoo advertisements separately.
trace = pm.sample(5000, tune=1000, chains = 3)
<ipython-input-11-b3763578fb18>:40: FutureWarning: In v4.0, pm.sample will return an `arviz.InferenceData` object instead of a `MultiTrace` by default. You can pass return_inferencedata=True or return_inferencedata=False to be safe and silence this warning. trace = pm.sample(5000, tune=1000, chains = 3) Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Sequential sampling (3 chains in 1 job) NUTS: [theta_yahoo, theta_google]
Sampling 3 chains for 1_000 tune and 5_000 draw iterations (3_000 + 15_000 draws total) took 19 seconds.
Great! Now the trace object includes everything, including the expected cost needed for an organization to launch an online advertisement to 10,000 users.
pm.traceplot(trace)
# From the right hand size we see that all click through rates or the cost distribution for online
# advertisement appear to converge really well in all simulation chains. So it's nice as we
# don't need to be rerun the model with different model specifications.
# Now look at the left-hand side, which are density plots for the posterior distribution of click
# through rate and cost distribution for Google and Yahoo providers. We can see that the posterior click
# through rates for both providers seem to be quite symmetric with the highest point located at
# 0.3 and around 0.2 respectively. But if you look at the cost distributions, both graphs show a long
# tail at the upper end, which suggest that the posterior cost distributions are likely to be right skewed.
<ipython-input-12-4b673fced3da>:1: DeprecationWarning: The function `traceplot` from PyMC3 is just an alias for `plot_trace` from ArviZ. Please switch to `pymc3.plot_trace` or `arviz.plot_trace`. pm.traceplot(trace) /root/venv/lib/python3.9/site-packages/arviz/data/io_pymc3.py:96: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context. warnings.warn(
array([[<AxesSubplot:title={'center':'theta_google'}>, <AxesSubplot:title={'center':'theta_google'}>], [<AxesSubplot:title={'center':'theta_yahoo'}>, <AxesSubplot:title={'center':'theta_yahoo'}>], [<AxesSubplot:title={'center':'cost_google'}>, <AxesSubplot:title={'center':'cost_google'}>], [<AxesSubplot:title={'center':'cost_yahoo'}>, <AxesSubplot:title={'center':'cost_yahoo'}>]], dtype=object)
# Now let's zoom into interpreting the highest posterior density of the cost distribution.
# We should call the plot_posterior function in Arviz and set hdi_prob to 0.95 to produce
# 95% highest posterior density plots.
az.plot_posterior(trace, hdi_prob = 0.95)
# Interpretation 1: Cost Distribution
# Cool! At the end of the day we know that we are expected to spend around 10,000 bucks to the Google engine
# for online advertising if we were trying to reach 10,000 users, but how about Yahoo? It's quite different.
# Instead of spending 10,000, we need to pay up to slightly less than 16,000 bucks to the Yahoo engine
# in order to reach the same amount of users.
# Interpretation 2: Cost-effectiveness
# Seeing the plot you might want to recommend using the Google engine because it's more cost-effective
# compared to using the yahoo engine. Let's look at the right tail of the 95% highest posterior density
# plot for Google, you can see the cost 12,104 is actually larger than the left tail of the 95% highest
# posterior density plot for Yahoo, which is 11,874. What it tells us is that there are overlaps between
# the two posterior cost distributions, so we still fall short in concluding that the Google engine
# is more cost-effective in terms of online advertising than Yahoo with substantial evidence.
/root/venv/lib/python3.9/site-packages/arviz/data/io_pymc3.py:96: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context. warnings.warn(
array([<AxesSubplot:title={'center':'theta_google'}>, <AxesSubplot:title={'center':'theta_yahoo'}>, <AxesSubplot:title={'center':'cost_google'}>, <AxesSubplot:title={'center':'cost_yahoo'}>], dtype=object)
(in-class activity: In your perspective, what different research questions do you find interesting in online advertising? Here is the time you can propose alternative questions. You're more than welcome to write down and compare your questions with your peers!)
From this example, we found that one of the most salient advantage of using PyMC3 is that the model diagnostics not only returns the posterior cost distributions here for both online ads providers, but also helps us quantify the variability of both posterior distributions especially on both tails, thus yielding more realistic predictions. Noticed that the dataset that I collected for both engines was just taken from a short timespan so it might not represent the historical trend about the online advertisement cost for both engines. If some of you do encounter online advertisement problems as a data scientist or consultant, you might need more advanced longitudinal data to determine which search engines will best suit your client.