# Let's start with bringing in the pymc3 package 
import pymc3 as pm
# and a couple of regular data science packages for our analysis.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.


# We actually revisit the click-through rate example from week 3 in course 1. The data for this example
# is stored in our assets folder as online_ads.csv, and contains click-through results of couple hundreds 
# of advertisement impressions.

# Now we load our dataframe from the csv file using the pd.read_csv function. This is just a handy way of
# loading csv datasets.

online_ads = pd.read_csv('assets/online_ads.csv')

# Now let's look at the header of this dataset.
online_ads.head()


# Let's do a little visual inspection of the proportion of successful advertisement clicks that we are interested in.
# We could simply draw a histogram using the sns.histplot function and customize it using the bins parameter 
# and set it as 2. 
sns.histplot(x = 'clicked', data = online_ads, bins=2)
# Cool! This gives us an impression that less than a half of users actually clicked on a specific link 
# among all users who viewed a page, email, or advertisement.

<AxesSubplot:xlabel='clicked', ylabel='Count'>


# Now let's subset the clicked column grouped by each provider, so create click_google
click_google = online_ads[online_ads['provider'] == 'Google']['clicked']
# and click_yahoo be pandas.Series as observed data.
click_yahoo = online_ads[online_ads['provider'] == 'Yahoo']['clicked']

# Cool! These series have all the records of whether or not a user clicked the advertisement link, 
# so that's exactly the input of our model.


# We start off by explicitly defining a context manager unlike what we did before, so we had a pm.Model()
# It's wrapped by the with statement, and after the as keyword, we can define the model name, here we call
# it cost_distribution.

with pm.Model() as cost_distribution:

    # Now we already have a context manager wrapped by a with statement, we still need the prior
    # and likelihood in order to send it to a blackbox to update the distribution. 
    
    # Actually I've been curious about online advertisements for more than 10 years. 
    # I would've put a flat prior because I've been observing and clicking on some advertisements but 
    # not all and there's lots of data that is prior actually could come from.
    # Let's say on average I am interested in online advertisements and clicked on the link around 25 times 
    # per 100 impressions. So we create a prior distribution for probability theta using the pm.Beta 
    # function as Beta distribution, with the first parameter being 25 and the second parameter being 
    # 100-25 - Beta(25, 100-25)
    theta_google = pm.Beta('theta_google', alpha = 25, beta = 75)
    theta_yahoo = pm.Beta('theta_yahoo', alpha = 25, beta = 75)
    # That's the prior information. With that said, there's no way I'm going to hit say 95 out of 100 
    # there's no way I'm going to hit zero since it's barely true according to my past experience. 
    # That's kind of the idea: we put whatever we know about the problem before we collect the data for simulation.

    # How about the likelihood? That's where our data comes into play! So what we're coming up here is a data 
    # generating mechanism where we come down to pick an appropriate distribution for the data. This is kind
    # of an art that you can experiment probabilistic programming as seeing which distribution should be used 
    # in which cases.

    # In this case, we measure click through rate. It's binary data. You get n impressions and you hit on 
    # X website links, so you can model that with a binomial distribution. We use the pm.Bernoulli function
    # here to model each individual, binary click-through indicator by passing the prior probability theta 
    # and the two arrays of observed click-through data called y1 and y2.
    google = pm.Bernoulli('google',p=theta_google, observed=click_google)
    yahoo = pm.Bernoulli('yahoo',p=theta_yahoo, observed=click_yahoo)

    # In PyMC3, the only difference that distinguishes this from prior is that I've got observed, the
    # observed flag that includes the data. It says essentially these are fixed, I've observed these data
    # and don't change them.

    #mu = pm.normal(observed = online_ads['clicked'])

    # The next step is, how do you get posterior distributions? This was the main obstacle, the hard bit 
    # for Bayesian analysis about 20 years ago when computational techniques were less developed.  
    # So Andrew Gelman and one of his graduate students came up with an automated sort of method called 
    # the No U-Turn Sampler in 2014 which tries to prevent "u-turn" coming back and forth in the
    # sampling trajectory. You don't actually need to know all that when you do PyMC3. All we do is you
    # call sample, we use the pm.sample function, and determine how many samples we want to get, here let's
    # say 2000, and how many tuning samples we want, here let's say 1000, and pass with 3 separate chains.
    trace = pm.sample(2000, tune=1000, chains = 3)

<ipython-input-5-ae05c8d36eb7>:48: FutureWarning: In v4.0, pm.sample will return an `arviz.InferenceData` object instead of a `MultiTrace` by default. You can pass return_inferencedata=True or return_inferencedata=False to be safe and silence this warning.
  trace = pm.sample(2000, tune=1000, chains = 3)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (3 chains in 1 job)
NUTS: [theta_yahoo, theta_google]

Sampling 3 chains for 1_000 tune and 2_000 draw iterations (3_000 + 6_000 draws total) took 10 seconds.


# As you can see, PyMC3 determines what algorithm to use for sampling and in this case, the NUTS which 
# stands for the No U-Turn Sampler for the click-through rate of Google and Yahoo advertisements. 
# And you get a few thousands of samples in less than 10 seconds.

# So what can we do with these models? There are a lot! For instance, we could randomly pull out
# 10 simulated values from each provider in the model samples 

import random
theta_google = random.choices(trace['theta_google'], k = 10)
theta_yahoo = random.choices(trace['theta_yahoo'], k = 10)
# make them into a pandas dataframe

compare = pd.DataFrame({'Google': theta_google, 'Yahoo': theta_yahoo})
# and we finally compare the results using a t-test from the stats package. Let's take a look.
stats.ttest_ind(compare['Google'], compare['Yahoo'])

# As you can see, we could contend with confidence that there is a huge difference between the number of people 
# reacting to the advertisement that is sent by different providers.

Ttest_indResult(statistic=11.592194911841805, pvalue=8.786833323633555e-10)


# The main plot we are going to analyze is called the traceplot. This is kind of a plot to
# represent the entire sampling process and the sample quantities. We can create one by passing through
# the trace object to the pm.traceplot function.

pm.traceplot(trace)

# Cool! So let's look at the right hand side. These are kind of jagged looking plots here. Notice that
# this goes from 0 to 2000, so the x-axis is the samples, each iteration of samples from sample 1 to 
# sample 2000. If we zoomed into the early points here, you might see that the chains do not fluctuate 
# much since the beginning and of course the chains eventually converge, at around 0.3 for theta_google
# and 0.2 for theta_yahoo. 

# And if we were to take all those samples and plot distributions, we'll get the density plots that show
# in the left hand side. You can see that both distributions look quite symmetric in that the first density 
# plot which represents that the posterior click through rate for Google advertisements hovers mostly around 
# 0.3 and the distribution tapers off for both sides. In other words, for every 10 users who are sent a Google
# advertisement, it's likely to see around three people clicking on the advertisement. 

# In the second row we can see another symmetric distribution that the click-through rate for Yahoo 
# advertisements hover mostly around 0.2, so for every 10 users who are sent a Yahoo advertisement, it is 
# likely to see around two people clicking on the advertisement. It is possible to observe more or 
# less people 

# Notice that there are three overlapping density plots for each graph because we did three independent 
# runs over 2000 samples, but you can see that the shape of each looks generally the same. This is nice
# because it means that the model is in a good shape, which in a technical term convergence.

<ipython-input-7-68e4709d61c7>:5: DeprecationWarning: The function `traceplot` from PyMC3 is just an alias for `plot_trace` from ArviZ. Please switch to `pymc3.plot_trace` or `arviz.plot_trace`.
  pm.traceplot(trace)
/root/venv/lib/python3.9/site-packages/arviz/data/io_pymc3.py:96: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context.
  warnings.warn(

array([[<AxesSubplot:title={'center':'theta_google'}>,
        <AxesSubplot:title={'center':'theta_google'}>],
       [<AxesSubplot:title={'center':'theta_yahoo'}>,
        <AxesSubplot:title={'center':'theta_yahoo'}>]], dtype=object)


# Let's include a tablet form statistical summary to make our posterior diagnostics become more elaborated. 
# This can be achieved by using the pm.summary() function.
# We can pass through a list of statistical functions to the summary function but I would like
# to keep it simple for now.

# Now let's us round all the posterior statistics to two digits
round(pm.summary(trace), 2)

# Cool! Here we get a pandas dataframe where each column contains the result of calculating the 
# posterior statistics. The first few columns we can see the posterior mean and standard deviation, 
# and the highest posterior density interval for click through rate. 
# Looking at the summary we know that the shortest interval possible to cover 95% of the 
# posterior click-through rate is [0.24, 0.35] for Google and [0.15, 0.24] for Yahoo. 
# For each 100 users who are given an advertisement, now we are confident to say that between 24 and 35 
# people will reach the advertisement if that advertisement is sent by Google. 
# However, only between 15 and 24 users will click the advertisement if that is sent by Yahoo.

# Towards the end we can understand if the model is in a good shape by looking at the effective sample size 
# in ess columns and convergence factors. Let's keep it simple for now and watch how to interpret posterior 
# statistics in specialized tutorials moving to week 2.

/root/venv/lib/python3.9/site-packages/arviz/data/io_pymc3.py:96: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context.
  warnings.warn(


# In this case, we are loading the diagnostic data from the 3 chains of 2000 samples of online advertising
# click-through rate. Let's import the Arviz library.
import arviz as az

# The first visualization which is useful for diagnosis is called the plot_posterior.
# We can create a highest posterior density plot for each provider regarding the click 
# through rate by passing through the trace object, and setting the posterior probability to 0.95.
az.plot_posterior(trace , hdi_prob = 0.95)
# Here we get a distribution of the posterior click through rate right here for Google on the 
# left panel, and Yahoo on the right panel. Both graphs are almost symmetric and we'll see the click-through
# rate of both providers are narrow intervals.
# Particularly, the most probable 95% falls between 0.24 and 0.36 for Google, and between 0.14 and 0.24 
# for Yahoo, so no more than 40% of ads will receive a visit after watching. It barely happens.

/root/venv/lib/python3.9/site-packages/arviz/data/io_pymc3.py:96: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context.
  warnings.warn(

array([<AxesSubplot:title={'center':'theta_google'}>,
       <AxesSubplot:title={'center':'theta_yahoo'}>], dtype=object)


# From the dataset we know that the cost for online advertisement is 0.3 per impression.
online_ads['cost']
cost = 0.3


# We again start off by defining a context manager using the pm.Model() argument and call the model 
# cost_distribution.

with pm.Model() as cost_distribution:

    # We'll use the same prior distribution for Google and Yahoo advertising providers regarding the
    # click through rate
    theta_google = pm.Beta('theta_google', alpha = 25, beta = 75)
    theta_yahoo = pm.Beta('theta_yahoo', alpha = 25, beta = 75)
    
    # Awesome! The next step is calculating the cost distribution for each provider. Because the number 
    # of impressions is not exactly the amount of users the advertisement reaches, you need to divide the 
    # number of users reached (here is 10,000) by the click through rate in order to find out how many 
    # impressions are needed to reach that amount of users. With that said, the lower the click 
    # through rate is, the more impressions are needed in order to reach 10,000 users. So the
    # cost distribution is then calculated by multiplying the cost of each advertisement impression 
    # times 10,000 (which is the target user outreach) divided by the click-through-rate we have. 
    # That is, theta_google for Google, and theta_yahoo for Yahoo.
    cost_google = pm.Deterministic('cost_google', cost * 10000 / theta_google)
    cost_yahoo = pm.Deterministic('cost_yahoo', cost * 10000 / theta_yahoo)

    # How about the likelihood? In this case, we don't need to change anything and still measure click 
    # through rate using the binary data. Remember that you get an impression and you hit on 
    # a specific website links with the probability given by theta_google, and theta_yahoo.
    # So we use the pm.Bernoulli function here to model each individual, binary click-through indicator 
    # by passing the prior probability and the two arrays of observed click-through data called 
    # click_google and click_yahoo.
    google = pm.Bernoulli('google',p=theta_google, observed=click_google)
    yahoo = pm.Bernoulli('yahoo',p=theta_yahoo, observed=click_yahoo)

    # The next step is, how do you get posterior distributions? Because the first model we implemented was 
    # totally in good shape, in which all the trajectories converge to a certain number, we can use the same
    # pm.sample argument. But for a tutorial purpose, let's customize a bit for the sampling condition.
    # This time let's determine that the number of samples we want to get for each trajectory is 5,000, so we change 
    # the first argument to 5,000. Since we haven't talked about the tune option yet, including 1000 tune samples 
    # means that the model will first draw a thousand samples and discard them from analysis, before drawing 
    # another 5000 samples that will show in the posterior diagnostics. So this option helps us to discard 
    # the first n samples from the model that otherwise the model might fluctuate unpredictably at the 
    # beginning of the trajectory. Finally, let's still simulate 3 crews of users engaging in Google, and
    # Yahoo advertisements separately.
    trace = pm.sample(5000, tune=1000, chains = 3)

<ipython-input-11-b3763578fb18>:40: FutureWarning: In v4.0, pm.sample will return an `arviz.InferenceData` object instead of a `MultiTrace` by default. You can pass return_inferencedata=True or return_inferencedata=False to be safe and silence this warning.
  trace = pm.sample(5000, tune=1000, chains = 3)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (3 chains in 1 job)
NUTS: [theta_yahoo, theta_google]

Sampling 3 chains for 1_000 tune and 5_000 draw iterations (3_000 + 15_000 draws total) took 19 seconds.


pm.traceplot(trace)

# From the right hand size we see that all click through rates or the cost distribution for online 
# advertisement appear to converge really well in all simulation chains. So it's nice as we 
# don't need to be rerun the model with different model specifications.

# Now look at the left-hand side, which are density plots for the posterior distribution of click 
# through rate and cost distribution for Google and Yahoo providers. We can see that the posterior click 
# through rates for both providers seem to be quite symmetric with the highest point located at 
# 0.3 and around 0.2 respectively. But if you look at the cost distributions, both graphs show a long
# tail at the upper end, which suggest that the posterior cost distributions are likely to be right skewed.

<ipython-input-12-4b673fced3da>:1: DeprecationWarning: The function `traceplot` from PyMC3 is just an alias for `plot_trace` from ArviZ. Please switch to `pymc3.plot_trace` or `arviz.plot_trace`.
  pm.traceplot(trace)
/root/venv/lib/python3.9/site-packages/arviz/data/io_pymc3.py:96: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context.
  warnings.warn(

array([[<AxesSubplot:title={'center':'theta_google'}>,
        <AxesSubplot:title={'center':'theta_google'}>],
       [<AxesSubplot:title={'center':'theta_yahoo'}>,
        <AxesSubplot:title={'center':'theta_yahoo'}>],
       [<AxesSubplot:title={'center':'cost_google'}>,
        <AxesSubplot:title={'center':'cost_google'}>],
       [<AxesSubplot:title={'center':'cost_yahoo'}>,
        <AxesSubplot:title={'center':'cost_yahoo'}>]], dtype=object)


# Now let's zoom into interpreting the highest posterior density of the cost distribution.
# We should call the plot_posterior function in Arviz and set hdi_prob to 0.95 to produce
# 95% highest posterior density plots.
az.plot_posterior(trace, hdi_prob = 0.95)

# Interpretation 1: Cost Distribution
# Cool! At the end of the day we know that we are expected to spend around 10,000 bucks to the Google engine
# for online advertising if we were trying to reach 10,000 users, but how about Yahoo? It's quite different.
# Instead of spending 10,000, we need to pay up to slightly less than 16,000 bucks to the Yahoo engine
# in order to reach the same amount of users. 

# Interpretation 2: Cost-effectiveness
# Seeing the plot you might want to recommend using the Google engine because it's more cost-effective 
# compared to using the yahoo engine. Let's look at the right tail of the 95% highest posterior density
# plot for Google, you can see the cost 12,104 is actually larger than the left tail of the 95% highest 
# posterior density plot for Yahoo, which is 11,874. What it tells us is that there are overlaps between 
# the two posterior cost distributions, so we still fall short in concluding that the Google engine
# is more cost-effective in terms of online advertising than Yahoo with substantial evidence.

/root/venv/lib/python3.9/site-packages/arviz/data/io_pymc3.py:96: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context.
  warnings.warn(

array([<AxesSubplot:title={'center':'theta_google'}>,
       <AxesSubplot:title={'center':'theta_yahoo'}>,
       <AxesSubplot:title={'center':'cost_google'}>,
       <AxesSubplot:title={'center':'cost_yahoo'}>], dtype=object)

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
theta_google	0.30	0.03	0.25	0.36	0.0	0.0	5381.0	3984.0	1.0
theta_yahoo	0.19	0.03	0.14	0.24	0.0	0.0	5599.0	4103.0	1.0

Video 1: PyMC3 Sampling¶

Model specification:¶

Question 1: Why do we want to use PyMC3 to generate posterior samples?¶

Question 2: What is needed to perform a simulation in PyMC3?¶

Question 3: How to determine the sample size and chains in the sampling process?¶

Video 2: PyMC3 Posterior Inference¶

Question 1: What features are essential to interpret after sampling for almost every PyMC3 model?¶

Question 2: In what way could we know the quantities of posterior statistics?¶

Question 3: In what way could we visually understand the posterior results?¶

Question 4: In what way could we know if the sampled model is in a good shape?¶

Video 3: Introduction to Arviz¶

Question 1: What does the Arviz package do?¶

Question 2: Why do we need Arviz during the process of performing Bayesian analysis?¶

Question 3: Where could we find out documentation and recent updates about the Arviz package?¶

Question 4: How does Arviz help fulfill the 4-step Bayesian analysis?¶

Optional Video: Cost Distribution¶

Solution:¶

	advertisement	provider	clicked	cost
0	MongoDB	Google	1	0.3
1	MongoDB	Google	0	0.3
2	MongoDB	Google	1	0.3
3	MongoDB	Google	0	0.3
4	MongoDB	Google	1	0.3