In this module, we will learn about a cool concept called Bayesian regression. Regression is central to the practice of statistics. It is part of the core knowledge expected of any statistician or data scientist, because most of us are curious about explaining relationships. We want to understand, for example, whether spending more time studying leads to better exam results or a deeper mastery of knowledge, and we also want to know how strongly the time spent studying affects learning outcomes. In short, we want to find the relationship between factors and an outcome.
I'll give you a quick snapshot of the Bayesian regression framework, including some of the assumptions we are going to need, and then we'll talk about how to attack the regression problem using the Bayesian framework. So what are the components of a Bayesian regression model? Every time we look at a dataset we get multiple features in it. Let's consider the following research question: what makes a popular TED Talk? There are several factors we might imagine contributing to the popularity of TED Talks: the number of positive ratings received, the length of the talk, who the speaker is, the title of the talk, or even the number of speakers invited to the talk. We call these factors independent variables, or predictive features. Obviously, our goal is to find out their relationship to the popularity of TED Talks, so the popularity of a TED Talk (say, the number of views it receives) is defined as the dependent variable of the regression model.
The biggest motivation for using Bayesian methodology in regression analysis is that it's really flexible. In Bayesian regression, probabilities can be based on degrees of belief. Our prior is a guess about the model parameters of the independent variables based on expert knowledge. Maybe you know something about the problem, or maybe you have some previous information that we are not going to collect now but that you still want to use in the model. We then want to update that belief with new information (like a dataset) in a principled manner. Depending on the model we choose, we can solve problems of very different natures that other statistical methods might find difficult to approach.
Following Bayes' rule from course 1, the graph shows that instead of depending only on the data, Bayesian methods combine the prior information with the likelihood (a specification of the data) to obtain posterior diagnostics in Bayesian regression. Posterior diagnostics contain not only point estimates such as the posterior mean and posterior variance, but also 1) how correlated each posterior sample is with the previous value, 2) whether each simulation of posterior samples is similar in distribution to the others, and 3) the probability density corresponding to different posterior estimates. The Bayesian model therefore provides richer information than other statistical paradigms, and because of this you can gain insights from the analysis that we otherwise would not have.
The probabilistic programming package PyMC3 offers various functionalities for building Bayesian regression models, and the language PyMC3 uses is among the easiest to work with of the packages that support Bayesian analysis. Similar to updating probability distributions in week one, PyMC3 uses Markov chain Monte Carlo simulation to return a posterior distribution for each predictive feature in the Bayesian regression model. It tells us not only the posterior distribution of each predictive feature, but also whether the independent simulation chains of posterior samples are similar to each other. Cool!
So I hope you enjoyed this very short introduction to Bayesian regression. I'm going to move on to coding. Just one last emphasis! In this specialization, I'll definitely not provide ideas or research questions that dive into spurious relationships. For instance, I won't teach modeling to find the relationship between average chocolate consumption by country and the number of Nobel laureates in that country. As data scientists, we need to be critical when reading articles from various sources, and when proposing problems we need to make sure the relationships we're going to explain to our audience are logical and reasonable. We should never deceive ourselves or other people with spurious relationships that do not benefit our understanding of the world. Cool! Now that we have discussed some general ideas about regression, I'll see you in the live coding demos!
Now that we have discussed some general ideas about Bayesian regression, let's begin to learn how to build linear regression models. First, take a look at the sampling process of regression models.
First of all, to estimate a regression model we should specify a probability distribution for the data (say y1, y2, ..., yn are the outcome data), such as normal, Poisson, or binomial. The Bayesian approach requires us to specify prior distributions not only for the parameters, but also for a homoscedastic error term, that is, a constant variability around each outcome.
In this case we are doing a linear regression. We have two main assumptions, and we want to find the values of the regression coefficients that best let the model predict TED Talk views. The Bayesian linear regression model of TED Talk views has the following form:
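(The equation itself was shown on a slide; reconstructed from the description that follows, it has roughly this form, where x_1 through x_k are the predictors:)

views_i ~ Normal(mu_i, sigma)
mu_i = beta_0 + beta_1 * x_i1 + ... + beta_k * x_ik,   with sigma > 0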
To make things clear, the dependent variable, TED Talk views, is assumed to follow a normal distribution. The expression enclosed in parentheses indicates that we predict the mean value of TED Talk views as a linear combination of regression coefficients times predictors plus the intercept, whereas the standard deviation is a positive number that represents how much variability, or fluctuation, we expect around the predicted TED Talk views.
# So let's come back to the question. What makes a popular Ted Talk?
# Let's start by bringing in the PyMC3 package
import pymc3 as pm
# and the ArviZ package
import arviz as az
# and a couple of regular data science packages for our analysis.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
# We start by importing the Ted talk dataset using the pd.read_csv function
ted_talk = pd.read_csv('assets/ted_main.csv')
# and take a sneak peek at the head of the data.
ted_talk.head()
comments | description | duration | event | film_date | languages | main_speaker | name | num_speaker | published_date | ratings | related_talks | speaker_occupation | tags | title | url | views | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4553 | Sir Ken Robinson makes an entertaining and pro... | 1164 | TED2006 | 1140825600 | 60 | Ken Robinson | Ken Robinson: Do schools kill creativity? | 1 | 1151367060 | [{'id': 7, 'name': 'Funny', 'count': 19645}, {... | [{'id': 865, 'hero': 'https://pe.tedcdn.com/im... | Author/educator | ['children', 'creativity', 'culture', 'dance',... | Do schools kill creativity? | https://www.ted.com/talks/ken_robinson_says_sc... | 47227110 |
1 | 265 | With the same humor and humanity he exuded in ... | 977 | TED2006 | 1140825600 | 43 | Al Gore | Al Gore: Averting the climate crisis | 1 | 1151367060 | [{'id': 7, 'name': 'Funny', 'count': 544}, {'i... | [{'id': 243, 'hero': 'https://pe.tedcdn.com/im... | Climate advocate | ['alternative energy', 'cars', 'climate change... | Averting the climate crisis | https://www.ted.com/talks/al_gore_on_averting_... | 3200520 |
2 | 124 | New York Times columnist David Pogue takes aim... | 1286 | TED2006 | 1140739200 | 26 | David Pogue | David Pogue: Simplicity sells | 1 | 1151367060 | [{'id': 7, 'name': 'Funny', 'count': 964}, {'i... | [{'id': 1725, 'hero': 'https://pe.tedcdn.com/i... | Technology columnist | ['computers', 'entertainment', 'interface desi... | Simplicity sells | https://www.ted.com/talks/david_pogue_says_sim... | 1636292 |
3 | 200 | In an emotionally charged talk, MacArthur-winn... | 1116 | TED2006 | 1140912000 | 35 | Majora Carter | Majora Carter: Greening the ghetto | 1 | 1151367060 | [{'id': 3, 'name': 'Courageous', 'count': 760}... | [{'id': 1041, 'hero': 'https://pe.tedcdn.com/i... | Activist for environmental justice | ['MacArthur grant', 'activism', 'business', 'c... | Greening the ghetto | https://www.ted.com/talks/majora_carter_s_tale... | 1697550 |
4 | 593 | You've never seen data presented like this. Wi... | 1190 | TED2006 | 1140566400 | 48 | Hans Rosling | Hans Rosling: The best stats you've ever seen | 1 | 1151440680 | [{'id': 9, 'name': 'Ingenious', 'count': 3202}... | [{'id': 2056, 'hero': 'https://pe.tedcdn.com/i... | Global health expert; data visionary | ['Africa', 'Asia', 'Google', 'demo', 'economic... | The best stats you've ever seen | https://www.ted.com/talks/hans_rosling_shows_t... | 12005869 |
# It's also good practice to understand how large the dataset we are going to analyze is, because with a
# larger sample size we expect the data likelihood to overwhelm the prior,
# so that our result is based more on the data, our evidence. So let's
# find out the size of our data.
ted_talk.shape
# Cool!
(2550, 17)
So, our DataFrame contains a total of 2550 rows and 17 variables, where each row contains the record of a particular TED talk. It's a rich dataset: the 17 variables are comments, description, duration, event, film_date, languages, main_speaker, name, num_speaker, published_date, ratings, related_talks, speaker_occupation, tags, title, url, and views.
Cool! It seems like a long list of columns. Do we need all these features to understand what factors are contributing to the popularity of Ted talks? Not really! So let's first do a little visual inspection of the attributes we are interested in.
# Pre-processing is optional
# Choose columns: Only keeping the variables to be used in the analysis
ted_talk = ted_talk[["comments","duration","languages","main_speaker","num_speaker",
"published_date","ratings","speaker_occupation","tags","views","title"]]
# Since the published dates of the TED talks are coded as Unix timestamps, in units of seconds,
# we can transform those timestamps into readable dates by passing them to the pd.to_datetime
# function and specifying unit='s'. Converting between different time formats like this
# is very often useful.
ted_talk['published_date'] = pd.to_datetime(ted_talk['published_date'], unit='s')
# To simplify the interpretation of video views, I also broadcast over the entire column by dividing
# it by 1 million, so the views are now expressed in units of millions.
ted_talk['views'] = ted_talk['views'] / 1000000
# It also sounds better to convert the video duration from seconds to minutes by dividing
# the duration by 60.
ted_talk['duration'] = ted_talk['duration'] / 60
# Now let's take a glimpse at the data set again.
ted_talk.head()
# The duration and views columns now look easier to interpret! This step isn't strictly necessary,
# but it's good to keep in mind that broadcasting is a convenient way to rescale a dataset.
comments | duration | languages | main_speaker | num_speaker | published_date | ratings | speaker_occupation | tags | views | title | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4553 | 19.400000 | 60 | Ken Robinson | 1 | 2006-06-27 00:11:00 | [{'id': 7, 'name': 'Funny', 'count': 19645}, {... | Author/educator | ['children', 'creativity', 'culture', 'dance',... | 47.227110 | Do schools kill creativity? |
1 | 265 | 16.283333 | 43 | Al Gore | 1 | 2006-06-27 00:11:00 | [{'id': 7, 'name': 'Funny', 'count': 544}, {'i... | Climate advocate | ['alternative energy', 'cars', 'climate change... | 3.200520 | Averting the climate crisis |
2 | 124 | 21.433333 | 26 | David Pogue | 1 | 2006-06-27 00:11:00 | [{'id': 7, 'name': 'Funny', 'count': 964}, {'i... | Technology columnist | ['computers', 'entertainment', 'interface desi... | 1.636292 | Simplicity sells |
3 | 200 | 18.600000 | 35 | Majora Carter | 1 | 2006-06-27 00:11:00 | [{'id': 3, 'name': 'Courageous', 'count': 760}... | Activist for environmental justice | ['MacArthur grant', 'activism', 'business', 'c... | 1.697550 | Greening the ghetto |
4 | 593 | 19.833333 | 48 | Hans Rosling | 1 | 2006-06-27 20:38:00 | [{'id': 9, 'name': 'Ingenious', 'count': 3202}... | Global health expert; data visionary | ['Africa', 'Asia', 'Google', 'demo', 'economic... | 12.005869 | The best stats you've ever seen |
ted_talk.tail()
comments | duration | languages | main_speaker | num_speaker | published_date | ratings | speaker_occupation | tags | views | title | |
---|---|---|---|---|---|---|---|---|---|---|---|
2545 | 17 | 7.933333 | 4 | Duarte Geraldino | 1 | 2017-09-19 20:00:16 | [{'id': 3, 'name': 'Courageous', 'count': 24},... | Journalist | ['TED Residency', 'United States', 'community'... | 0.450430 | What we're missing in the debate about immigra... |
2546 | 6 | 4.833333 | 3 | Armando Azua-Bustos | 1 | 2017-09-20 15:02:17 | [{'id': 22, 'name': 'Fascinating', 'count': 32... | Astrobiologist | ['Mars', 'South America', 'TED Fellows', 'astr... | 0.417470 | The most Martian place on Earth |
2547 | 10 | 10.850000 | 1 | Radhika Nagpal | 1 | 2017-09-21 15:01:35 | [{'id': 1, 'name': 'Beautiful', 'count': 14}, ... | Robotics engineer | ['AI', 'ants', 'fish', 'future', 'innovation',... | 0.375647 | What intelligent machines can learn from a sch... |
2548 | 32 | 18.333333 | 1 | Theo E.J. Wilson | 1 | 2017-09-21 20:00:42 | [{'id': 11, 'name': 'Longwinded', 'count': 3},... | Public intellectual | ['Internet', 'TEDx', 'United States', 'communi... | 0.419309 | A black man goes undercover in the alt-right |
2549 | 8 | 8.650000 | 1 | Karoliina Korppoo | 1 | 2017-09-22 15:00:22 | [{'id': 21, 'name': 'Unconvincing', 'count': 2... | Game designer | ['cities', 'design', 'future', 'infrastructure... | 0.391721 | How a video game might help us build better ci... |
Up till now we've pre-processed the data. Now let's simplify our research question a bit for tutorial purposes. Can we predict the popularity of a TED talk (i.e., its number of views) if we know the length of the video, how many languages are available, and the number of comments following the talk? Which of these factors contribute significantly to the popularity of TED talks?
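A quick visual inspection of those three candidate predictors against views, not shown in the original notebook, might look like this sketch:

# Scatter each candidate predictor against views to eyeball the linear trends
fig, axes = plt.subplots(1, 3, figsize = (15, 4))
for ax, col in zip(axes, ['duration', 'languages', 'comments']):
    sns.scatterplot(x = ted_talk[col], y = ted_talk['views'], ax = ax, alpha = 0.3)
    ax.set_ylabel('views (millions)')
plt.tight_layout()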
# Let's start off by creating a context manager with pm.Model() and naming the model
# ted_view_regression, so we know we are fitting a regression model of TED talk views.
with pm.Model() as ted_view_regression:
# The first step is specifying the prior distributions, because the goal of a regression model is
# to find out how large the effect of each predictor is on the increase of TED talk views.
# The first parameter we need is the baseline effect. We ask a question: how popular would a talk be
# if it were a hypothetical talk without any comments [# comments = 0], without any language
# transcriptions [# languages = 0], and with zero length [# duration = 0]?
# Most of the time we do not know the exact value of the intercept, so when we set a prior for it,
# it's reasonable to make a generic choice: a wide normal prior with a large standard deviation.
# We make use of the idea that the average TED talk hovers around 5 million views, and views can't
# be negative, so a mean of 5 million seems a reasonable number. At the end of
# the day, I have little faith that the intercept is exactly 5 million, so let's place a large
# standard deviation (3 million) on it.
intercept = pm.Normal("Intercept", mu = 5, sd = 3)
# Regression coefficients:
# Now let's make educated guesses: based on our prior knowledge, how much popularity will
# a video gain if the talk is one minute longer?
# We hypothesize a positive effect on the views for every extra minute of duration,
# every extra language, and every extra comment.
beta_duration = pm.Normal('duration', mu = 0.05, sd = 0.3)
# How much popularity will a video gain if the talk offers one more language transcription?
beta_languages = pm.Normal('languages', mu = 0.05, sd = 0.1)
# By how much does an increase of video comment affect the Ted talk popularity?
beta_comments = pm.Normal('comments', mu = 0.05, sd = 0.1)
# Error term:
# One assumption we need to satisfy in order to model the relationship between these predictors
# and the popularity is a homoscedastic (constant) variability for the model.
# The variability has to be a positive number; because of this, distributions such as the half-normal
# and half-Cauchy, which keep only the positive side of the normal and Cauchy distributions
# respectively, are natural choices.
epsilon = pm.HalfCauchy('epsilon', 5)
# Expected value: views ~ duration + languages + comments
# expected_value = pm.Deterministic("expected_value", intercept + beta_duration * ted_talk['duration']
#     + beta_languages * ted_talk['languages'] + beta_comments * ted_talk['comments'])
# Specify the likelihood (Bayes' rule: prior * likelihood is proportional to posterior)
# How about the likelihood? That's where our data comes into play! What we are modeling here is
# the popularity of 2,550 TED Talks, and we need an appropriate distribution to model the number of views.
# This is part of the art of probabilistic programming: you experiment with different distributions and make
# comparisons until you discover a suitable one. It's quite subjective, but at the end
# of the day, as you become able to tell the shape of each distribution, the process of defining
# the likelihood becomes more straightforward.
# In this case, we measure the number of TED talk views. It's continuous, positive data
# with no fixed upper bound, and here we model it with a normal distribution.
# We use the pm.Normal function and pass the name of the variable - we just call it likelihood.
# Then notice that the second parameter is the expected predicted value of views, which is of the form
# intercept + the coefficient for video duration times the video duration + the coefficient for languages
# times the number of languages offered for the video + the coefficient for comments times the number
# of comments under the video. This is nothing more than our programming language specifying
# the linear relationship of the predictors (independent variables) to the outcome, which is the
# number of views.
# In the third parameter we pass epsilon, which is the variability of the model, and most importantly
# we provide the actual number of views as observed, which finishes specifying the likelihood.
likelihood = pm.Normal(
'likelihood',
mu = intercept + beta_duration * ted_talk['duration'] + \
beta_languages * ted_talk['languages'] + beta_comments * ted_talk['comments'],
sd = epsilon,
observed = ted_talk['views'])
# (When the views were kept as raw counts rather than millions, the dependent variable was so large
# that the PyMC3 model failed to sample properly over several attempts.)
# We finally reached the sampling step. This time let's draw more samples for each simulation:
# 4,000 retained samples per chain, each preceded by 2,000 tuning samples that are discarded
# before the model starts to return sample values. We again run three chains.
trace = pm.sample(4000, tune = 2000, chains = 3)
<ipython-input-5-dd2067565713>:76: FutureWarning: In v4.0, pm.sample will return an `arviz.InferenceData` object instead of a `MultiTrace` by default. You can pass return_inferencedata=True or return_inferencedata=False to be safe and silence this warning. trace = pm.sample(4000, tune = 2000, chains = 3) Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (3 chains in 4 jobs) NUTS: [epsilon, comments, languages, duration, Intercept]
Sampling 3 chains for 2_000 tune and 4_000 draw iterations (6_000 + 12_000 draws total) took 25 seconds.
# Clean code:
with pm.Model() as ted_view_regression:
intercept = pm.Normal("Intercept", 5, sigma=3)
beta_duration = pm.Normal('duration', mu = 0.05, sd = 0.3)
beta_languages = pm.Normal('languages', mu = 0.05, sd = 0.1)
beta_comments = pm.Normal('comments', mu = 0.05, sd = 0.1)
epsilon = pm.HalfCauchy('epsilon', 5)
likelihood = pm.Normal('likelihood', mu = intercept + beta_duration * ted_talk['duration'] + beta_languages * ted_talk['languages'] + beta_comments * ted_talk['comments'], sd = epsilon, observed = ted_talk['views'])
trace = pm.sample(4000, tune = 2000, chains = 3)
<ipython-input-6-31795a02d6a3>:10: FutureWarning: In v4.0, pm.sample will return an `arviz.InferenceData` object instead of a `MultiTrace` by default. You can pass return_inferencedata=True or return_inferencedata=False to be safe and silence this warning. trace = pm.sample(4000, tune = 2000, chains = 3) Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (3 chains in 4 jobs) NUTS: [epsilon, comments, languages, duration, Intercept]
Sampling 3 chains for 2_000 tune and 4_000 draw iterations (6_000 + 12_000 draws total) took 22 seconds. The acceptance probability does not match the target. It is 0.8917431144631238, but should be close to 0.8. Try to increase the number of tuning steps.
During the sampling process you can take a coffee break. Okay! Now you've created your first Bayesian linear regression model, and the model took only about half a minute to sample the posterior distribution numerically (25 seconds in the output above). The output says "Sampling 3 chains for 2_000 tune and 4_000 draw iterations": we run three independent chains, each drawing 2,000 tuning samples followed by 4,000 samples, and only the last 4,000 samples of each chain are returned. Each chain is a separate simulation, a different simulated world of TED talk samples, and our model returns three simulated runs in total.
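We can verify those numbers directly on the trace object; a quick sketch (these are standard attributes of PyMC3's MultiTrace):

print(trace.nchains)            # expected: 3 chains
print(len(trace['duration']))   # expected: 12000 = 3 chains x 4,000 retained draws each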
# Before reaching the posterior statistics, let's look at the predicted value of TED talk views
# given our model. In order to compare the true and predicted values of views, we create an array
# called predicted, computed as the sum of each regression
# coefficient's posterior mean times the corresponding independent variable, plus the intercept.
predicted = trace['Intercept'].mean() * np.repeat(1, len(ted_talk['duration'])) + trace['duration'].mean() * ted_talk['duration'] + trace['languages'].mean() * ted_talk['languages'] + trace['comments'].mean() * ted_talk['comments']
# Compare the predicted values coming from the model against the true views taken directly from the data.
pd.DataFrame({"true": ted_talk['views'], "predicted": predicted})
# As you can see from this DataFrame, if the actual views of a talk are high, the regression model tends
# to return a larger expected number of views as the outcome, and vice versa.
# So we can claim that the regression model explains
# part of the variability of the observed views given the duration of the video, the number of available languages, and the
# number of audience comments. But some variability remains unexplained: substantial prediction errors
# exist in cases where the true value goes above roughly 20 million.
# For example, in row 5 (not displayed above), a talk with 20.68 million views is predicted to have 4.3 million views.
# It might be that this video simply belongs to a group of unusually popular talks, but we are
# uncertain about the reason, for instance whether there are time-series patterns in TED talk
# views or not.
true | predicted | |
---|---|---|
0 | 47.227110 | 21.033519 |
1 | 3.200520 | 3.088669 |
2 | 1.636292 | 1.534230 |
3 | 1.697550 | 2.359842 |
4 | 12.005869 | 4.787989 |
... | ... | ... |
2545 | 0.450430 | -0.674909 |
2546 | 0.417470 | -0.859959 |
2547 | 0.375647 | -0.830412 |
2548 | 0.419309 | -0.561897 |
2549 | 0.391721 | -0.891911 |
2550 rows × 2 columns
Let's draw another plot, a regression plot, that can shed light on how well a linear relationship describes the predicted views versus the actual views from the data.
# We can draw that regression plot using sns.regplot, passing the same inputs as for a scatterplot.
sns.regplot(x = predicted, y = ted_talk['views']).set(xlabel = "predicted views", ylabel = "actual views")
# Although a regression model can be powerful, we still find many cases where the predicted views are
# either significantly larger or smaller than the actual number.
# If the actual video views were well predicted, the graph would show the dots lying very close
# to the regression line, and the lighter blue band would not widen as the predicted views
# increase. As it is, as the predicted views increase, the model is less confident about the precision of
# the estimated views. The uncertainty is quite understandable. One possible explanation
# is that with more interactions (given by the number of comments) and a potentially wider audience (given
# by a large number of available languages), it becomes progressively harder to predict exactly how
# many people will eventually watch the TED talk, so the model represents its estimates with wider
# error bars. But in any case, the linear regression can partially
# explain the overall variability of the actual views, because the positive direction of the predictions
# versus the actual views is accurate.
# As a practice, I encourage you to look into the cases with negative predicted views, and then rerun the model
# with different priors and with other predictors that you believe are important. Try coding it yourself!
[Figure: sns.regplot of predicted views (x-axis) against actual views (y-axis)]
In this video, we've demonstrated how to run a Bayesian linear regression using PyMC3. In any case, after seeing a few negative values in the predicted outcome, you might consider that one big improvement to the regression model in this tutorial would be to ensure that the predicted video views are all positive, because no video should have negative views. We are still uncertain about how well this model performs, so let's move on to the next video where we'll conduct some posterior diagnostics.
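Before moving on, here is one hedged sketch of that improvement, not part of this tutorial: model the views on the log scale so that predictions are positive by construction. The model name and priors below are placeholders rather than the course's choices.

with pm.Model() as ted_log_view_regression:
    intercept = pm.Normal('Intercept', mu = 0, sd = 2)
    beta_duration = pm.Normal('duration', mu = 0, sd = 0.1)
    beta_languages = pm.Normal('languages', mu = 0, sd = 0.1)
    beta_comments = pm.Normal('comments', mu = 0, sd = 0.1)
    epsilon = pm.HalfCauchy('epsilon', 2)
    mu = (intercept + beta_duration * ted_talk['duration']
          + beta_languages * ted_talk['languages']
          + beta_comments * ted_talk['comments'])
    # Modeling log(views) means exponentiated predictions are always positive
    pm.Normal('likelihood', mu = mu, sd = epsilon, observed = np.log(ted_talk['views']))
    log_trace = pm.sample(2000, tune = 1000, chains = 3)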
Anytime we build Bayesian models, we need to derive a posterior distribution. It contains all the information we are interested in about our parameters, given the data and the model, and obtaining it is chiefly the purpose of feeding in training data and using PyMC3's sampling algorithms.
Coming back to the definition, the posterior distribution represents our updated belief after taking in new evidence. In other words, it combines our entire prior knowledge, represented as the prior distribution, with the new evidence, which is supplied through the observed option of the likelihood. Remember that the overall purpose of conducting Bayesian analysis is accumulating knowledge and updating beliefs. Each time, the posterior statistics characterize our updated belief once new data is available, so the posterior can be reused as the prior for the next round of modeling; thought of as a potentially endless, automatic modeling process, this is called chaining Bayes' rule.
Nowadays we're interested in continuous deployment and integration; if you're familiar with these techniques in big data, this typically resembles leveraging the posterior distribution as the characteristics or configuration of the next deployment. Simply put, posterior statistics help us interpret experimental results by providing rich inference, and they help us move on to the next experiment.
Now we've obtained a trace object, which is the output generated from MCMC sampling using pm.sample().
Its data type is a MultiTrace, a special data type storing the posterior sample values as well as their trajectories.
The first step after sampling the Bayesian regression model is to inspect the posterior probability distributions for each individual parameter in the model given the trace object.
# The easiest way to summarize the posterior distribution of each variable is to use the plot_posterior function in
# ArviZ. This function accepts a PyMC3 trace object as the main argument, with the hdi_prob option
# allowing users to specify the highest posterior density level. In this example I want to set
# it to 0.95 to return the 95% posterior plots of the baseline effect called Intercept,
# the regression effects duration, languages, and comments, and the error term epsilon.
az.plot_posterior(trace, hdi_prob = 0.95)
# Cool! You can see five plots, all of which are called highest posterior density plots.
# The ArviZ package runs a kernel density estimation (check week 4 in course 1) for continuous variables.
# It shows the credible parameter values together with the mean of the distribution, and it shows the 95% highest
# posterior density interval as a thick black line at the bottom of each plot.
# To recap, the posterior density plot is just a typical kdeplot, except that it lets
# us visualize where the posterior samples tend to hover and where you'll hardly see any posterior
# samples.
/Users/michiganboy/opt/anaconda3/lib/python3.8/site-packages/arviz/data/io_pymc3.py:96: FutureWarning: Using `from_pymc3` without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context. warnings.warn(
[Figure: 95% HDI posterior density plots for Intercept, duration, languages, comments, and epsilon]
Based on the graph, we can make posterior-based interpretations such as the following: the duration effect centers around 0.024 (roughly 24,000 extra views per additional minute), the languages effect around 0.066 (roughly 66,000 extra views per additional language), and the comments effect around 0.004 (roughly 4,000 extra views per additional comment), while the intercept is negative and epsilon, the residual standard deviation, sits near 2 million views.
# Question 3: How do we find out the relationships between predictors?
# The second step of posterior diagnostics is to inspect correlations between parameters.
# The ArviZ package allows you to visualize the pairwise correlations among the variables you provide by passing
# a list of variable names to the az.plot_pair function.
az.plot_pair(trace, var_names = ['duration', 'languages', 'comments'])
# Cool! We drew the 3 pair plots together to examine the linear relationships among the three predictor effects.
# Pair plots in ArviZ allow us to see the relationship between each pair of variables.
[Figure: pair plots of the posterior samples of duration, languages, and comments]
The pair plot in the first row shows that the video duration effect and the language effect are positively correlated: posterior draws with a higher language effect also tend to attribute relatively more views to each extra minute of video length. Of course, such correlation does not imply causation; it does not prove that a larger effect of languages on TED talk views causes an increase in the duration effect. In regression, this kind of positive relationship is described as a synergy effect, where a larger duration effect goes together with a higher effect of offering more languages for a TED talk.
The pair plots in the second row show that the video comment effect correlates negatively with both the duration and language effects. It seems that a video that attracts relatively more people through its comments tends to attract relatively fewer people through its length and the languages available. What does that mean? It suggests that the effect of more interaction through comments may be competing with the effects of audience coverage through available languages and of video length on video views. Statistically this is called a moderation effect, where an increased effect of video comments goes together with a lowered, moderated effect of the remaining two variables.
It is noteworthy that strong posterior correlations between any pair of parameters can indicate that the model is more complex than it needs to be (e.g., some variables need to be combined, or some variables need to be transformed). From these graphs we can consider the correlations weak, because the blue dots (posterior samples) are quite dispersed. So we can conclude that we don't need to perform transformations in order to obtain a good posterior fit.
Long story short, these pair plots are great visualizations to identify trends that help explain the interaction between multiple variables.
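To put a number on how weak those correlations are, we could also compute them directly from the trace; a small sketch:

# Correlation matrix of the posterior samples of the three regression effects
samples = np.column_stack([trace['duration'], trace['languages'], trace['comments']])
print(np.corrcoef(samples, rowvar = False))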
# Finally, let's imagine: how many views would a typical 3-language, 15-minute video with 300 comments get?
# We often ask these "what if" scenarios, and one way of approximating the number of views is
# simply to form the linear combination of the regression coefficients with the predictor values of that scenario.
# We can first import the random module.
import random
# There are many ways we could predict TED talk views, and here I use random sampling. For each
# of the variables - Intercept, duration, languages, and comments - I sample 100 values from the trace object
# so that the prediction carries the posterior uncertainty of each parameter.
intercept = random.choices(trace['Intercept'], k = 100)
# And I multiplied the trace values of duration by 15
duration = np.multiply(random.choices(trace['duration'], k = 100), 15)
# the trace values of languages by 3
languages = np.multiply(random.choices(trace['languages'], k = 100), 3)
# and the trace values of comments by 300
comments = np.multiply(random.choices(trace['comments'], k = 100), 300)
# Cool! Now we can visualize how many views this video will get!
sns.kdeplot(intercept + duration + languages + comments)
# From this linear regression formula we find that a TED talk video with 3 languages, 15 minutes of duration and 300 comments
# will most likely attract around 600,000 views, which is less than a third of what an average video receives.
# However, the uncertainty shown in the kdeplot also suggests that such a video could plausibly reach
# 1 million views or fall below 100,000 views, albeit with significantly lower probability given the low density there.
[Figure: kernel density estimate of the predicted views for this hypothetical video]
In this video, we've introduced how to use posterior plots and pair plots to display the posterior distribution using the posterior samples from the trace object. But most importantly, I hope you have a better understanding of how to leverage posterior statistics to vividly interpret modeling results to a variety of audiences, because communication and knowledge delivery remain the ultimate goal for a data scientist.
As you've seen in the last density plot, posterior inference not only helps us understand how strongly each TED talk feature relates to views every time new data comes in, but also helps others learn new knowledge and make predictions. As a practice, I recommend you try plugging in different scenarios (e.g., a 20-minute video that has attracted 1,000 comments) and see how sensitive the predicted views are to each video condition.
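As a starting point for that exercise, here is a hedged sketch of a small helper; predict_views is a hypothetical name, not part of the course code, and it simply repeats the sampling recipe above for any scenario.

def predict_views(duration_min, n_languages, n_comments, draws = 1000):
    # Draw posterior samples and evaluate the linear predictor for the given scenario
    idx = np.random.choice(len(trace['Intercept']), size = draws, replace = False)
    return (trace['Intercept'][idx]
            + trace['duration'][idx] * duration_min
            + trace['languages'][idx] * n_languages
            + trace['comments'][idx] * n_comments)

# For example, a 20-minute talk with 5 languages that has attracted 1,000 comments
sns.kdeplot(predict_views(20, 5, 1000))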
In this video, we'll zoom in on the traceplot, a main diagnostic for MCMC simulations.
A major question frequently asked about regression is: are the data we chose sufficient to precisely identify a complex model involving multiple influences on the response, the TED talk views? Are the model estimates robust to changes in the prior specification or to the influence of particular observations (e.g., the most viewed or least viewed videos)?
# Bayesian inference provides the traceplot as a posterior diagnostic, on top of just
# a traditional best-fitting line that attempts to minimize predictive errors. Using a traceplot we can
# inspect the whole posterior distribution of likely parameter values.
# Let's plot the posterior distribution of our parameters and the individual samples we drew.
pm.traceplot(trace)
# Cool! We get two plots, each showing one or more Markov chains, for each variable.
# So there are 5 * 2 subplots in total.
<ipython-input-12-e5348ecd74d9>:5: DeprecationWarning: The function `traceplot` from PyMC3 is just an alias for `plot_trace` from ArviZ. Please switch to `pymc3.plot_trace` or `arviz.plot_trace`. pm.traceplot(trace)
[Figure: trace plots (posterior densities on the left, sample trajectories on the right) for Intercept, duration, languages, comments, and epsilon]
There are two main visualizations in the traceplot. On the left, we get a kdeplot that looks smooth, with several curves drawn in different line styles. It shows the posterior density given by the posterior samples, that is, the distribution of the posterior for each predictor, including the intercept.
The right-hand side shows the trajectory of the Monte Carlo simulation for each chain. In normal cases, it should look somewhat similar to white noise.
On the right panel, the x-axis shows the index of the samples in each trajectory and the y-axis shows the posterior sample values. We should not see any recognizable pattern, because we want good agreement between these three chains, which indicates that the sampling process went smoothly. Nor do we want to see a curve trending up or down, because we want each chain to converge to and meander around a single value. In this case, the video duration effect meanders around 0.02 and looks quite similar to white noise, so the variable is stable. If we pick a longer video, we are confident in saying we'll gain roughly 20,000 viewers per extra minute of video length; we won't get wildly different regression results like 60,000, 70,000 or even 100,000 just because we should have drawn more samples.
On the left panel, among the traces, we see that all five variables have similar curves across chains. No single trajectory runs away from the other trajectories, which shows that our model convergence is in good shape - although different simulations come out of the sampling process, the prediction remains steady. We don't need to worry that the chains fail to mix, and we can trust the posterior distribution we found.
Although Bayesian models tend to be more robust than many statistical methods, there might still be cases where outliers and influential observations (i.e., abnormal data points that deviate from the main body of the sample) diminish the quality of the posterior diagnostics. By robust, we mean that a model's output predictions remain consistently accurate, including in cases where one or more of the input data points change drastically due to unforeseen circumstances such as measurement errors.
If the traceplot shows no unusual pattern, there's nothing we need to do to make our posterior look stable. But if there is one, I'd recommend you take the centered form of each independent variable and rerun the model: research has found that correlations between regression parameters may be reduced by centering each independent variable.
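A sketch of that centering step (the *_centered column names are hypothetical; the model would then be rebuilt with these columns as predictors):

# Center each predictor at its sample mean
for col in ['duration', 'languages', 'comments']:
    ted_talk[col + '_centered'] = ted_talk[col] - ted_talk[col].mean()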
Long story short, we've seen that a traceplot in a regression analysis provides a visual way to inspect sampling behavior and to assess mixing across chains and convergence. It's critical that we're able to tell whether we need to write a new model or whether the current one is good to keep. Now, let's move on to the next diagnostic.
In this video we'll continue to explore another important diagnostic for regression - the effective sample size.
Ideally, the effective sample size should be close to the actual number of posterior samples. One advantage of the NUTS sampler, in my experience with MCMC sampling, is that its effective sample size is usually much higher than that of most other algorithms, such as Metropolis (which PyMC3 falls back to for discrete variables such as binomial ones). The effective sample size determines how precise the posterior estimates can be. For example, an effective sample size that is too low means the model quality is poor, and there is a well-known warning message that typically points out the problem of low effective sample size during modeling.
The effective sample size is essential to ensure that the model estimates, including posterior statistics like the mode, mean and standard deviation, are stable. If a certain parameter has a low effective sample size, to my knowledge the estimate of that parameter may be unstable. Unstable model estimates can show up in various forms: most of the time the posterior shows a long tail, or long tails on both sides, when the effective sample size is low. Sometimes the situation is even more serious and the posterior distribution looks flat, indicating that our updated belief is no more knowledgeable than before the data arrived, when the effective sample size approaches 0. In these cases, the posterior variance will be much greater than usual.
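As a side note, ArviZ also exposes the effective sample size directly; a small sketch (assuming the az.ess function is available in this ArviZ version):

ess_bulk = az.ess(trace)   # bulk effective sample size for every parameter, as an xarray Dataset
print(ess_bulk)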
# Since the effective sample size of each parameter is important for determining whether the model has converged well,
# we can start by using the pm.summary() function to get the values in tabular form.
effective_sample_size = pm.summary(trace)
# Let's show the table.
effective_sample_size
# From the table, we can look at the ess_bulk value to understand how large the effective sample is for each
# variable. We see that the Intercept, duration, languages, comments, and the constant error term all have
# more than 5,000 effective samples, which is significantly greater
# than 200, the critical threshold that indicates a low effective sample size.
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
Intercept | -1.201 | 0.185 | -1.548 | -0.857 | 0.003 | 0.002 | 5489.0 | 6886.0 | 1.0 |
duration | 0.024 | 0.007 | 0.011 | 0.038 | 0.000 | 0.000 | 6181.0 | 7019.0 | 1.0 |
languages | 0.066 | 0.005 | 0.057 | 0.076 | 0.000 | 0.000 | 6845.0 | 7446.0 | 1.0 |
comments | 0.004 | 0.000 | 0.004 | 0.004 | 0.000 | 0.000 | 8740.0 | 5974.0 | 1.0 |
epsilon | 2.042 | 0.029 | 1.989 | 2.097 | 0.000 | 0.000 | 10549.0 | 7694.0 | 1.0 |
# We can indeed validate this by looking at the correspondence between the posterior plot and the effective sample size.
# We can see such a plot by drawing a forest plot, passing in the trace object and setting the
# ess option to True.
az.plot_forest(trace, ess = True, hdi_prob = 0.95)
# Cool! On the left panel of the forest plot we can see a horizontal representation of the
# 95% highest posterior density interval. On the right panel we can see a blue dot showing the effective sample size
# for each variable. The effective sample size in the thousands for each parameter indicates that
# the estimate of the posterior distribution of each parameter is reliable and the posterior variance is not high.
# As a rule of thumb, Andrew Gelman and Osvaldo Martin suggest that an effective sample size of more than 1,000 is
# sufficient: we no longer need to worry about the tails, and the posterior distribution should be
# informative since its spread won't be unreasonably large.
[Figure: forest plot showing the 95% HDI (left panel) and effective sample size (right panel) for each parameter]
# Let's zoom in and only look at the duration, languages and comments effects. We specify the var_names option
# and provide a list to tell the function to show only the variables we name.
az.plot_forest(trace, ess = True, var_names = ['duration', 'languages', 'comments'], hdi_prob = 0.95)
# From the left panel, we see that the 95% highest posterior density intervals of the three variables
# do not contain 0, which indicates that a longer video, and a video with more languages and more comments, all significantly
# contribute to more TED talk views. The forest plot suggests that these short intervals might be the result of
# the high number of effective samples obtained during sampling.
[Figure: forest plot showing the 95% HDI and effective sample size for duration, languages, and comments]
Cool! As an extra piece of knowledge, PyMC3 will return a warning message if the effective sample size of any variable falls below 200, so when we see that message we might want to examine the model specification and re-run the model to increase the effective samples. As data scientists, we want the effective sample size to be close to the actual number of posterior samples, because we want estimates that are reliable and consistent.
In this video, we've discussed the importance of checking the effective sample size for all regression coefficients. At the end of the day, every parameter in a Bayesian regression model should return a reasonable number of effective samples so that the posterior estimates are reliable to interpret. Let's now move on to the next diagnostic!
In this video, we will dig deeper into a commonly used device for summarizing the characteristics of posterior distributions - the highest posterior density interval. First, in this example, the posterior statistics show the strength of the intercept and the three predictors' effects on TED talk views. A major Bayesian approach to summarizing posterior statistics is to interpret a credible interval of the posterior.
The figure compares the 90% HDI and another credible interval that has 90% mass.
According to Hyndman (1996), a highest posterior density interval is the shortest interval on a posterior density for some given probability level, such as 90%. A highest posterior density interval starts from the posterior mode (where the posterior density is highest) and extends so that the posterior density inside the interval is always at least as large as the density outside of it, until its coverage reaches the specified probability such as 90%. ArviZ has many functions to help us summarize the posterior; for example, az.plot_posterior can be used to generate a plot with the mean and HPD of a distribution. In PyMC3 and ArviZ, the reason the developers expose the hdi_prob option is that they want us to be able to customize the level of uncertainty when using the highest density interval, since different problems have different credible interval requirements. Let's look at some code.
# We can use the plot_posterior function again, and now we try a different highest density interval
# probability, setting it to 0.99.
az.plot_posterior(trace, hdi_prob = 0.99)
# From the graph, we see that the highest posterior density intervals - the intervals of minimal length among all
# credible intervals at the same probability level - all exclude 0.
# We can also see that the highest posterior density interval is not equal-tailed, but always includes the mode(s)
# of the posterior distribution.
[Figure: 99% HDI posterior density plots for Intercept, duration, languages, comments, and epsilon]
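If we want the interval bounds as numbers rather than a plot, a small sketch (assuming az.hdi is available in this ArviZ version):

print(az.hdi(trace, hdi_prob = 0.99))   # lower and upper 99% HDI bounds for each parameter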
# To check whether the highest posterior density intervals of all 5 variables cross 0, we can
# add the ref_val option and set it to 0, which superimposes a vertical reference line on the highest posterior density plots.
az.plot_posterior(trace, hdi_prob = 0.99, ref_val = 0)
# All variables are valuable and effective in predicting TED talk views, and we can see from the orange text that
# we have strong evidence that the posterior estimates indicate a significant regression effect for all 5 variables
# in predicting TED talk views.
[Figure: 99% HDI posterior density plots with a reference value of 0 for Intercept, duration, languages, comments, and epsilon]
We've discussed using a reference value to test whether the regression effects differ from a null effect. We now understand that when we approximate the posterior distributions of variables using a finite number of samples, the highest posterior density plot helps us check whether we have a reasonable distribution. It is, and will continue to be, a fascinating way to study uncertainty with a denser representation of the posterior distribution.
Another common type of issue is that the posterior samples derived from the MCMC algorithm might be problematic: the chains may fail to converge, or the posterior samples may be highly autocorrelated. In this video, we'll look specifically into convergence and autocorrelation as crucial posterior diagnostics.
At a relatively late stage of posterior diagnostics, we often want to validate whether we can wrap up our study with the current model, in this case the Bayesian regression model. In the last video, we checked the effective sample size: if the number of effective samples falls below the critical threshold of 200, we raise an alert and decide to rerun the model in some form. We do that in an attempt to make sure the effective sample size of every parameter is at least 200 (at least in this course) so that the resulting posterior estimates are more reliable. So let's assume we pass the effective sample size check; the next things we need to check are a convergence diagnostic and the autocorrelation plot.
In terms of convergence, remember from the traceplots in the previous video that if the chains have converged well, each trajectory should look similar to the others on the right panel.
# Now let's quantify convergence. We can revisit the pm.summary() function to generate a tabular form of
# posterior statistics.
pm.summary(trace)
# To get the convergence statistic, we're looking at the r_hat column!
# The idea of the R-hat is to compare the variance between chains with the variance within chains.
# Ideally we should expect an R-hat equal to one, which indicates complete convergence among all chains.
# In both the PyMC3 and Stan documentation, the developers flag the mixing of simulations
# as satisfactory below a threshold of 1.05, although different schools of opinion exist on that threshold.
# In this course, we'll adopt the threshold of 1.05, so from now on if we see an R-hat score less than 1.05
# (and in this example the R-hat for every parameter is 1.0) we'll determine that the chains converge
# to a specific, reliable posterior value.
# I personally think this threshold is quite strict, but it's adopted by a majority of Bayesian programming
# developers for practical reasons. In any case, we can now confidently say that the intercept,
# all the predictor effects, and the error term have good convergence, because the
# r_hat value for each of them is 1.0. Cool!
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
Intercept | -1.201 | 0.185 | -1.548 | -0.857 | 0.003 | 0.002 | 5489.0 | 6886.0 | 1.0 |
duration | 0.024 | 0.007 | 0.011 | 0.038 | 0.000 | 0.000 | 6181.0 | 7019.0 | 1.0 |
languages | 0.066 | 0.005 | 0.057 | 0.076 | 0.000 | 0.000 | 6845.0 | 7446.0 | 1.0 |
comments | 0.004 | 0.000 | 0.004 | 0.004 | 0.000 | 0.000 | 8740.0 | 5974.0 | 1.0 |
epsilon | 2.042 | 0.029 | 1.989 | 2.097 | 0.000 | 0.000 | 10549.0 | 7694.0 | 1.0 |
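If you only want the convergence statistic itself, ArviZ exposes it directly; a quick sketch (az.rhat is the same function described in the documentation linked below):

print(az.rhat(trace))   # values close to 1.0 indicate good mixing across the three chains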
# We can also visually inspect the convergence diagnostic R-hat by creating a forest plot and setting the
# r_hat option to True.
az.plot_forest(trace, r_hat = True)
# Again, R-hat values significantly greater than one indicate that the chains have not yet converged. So in
# this case, we double-confirm that all chains converge to reliable estimates for every parameter. If you're
# interested in digging deeper into convergence diagnostics, there's very nice documentation
# maintained by the ArviZ developer team.
# Documentation: https://arviz-devs.github.io/arviz/api/generated/arviz.rhat.html
[Figure: forest plot showing the 94% HDI (left panel) and R-hat (right panel) for each parameter]
Now we will check autocorrelation. So what is autocorrelation? We have a trajectory of posterior samples for each simulation, and autocorrelation describes the extent to which a sample's position in that trajectory contributes to its value, i.e., how strongly each draw depends on the previous ones. Here is an example using the S&P 500 index in the stock market.
The persistent increasing or decreasing patterns of the S&P 500 during the 2019-2021 time frame are a form of autocorrelation: in this graph, the S&P 500 index is shown as a function of the order of the samples, and time clearly carries over into the value of the index.
Image from: https://nextlevel.finance/sp-500-all-time-high/
I can't say exactly in which cases the posterior samples are guaranteed to show high autocorrelation, and it's good to read up and see whether there are recent findings about detecting it. In a practical sense, we should pay attention to adjusting the step size. If the step size we use is too small, we are likely to accept nearly all steps, which leads the chain to display random-walk behaviour; on the other hand, if the step size is too large, we are likely to reject necessary steps, which leads the chain to get stuck. In both cases, the autocorrelation will be high. Knowing this, we should specify at least 1,000 tuning steps in the PyMC3 sampling process, and this is exactly the number of tuning samples automatically assigned in PyMC3.
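As a rough manual preview of what the autocorrelation plot will show, we can compute the lag-1 autocorrelation of one chain ourselves; a small sketch using the MultiTrace's get_values method:

# Lag-1 autocorrelation of the duration samples from the first chain
x = trace.get_values('duration', chains=[0])
print(np.corrcoef(x[:-1], x[1:])[0, 1])   # values near 0 suggest approximately independent draws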
# We can check the autocorrelation among samples by using the plot_autocorr() function in ArviZ. As the function
# name suggests, it will generate multiple autocorrelation plots for the given variables.
# This is useful in particular for posteriors from MCMC samples, which may display correlation.
az.plot_autocorr(trace)
# Cool! Now you can see a bar plot of the autocorrelation function for each trajectory of samples.
# The autocorrelation plot shows the autocorrelation function for each chain, for each variable.
# So there are 5 rows in total for the intercept, the regression effects duration, languages and comments,
# and the error term epsilon, and 3 columns since we set the chains option in pm.sample to 3.
# This makes up a total of 15 plots, but having a graph for each simulation might be too messy if
# we specify a large number of chains because we believe the model is complex.
[Autocorrelation plots for Intercept, duration, languages, comments and epsilon.]
# We can set the combined option to True in order to combine the chains in the same
# autocorrelation bar chart. Let's take a look.
az.plot_autocorr(trace, combined=True)
# Autocorrelation is a quantity ranging from -1 to 1. You can see that all sample autocorrelations shrink
# as the lag grows. Some plots show the autocorrelation starting at around 0.25 to 0.5 but quickly
# decreasing towards 0, which means the posterior samples we collect from the NUTS algorithm tend to
# be approximately independent - an important property, because it suggests we can trust the posterior estimates.
# There is also a max_lag option to set a user-defined maximum lag on the x-axis, but here the default
# maximum lag of 100 is already reasonable and shows the low autocorrelation of all variables.
[Combined autocorrelation plots for Intercept, duration, languages, comments and epsilon.]
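If you prefer a tighter view of the early lags, the max_lag option can be set explicitly; a minimal sketch:
# Same combined autocorrelation plot, but only showing the first 50 lags on the x-axis.
az.plot_autocorr(trace, combined=True, max_lag=50)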
In this video, we've discussed the convergence metric and the autocorrelation plot in order to finally determine whether we can publish the Bayesian regression model and disseminate knowledge derived from the posterior. We found that if the r_hat statistic is close to 1, then all simulations for that variable have converged well, so the posterior estimate is reliable. We also summarized that the autocorrelation should be low in order for the posterior draws to behave like approximately independent samples. See you in the next video!
In this lecture, we will learn how to perform a posterior predictive check after sampling a regression model in PyMC3 and checking the reliability of the model (e.g. effective sample size, autocorrelation and convergence).
According to Andrew Gelman, a posterior predictive check is a process of checking for systematic discrepancies between the actual samples and the simulated posterior samples. The process starts with 1) generating replicated data under the posterior samples of our fitted model (in this case, predicted TED Talk views) and then 2) comparing these replicated data to the observed data used for modeling (the actual TED Talk views in the TED dataset). A posterior predictive check compares the distribution of the predicted TED Talk views (this distribution of predicted samples is conditioned on the actual TED Talk views; we call it the posterior predictive distribution) with the distribution of the actual TED Talk views. In other words, it checks how well the posterior predictive distribution of TED Talk views approximates the distribution of the actual data samples.
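To make step 1 concrete, here is a hand-rolled sketch of generating replicated data under a normal likelihood. It is purely illustrative - the posterior_* arrays and n_draws are hypothetical stand-ins for the posterior samples - and the sample_posterior_predictive call we use below automates exactly this loop.
import numpy as np

rng = np.random.default_rng(0)
replicated_views = []
for i in rng.integers(0, n_draws, size=100):
    # 1) pick one joint posterior draw of the regression parameters,
    mu = (posterior_intercept[i]
          + posterior_duration[i] * ted_talk["duration"]
          + posterior_languages[i] * ted_talk["languages"]
          + posterior_comments[i] * ted_talk["comments"])
    # 2) then simulate one replicated dataset of views from the normal likelihood.
    replicated_views.append(rng.normal(mu, posterior_epsilon[i]))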
Now let's think about it a little more deeply. Why should we compare the samples drawn from the posterior distribution with the actual data samples?
The answer is straightforward -
Number one. We want the predictions from the Bayesian model to reflect the patterns in the actual data as accurately as possible. The better the model reflects the patterns of the actual TED Talk views, the better it can explain the variability - the ups and downs of TED Talk views - that exists in the dataset. So a posterior predictive check can help you quantify how valid your predictions are about the phenomenon we are investigating.
Number two. We want our predictions to accurately forecast future data - to predict the views for upcoming TED talks that are not yet published, based on various factors. For this reason, some Bayesian statisticians recommend splitting the data before fitting a model: keep around 10% to 20% of the actual samples out of the modeling step, and compare their distribution against the posterior predictive distribution during the posterior predictive check to assess model quality (a quick sketch of such a split follows below).
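If you want to follow that advice, a quick way to hold out 20% of the talks before fitting is shown below - a sketch assuming ted_talk is the pandas DataFrame used throughout this notebook.
# Randomly reserve 20% of the rows for the check; fit the regression on the remaining 80%.
holdout = ted_talk.sample(frac=0.2, random_state=123)
training = ted_talk.drop(holdout.index)
print(training.shape[0], holdout.shape[0])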
In course 1, I developed the concept of chaining Bayes' rule, in which I articulated that Bayesian inference allows informative priors as useful knowledge. After a posterior predictive check, we are able to determine whether it is reasonable to carry the posterior results from the current Bayesian model forward as prior knowledge for the next iteration of Bayesian analysis.
Let's start to code!
# The prerequisite of performing a posterior predictive check is a trace object from a Bayesian model,
# which we already obtained from previous lectures.
# To generate samples for the posterior predictive distribution, we can use the sample_posterior_predictive function.
# It's a dedicated PyMC3 function to sample data from the posterior, just like we sometimes sample data from a
# distribution using scipy. Passing the trace object, 100 as the sample size, and the regression model,
# this function will randomly draw 100 samples of TED Talk views (in millions) from the trace in this project.
# We save the samples in a ppc object
ppc = pm.sample_posterior_predictive(
trace, 100, ted_view_regression, var_names=['likelihood', 'Intercept', 'duration', 'languages', 'comments'])
# Each of the 100 samples in the ppc object contains 2550 random draws from a normal distribution
# specified by the posterior mean and standard deviation of TED Talk views for that draw. We can double-check
# by printing the number of TED talks used for modeling.
print(ted_talk.shape[0])
# To take advantage of the labeled coordinates and dimensions for exploratory analysis of the posterior results,
# we have to convert the results to "InferenceData", the format ArviZ works with.
# This can be done with the az.from_pymc3 function: it lets ArviZ pick up the coordinates
# and dimensions from the model context when we pass the trace object from the model along with the
# posterior_predictive option set to the samples in the ppc object.
pred = az.from_pymc3(
trace = trace,
posterior_predictive = ppc)
# ArviZ has a plot_ppc function to plot the posterior predictive distribution against the distribution of
# actual data. Let's set a larger figure size, give the curves higher transparency with alpha = 0.05, and make
# the posterior predictive check graph.
az.plot_ppc(pred, figsize = (10,6), mean = True, alpha = 0.05)
plt.xlim(-5, 30)
# As you can see, the posterior predictive check for the TED Talk views indicates that the Bayesian regression
# model seems to generate data that looks very different from the actual observations. None of the predictions
# captures the big spike around 2 million views. But look at the right hand side, it seems that the posterior
# predictive samples overpredict the plausibility of observing data that ranges from 3 million views to around
# 10 million views, and the same happens at the lower end. This means the Bayesian regression model
# captures only part of what is actually happening in reality.
# Particularly, the predicted TED Talk views shown in blue curves are characterized with much larger variance
# than the variance of actual TED talk views shown in the black curve.
# This commonly occurs when comparing the posterior distribution against the distribution of actual data -
# we call this case overdispersion. In other words, the posterior predictive samples are overdispersed
# against the actual samples.
# The causes of the high posterior variance relative to the actual data can vary. Let's take a look
# back at the actual TED Talk views.
2550
# Let's draw a kernel density estimation plot for the actual TED Talk views.
sns.kdeplot(ted_talk['views'])
# As you can notice, a few talks enjoy a much larger number of views than the average.
# Because of this, one potential cause of the overdispersion in the posterior predictive distribution is the
# outliers in the actual data - they lead the Bayesian regression model to overestimate the variance when sampling.
# I invite you to try modeling the log-transformed TED Talk views instead, and see whether that mitigates
# the overdispersion in the posterior predictive check. Cool!
# According to the posterior predictive samples, the range covering the 95% highest posterior
# density stretches from roughly negative 3 to 5 million views, so the predictions fall short by
# assigning probability to impossible negative view counts.
[Kernel density estimate of the actual TED Talk views (x-axis: views, y-axis: density).]
There are two possible remedies that can rectify incorrect negative predictions in situations like ours. We can:
log-transform TED Talk views: after a log transformation, the transformed values can range from negative infinity to infinity, so a normal likelihood is a natural fit. Using the log-transformed data for modeling, when you conduct a posterior predictive check again, you can exponentiate the results - whatever the value is on the log scale, the back-transformed outcome is always positive (a quick sketch follows after this list).
use a non-negative distribution to specify the likelihood: one pitfall of using a normal distribution for the likelihood is that it places probability on negative values - roughly 95% of predictions fall within 2 standard deviations of the posterior mean, and with a large standard deviation that interval extends below zero. Because of this, it can be good practice to use a different distribution to model, for example, TED Talk views. We can choose 1) a half-normal or 2) a half-Cauchy distribution, both of which only take positive values, for the likelihood function. In this way, the model will always predict a positive number for the variable. PyMC3 offers these distributions with easy-to-use syntax - pm.HalfNormal and pm.HalfCauchy - so you can prevent sampling negative values in the posterior prediction.
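Here is a minimal sketch of the first remedy. The priors and variable names mirror the regression built earlier in this module, but they are written from memory here, so treat this as an illustration rather than a drop-in replacement.
import numpy as np
import pymc3 as pm

# Model the log of the views; exponentiating predictions afterwards guarantees they are positive.
ted_talk["log_views"] = np.log(ted_talk["views"])

with pm.Model() as log_view_regression:
    intercept = pm.Normal("Intercept", mu=0, sigma=10)
    b_duration = pm.Normal("duration", mu=0, sigma=10)
    b_languages = pm.Normal("languages", mu=0, sigma=10)
    b_comments = pm.Normal("comments", mu=0, sigma=10)
    epsilon = pm.HalfCauchy("epsilon", beta=5)
    mu = (intercept
          + b_duration * ted_talk["duration"].values
          + b_languages * ted_talk["languages"].values
          + b_comments * ted_talk["comments"].values)
    pm.Normal("likelihood", mu=mu, sigma=epsilon, observed=ted_talk["log_views"].values)
    log_trace = pm.sample(tune=1000, chains=3)

# A predicted value of, say, 0.5 on the log scale corresponds to np.exp(0.5), about 1.65 million views,
# which can never be negative.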
After the posterior predictive check of TED Talk views against the actual data, it's always nice to make a plot tailored to our use case - checking the predicted effect of the predictors on TED Talk views. Here it would be interesting to plot the predicted relationship between the duration of TED talks, the number of languages available online, and the number of comments on one side and the TED Talk views on the other. This can be done by pushing the parameters through the model, given that we've already drawn posterior predictive samples.
# We can compute the average of the posterior predictive samples by adding the mean value of the intercept
# and the components for duration, languages and comments.
mu_posterior_prediction = (np.mean(ppc["Intercept"])
                           + np.mean(ppc["duration"]) * ted_talk["duration"]
                           + np.mean(ppc["languages"]) * ted_talk["languages"]
                           + np.mean(ppc["comments"]) * ted_talk["comments"]).T
mu_posterior_prediction.mean(0)
1.698486600866732
_, ax = plt.subplots()
# Let's look at one of the regression coefficients - the duration effect.
ax.plot(ted_talk["duration"], ted_talk["views"], "o", ms=4, alpha=0.4, label="Data")
# Let's plot the talk duration on the x-axis against the predicted TED Talk views on the y-axis,
# and use green to show the 95% HDI of the posterior predictive samples.
az.plot_hdi(
ted_talk["duration"],
ppc["likelihood"],
hdi_prob = 0.95,
fill_kwargs={"alpha": 0.8, "color": "#a1dab4", "label": "Outcome 95% HPD"},
figsize = (10, 6),
)
ax.set_xlabel("Duration")
ax.set_ylabel("Predicted Million Ted Talk Views")
ax.set_title("Posterior predictive checks")
ax.legend(ncol=2, fontsize=10);
# In this graph, we see that using only the duration to predict the TED talk views is highly insufficient.
# We can see many points toward the top - those are the "outliers" that our Bayesian regression falls
# short on. But the model captures the vast majority of the data points, as shown by the HDI band in green.
One limitation of conducting a full posterior predictive check for multiple variables is that most humans can comprehend 2-dimensional plots, but not plots in higher dimensions. Still, you've seen that a Bayesian regression model commonly captures the actual data when the values are reasonable: those data points lie close to the posterior predictive interval shown in the graph. The model can't always predict the data points in the tail - those more than about 3 standard deviations away from the average - because it places the same weight on every data point. Statistically, a data point whose presence or absence substantially changes the fitted regression line is called an influential point (that's out of scope for this course). We might cover some of these topics in a more advanced statistics course.
This is the end of the Bayesian regression example. I hope you've gained the techniques of sampling a Bayesian model, plotting regression results, conducting several reliability checks, and finally making posterior predictions. I'll see you in the next video!