Bayesian Logistic Regression - NCAA case study

Video 1: Logistic Regression - Concept

In this video, we're going to learn about Bayesian logistic regression. Bayesian models are gaining much attention as a way to tackle complex modeling problems, which arise constantly in data science.

Question 1: What is Bayesian logistic regression?

Logistic regression is a generalized linear model: it uses the same linear predictor as linear regression, but it regresses for the probability of a categorical outcome.
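In symbols, a minimal form of the model with a single feature $x$ maps the linear predictor to a probability through the logistic function, $p = \frac{1}{1 + e^{-(\alpha + \beta x)}}$, so the output always falls between 0 and 1; Appendix A walks through this inverse link function in detail.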

This time we're going to use Bayesian logistic regression to model binary win/loss outcomes in college basketball games held by the National Collegiate Athletic Association (NCAA), a non-profit organization dedicated to offering high-quality, nationwide college-level sports competitions. So our research topic falls under sports analytics, a rising field in data science. Other examples of Bayesian logistic regression include predicting the probability that an international student scores over 105 points on the TOEFL exam or 7.0 on the IELTS exam, given how many books the candidate has read, how many hours of sleep they get, and how many hours they spend on test preparation. As in the last regression example, we need some features, called independent variables, that provide useful information for predicting the outcome, called the dependent variable. The only big difference is that the outcome is no longer continuous, but a binary result. Now, let's get started.

Question 2: What are odds and log odds, and how do we use log odds to define the decision boundary for logistic regression?

Before coding, we need to learn a term called the log odds. We define the ratio of success to failure as the odds. That is, we divide the success probability by the failure probability (which is 1 minus the success probability) to get the odds. After that, we take the log of the odds to get the log odds. Assume I can get up before 08:00 AM 60% of the time; then the success probability is 0.6, the odds are 0.6 / 0.4 = 1.5, and the log odds are log(1.5) ≈ 0.405.
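A quick check of this conversion in Python:

```python
import numpy as np

p = 0.6                  # probability of getting up before 08:00 AM
odds = p / (1 - p)       # 0.6 / 0.4 = 1.5
log_odds = np.log(odds)  # log(1.5) ~ 0.405
print(odds, log_odds)
```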

Up till now we've looked at how to convert a probability into odds and log odds. Let's take a break here and continue the coding journey in the next video!

Video 2: Bayesian Logistic Regression - Modeling

In the last video we discussed the concept of Bayesian logistic regression. This time we'll dive right into the dataset.

Side Track: NCAA

The NCAA is a collegiate sports organization that hosts amazing sporting events and competitions for college students each year. Some might not have heard of the NCAA, so I picked two readings that are useful for background knowledge.

Read: https://www.ncaa.org/student-athletes/current

Read: https://en.wikipedia.org/wiki/NCAA_Division_I

Every March, the U.S. becomes captivated by the NCAA Men's Basketball Tournament, also known as March Madness, in which 68 teams play a single-elimination tournament to compete for the NCAA men's basketball national championship. Aside from the four play-in games used to determine the 16th seeds, the 64 teams in the main bracket are distributed across 4 regions. Each region has 16 teams seeded 1 to 16, depending on their AP rankings. To predict a match's result, we might want to look at features such as each team's AP ranking and its average points gained and lost throughout the season. On espn.com we can find many cases from previous years in which a lower-seeded underdog beat a higher-seeded team. We want to predict the winners of these NCAA tournament games given these predictors, and find out which factors should guide how we fill out the bracket (the so-called Bracketology).

So our research focuses primarily on predicting the winning probability and deciding who will win each tournament game based on seed, average points lost, and average points gained. Because the result is binary (win or loss, nothing in between), we can set up a Bayesian logistic regression model to measure this probability.

We have a rich dataset that records a total of 1,024 first-round NCAA tournament games spanning more than 20 years of March Madness. Although the data is far from comprehensive (it only contains the match outcomes, team seeds, and scores), we can fit a Bayesian logistic regression with the match outcome as the binary result and see how accurate our predictions can get.

Research Questions:

There are various interesting questions we can answer from our prior knowledge and from the data to gain insight into the NCAA tournament; more specifically, many people would like to make better Bracketology predictions and climb the leaderboard. Below, I list a few frequently asked questions.

Question 3: What kind of feature engineering should we consider before modeling this NCAA data?

From exploring the data, the winning margin in an average game is approximately 10 points, so the score-related columns carry real signal. Before modeling, we can condense them into per-matchup features such as the seed difference and the difference in average point differential (average points gained minus average points lost), as sketched below.
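A minimal feature-engineering sketch along these lines, assuming a hypothetical DataFrame `games` with per-matchup columns `seed_a`, `seed_b`, `avg_gain_a`, `avg_loss_a`, `avg_gain_b`, and `avg_loss_b` (the real dataset's schema may differ):

```python
import pandas as pd

def add_matchup_features(games: pd.DataFrame) -> pd.DataFrame:
    """Derive per-matchup features for the logistic regression model."""
    games = games.copy()
    # Positive seed_diff means team A carries the weaker (higher-numbered) seed.
    games["seed_diff"] = games["seed_a"] - games["seed_b"]
    # Difference in average point differential between the two teams.
    games["point_diff"] = (games["avg_gain_a"] - games["avg_loss_a"]) - (
        games["avg_gain_b"] - games["avg_loss_b"]
    )
    return games
```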

Question 4: How could we create a Bayesian logistic regression model using PyMC3?

That was tough: the sampling process took about 7 minutes, which is much longer than the linear regression methods. That's because logistic regression is more computationally complex: there is a link function underlying the model, which distinguishes it from the linear model, where no link function is needed.
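For reference, here is a minimal sketch of such a model, assuming the hypothetical `seed_diff`, `point_diff`, and binary `win` columns from the earlier feature-engineering sketch:

```python
import pymc3 as pm

with pm.Model() as ncaa_model:
    # Weakly informative priors on the intercept and coefficients
    alpha = pm.Normal("alpha", mu=0, sd=10)
    beta_seed = pm.Normal("beta_seed", mu=0, sd=10)
    beta_points = pm.Normal("beta_points", mu=0, sd=10)

    # Linear predictor on the log-odds scale
    log_odds = (alpha
                + beta_seed * games["seed_diff"].values
                + beta_points * games["point_diff"].values)

    # Inverse link: the logistic (sigmoid) function maps log odds to probability
    theta = pm.Deterministic("theta", pm.math.sigmoid(log_odds))

    # Bernoulli likelihood for the binary win/loss outcome
    y = pm.Bernoulli("y", p=theta, observed=games["win"].values)

    trace = pm.sample(2000, tune=1000)
```

The logistic inverse link inside the likelihood is exactly what makes this sampling slower than the linear case.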

Question 5: Where is the decision boundary located for logistic regression?

Logistic regression conventionally sets the decision boundary at probability 0.5. Values larger than this critical value (p = 0.5) lead the model to predict success, and smaller values lead it to predict failure. Now let's create a plot from the trace statistics to predict winning probability.
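A minimal sketch of such a plot, using the posterior means from the hypothetical `trace` above and sweeping the seed-difference feature while holding the point-difference feature at zero:

```python
import numpy as np
import matplotlib.pyplot as plt

seed_diff = np.linspace(-15, 15, 100)

# Posterior-mean coefficients from the sampled trace
alpha_m = trace["alpha"].mean()
beta_seed_m = trace["beta_seed"].mean()

# Transform log odds to probability with the logistic function
log_odds = alpha_m + beta_seed_m * seed_diff
prob_win = 1 / (1 + np.exp(-log_odds))

plt.plot(seed_diff, prob_win)
plt.axhline(0.5, linestyle="--", color="gray")  # decision boundary at p = 0.5
plt.xlabel("seed difference")
plt.ylabel("predicted winning probability")
plt.show()
```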

Awesome! In this video, we've established a Bayesian logistic regression model. We've diagnosed the sampler using the traceplot and the autocorrelation plot, and plotted how the winning probability changes as we increase or decrease certain variables. One good takeaway is that after obtaining the trace object, you should ensure that the chains have mixed well and converged before interpreting the results.
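A minimal sketch of those diagnostics, assuming the `trace` object from the model above:

```python
import pymc3 as pm

pm.traceplot(trace)      # chains should look stationary and well mixed
pm.autocorrplot(trace)   # autocorrelation should decay quickly across lags
pm.summary(trace)        # effective sample sizes and R-hat values near 1
```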

In the next video, we'll continue exploring the predictive power of Bayesian logistic regression.

Video 3: Bayesian Logistic Regression - Classification Prediction

In this video, we'll explore the predictive ability of Bayesian logistic regression on NCAA games. Starting from the dataset to which we've added the predicted log odds, we can find out how accurately our Bayesian logistic regression model predicts the outcomes of NCAA games.

Question 6: What decision rule does Bayesian logistic regression implement?

In logistic regression, we make decisions with a very straightforward rule: if the predicted log odds are greater than 0, predict a win; otherwise, predict a loss.

Note that a log odds of 0 corresponds to a probability of 0.5, which you can verify simply by looking back at video 1.
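A minimal sketch of this rule, assuming a hypothetical `log_odds` column of model predictions in the `games` DataFrame:

```python
# Predict a win (1) when the log odds exceed 0, a loss (0) otherwise.
games["predicted_win"] = (games["log_odds"] > 0).astype(int)
```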

Question 7: What are the typical metrics that Bayesian logistic regression uses for prediction?

In terms of metrics, let's define True Positive, True Negative, False Positive, and False Negative based on the decision boundary of log_odds = 0. If you haven't heard of these four terms, the reading below may help.

There is a Wikipedia article if you are interested: https://en.wikipedia.org/wiki/Precision_and_recall

Based on the true positives, true negatives, false positives, and false negatives, we have enough information to assess the quality of the model's predictions. If we care more about predicting a winning team accurately, and want to avoid classifications where we bet on a team winning a game but it ends up losing, then we'd like to compute the precision. It compares the number of true positives against all the positive predictions we've made.

$\text{precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$

But if we're more concerned with minimizing the cases where we predict a loss but the team in fact wins, then we'd like to compute the recall. It compares the number of true positives against all the predictions that should have been positive.

$\text{recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$

Finally, for classification, the F1 score balances precision and recall so that both concerns are treated as equally important: it is 2 times the product of precision and recall, divided by the sum of precision and recall.

$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
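A minimal sketch computing these metrics by hand, using small hypothetical arrays for the actual outcomes and the predictions from the log-odds rule:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])  # hypothetical actual outcomes
y_pred = np.array([1, 0, 0, 1, 1, 1])  # hypothetical model predictions

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

precision = tp / (tp + fp)                          # 3 / 4 = 0.75
recall = tp / (tp + fn)                             # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75
print(precision, recall, f1)
```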

There are still many aspects of Bayesian regression research we could explore within these wonderful topics. So far we've discussed linear regression for TED talk prediction and logistic regression for NCAA winning probability and result prediction, in both cases by looking at the posterior distribution. On the prediction side, we've used metrics such as precision, recall, and the F1 score; no single metric is absolutely more useful than the others. There are still many Bayesian methods to explore in the future. Alright, let's move on to hypothesis testing in the next video! See you!

Appendix A: Concept of Logistic Regression

There are many ways to select an appropriate decision boundary, a few of which were covered earlier in this course. Now let's take a look at the concept of an inverse link function.

The inverse link function is applied to the output of the linear regression equation. Think back to our linear regression equation, $y = \alpha + \beta x$.

So x is our input variable here. All we're doing now is applying a function f to that output y and fitting the result to the observed data. Here f is called the inverse link function. The term "inverse" refers to the fact that this function is applied to the right-hand side of the equation. Under this view, in plain linear regression the inverse link function is simply the identity function.

In the case of the linear regression model, the value y at any point x is modeled as the mean of a normal distribution centered at the predicted value. The error, computed as the difference between the true value and the estimated value, is modeled using a normal distribution, which we usually parameterize by its standard deviation. Now think about scenarios where the error is not appropriately modeled using a Gaussian distribution.

A classification problem is a perfect example of a scenario where the error in predicting output classes or categories is not modeled well by a Gaussian (normal) distribution. As a result, we would like to convert the output $y = \alpha + \beta x$ that we get in a linear regression problem to some other range of values more appropriate to the problem being modeled, which is exactly what the link function does.

Now let's go ahead and define the logistic function.

Think back to our first course, where you visualized this function using the scipy.special library. As you can see, the logistic function takes the x-axis value and maps it using one over one plus the exponential of negative x. This is also called the sigmoid function, and the plot shows the key property of the logistic function: it takes any value of x and maps it to a value between 0 and 1. Essentially, the output is now restricted to the range (0, 1).
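A minimal sketch of that visualization using `scipy.special.expit`, which implements $\frac{1}{1 + e^{-x}}$:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit  # expit(x) = 1 / (1 + exp(-x))

x = np.linspace(-6, 6, 200)
plt.plot(x, expit(x))
plt.axhline(0.5, linestyle="--", color="gray")  # midpoint at x = 0
plt.xlabel("x (log odds)")
plt.ylabel("probability")
plt.title("The logistic (sigmoid) function")
plt.show()
```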

Appendix B: IELTS Prediction Problem

Logistic regression problem:

Research method: define achieving an overall IELTS score >= 7 as a "high score", and model the probability of that outcome.

Now theta is not going to be generated from a beta distribution; instead, it is going to be defined by a linear model with the logistic function as the inverse link:

log(high-score / (1 - high-score)) = initial-effect + effect-study x study + effect-sleep x sleep + effect-book x book + error term

The logarithm is the inverse of the exponential function, which has e (approximately 2.718) as its base. For example, log(exp(1)) = 1 and exp(log(1)) = 1.

Recap of linear regression: y = initial-effect + effect-study x study + effect-sleep x sleep + effect-book x book + error term (homoscedastic error)

Interpretation 1: The bigger the log(odds) of achieving a high score, the higher the probability that the candidate achieves a high score.

At this point, we can't directly tell how large the probability is, but we can transform the log odds into probability.

If log(odds) = 0 -> log(p / (1 - p)) = 0 -> exp(log(p / (1 - p))) = exp(0) -> p / (1 - p) = 1. What's the corresponding probability?

p / (1 - p) = 1 -> p = 1 - p -> 2p = 1 -> p = 1/2, so log(odds) = 0 is the decision boundary.

If our model result has log(odds) > 0, we'll predict a success. If our model result has log(odds) < 0, we'll predict the outcome as a failure.

Transformation:

We can obtain log odds directly from regression.

log(high-score / (1 - high-score)) = initial-effect + effect-study x study + effect-sleep x sleep + effect-book x book + error term

exp(log(high-score / (1 - high-score))) = exp(initial-effect + effect-study x study + effect-sleep x sleep + effect-book x book + error term)

odds = high-score / (1 - high-score) = exp(initial-effect + effect-study x study + effect-sleep x sleep + effect-book x book + error term)

Let exp(...) = a, so high-score / (1 - high-score) = a (the odds).

high-score = a x (1 - high-score) = a - a x high-score

high-score x (1 + a) = a -> high-score = a / (1 + a) = exp(initial-effect + effect-study x study + effect-sleep x sleep + effect-book x book + error term) / (1 + exp(initial-effect + effect-study x study + effect-sleep x sleep + effect-book x book + error term))

Probability = exp(something) / (1 + exp(something)), so the probability must be some value between 0 and 1. Dividing the numerator and denominator by exp(something) gives Probability = 1 / (1 + exp(-something)).

Probability = exp(log(odds)) / (1 + exp(log(odds))) = odds / (1 + odds)

From the regression result, we can transform log odds into probability by applying the logistic function: probability = 1 / (1 + exp(-log(odds))).
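A minimal sketch of this transformation:

```python
import numpy as np

def log_odds_to_probability(log_odds):
    """Apply the logistic (inverse link) function to recover a probability."""
    return 1 / (1 + np.exp(-log_odds))

print(log_odds_to_probability(0.0))  # 0.5, the decision boundary
print(log_odds_to_probability(2.0))  # ~0.88, strongly favoring a high score
```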
