In this lecture, we are going to discuss some applications of multi-armed bandit problems and how they tackle the explore-exploit dilemma, which is our topic for the next few weeks. Honestly, the math involved in solving the explore-exploit dilemma can look scary. But among all the topics I have taught so far, I confidently consider this one of the most practical sets of algorithms you'll ever learn. Later in this course, we'll learn how to write programs that automate Bayesian experimentation.
The concept of comparing things to see which one is better can be applied to almost any business. Imagine you just created a brand new iPhone and you ask your designers to make a couple of advertisements. Your designers show you 3 ads, and you love all of them because the designers are ingenious. So, which should you choose? Naturally, you should choose the advertisement that you believe is most likely to lead a user to click and purchase.
One way to quantify this decision is to measure the click-through rate of each advertisement: the number of user clicks divided by the total number of impressions. For example, if I show an advertisement to 10,000 users and 1,000 of those users click and purchase the new iPhone, then the click-through rate is 10%.
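Just to make that arithmetic explicit, here it is as a tiny Python function (the function name is my own, for illustration only):

```python
def click_through_rate(clicks, impressions):
    """Click-through rate: fraction of impressions that led to a click."""
    return clicks / impressions

print(click_through_rate(1000, 10000))  # 0.1, i.e. 10%
```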
Now the question is: how do we measure click-through rates in practice? We can run experiments! We can show the first advertisement to 10,000 users, the second to another 10,000 users, and the third to yet another 10,000 users. Then we compare the click-through rates of the three advertisements and find out which one attracts the most users. But there is a problem.
How do I know whether 10,000 users are enough, or too many? In fact, I don't! Why not show these advertisements to 1 million users? Why not show them just 100 times? Theoretically, we would need an infinite number of samples to get an absolutely precise estimate of the click-through rate. As we collect more samples, the distribution of our estimate shrinks toward the true value, so the estimate becomes more precise.
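To see that shrinking effect concretely, here is a small simulation sketch (assuming NumPy; the "true" rate of 10% and the sample sizes are invented for illustration). We estimate a known click-through rate many times from different numbers of impressions and watch the spread of the estimates narrow:

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = 0.10  # assumed "true" click-through rate for the simulation

for n in [100, 10_000, 1_000_000]:
    # Estimate the CTR 1,000 times, each time from n simulated impressions.
    estimates = rng.binomial(n, true_ctr, size=1000) / n
    print(f"n={n:>9,}: spread of estimates (std) = {estimates.std():.5f}")
```

The standard deviation of the estimate drops by a factor of roughly 10 each time we multiply the sample size by 100, but it never reaches exactly zero for any finite sample.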
If one advertisement is better, another advertisement is necessarily worse. The better advertisement generates more profit, while the worse advertisement yields less; if I show the worst advertisement 10,000 times, I've wasted 10,000 impressions on a suboptimal click-through rate. Remember from the beginning that our goal is to show only the best ad, to exploit the highest click-through rate.
But there is a problem. If we don't know which ad yields the best profit in the first place, we can't confidently bet on a single advertisement. In other words, we need to keep exploring all the options until we're sure which one is best. And to find the best option, we must learn the consequences of our actions through trial and error, rather than being explicitly told the correct action.
This is the explore-exploit dilemma.
Over the next week or two, we'll look at how to use the Gaussian distribution and the Bernoulli distribution to optimize our strategy for making choices. If you recall, the Bernoulli distribution models binary outcomes, such as whether a user clicks an ad or not, while the Gaussian distribution models continuous outcomes.
Building on these concepts, and with years of computational advancement, people have developed and tested strategies such as the epsilon-greedy method, the upper confidence bound method, and the Thompson sampling method, all of which are useful and all of which we'll look into in the following tutorials.
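As a preview, here is a minimal epsilon-greedy sketch for our three-ad setting (the 10% exploration rate and the per-ad click probabilities are assumptions for illustration; we'll treat each method properly in the coming tutorials). Most of the time it shows the ad with the best observed click-through rate, but occasionally it picks an ad at random to keep exploring:

```python
import numpy as np

rng = np.random.default_rng(1)
true_ctrs = [0.05, 0.08, 0.12]   # hidden click probabilities, one per ad
clicks = np.zeros(3)             # clicks observed for each ad
shows = np.zeros(3)              # impressions served to each ad
epsilon = 0.1                    # fraction of the time we explore at random

for _ in range(10_000):
    if rng.random() < epsilon or shows.min() == 0:
        ad = rng.integers(3)                 # explore: pick a random ad
    else:
        ad = int(np.argmax(clicks / shows))  # exploit: best ad so far
    shows[ad] += 1
    clicks[ad] += rng.random() < true_ctrs[ad]  # simulate a click

print("impressions per ad:", shows)  # most typically go to the best ad
```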
Now, obviously, Apple is not the only company in the world selling things; online advertising alone is a billion-dollar industry. So the techniques you're learning to solve the explore-exploit dilemma transfer to almost all modern businesses. However, you don't have to own a successful business to use them.
You can simply be someone who creates and owns a website. If you have two or more service options to choose from for the website, you can apply these techniques to pick the best one. In educational settings, if you're offering a new course as I do, and you're not sure whether multiple-choice questions or coding problems better help interested novice programmers learn programming, you can test the learning process and learning outcomes across the different types of assessment using some of these techniques.
In Course 2, we explored Bayesian A/B testing, where we could determine whether a new web design works better by performing a controlled experiment. In A/B testing, we randomly assign each incoming web user to one of two groups: Group A or Group B. This random assignment continues until the web developer becomes convinced that either Design A is more successful than Design B, or Design B is more successful than Design A. After that, the web developer assigns all future users to the more successful design and eliminates the inferior one. The Bayesian A/B approach has been successful in the past and will continue to thrive in many contexts. So why should we teach the bandit algorithms?
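As a refresher, here is one common way to run that Bayesian comparison (a minimal sketch assuming NumPy and Beta(1, 1) priors over each design's success rate; the observed counts are invented for illustration). We sample from each posterior and estimate the probability that Design B beats Design A:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data: (successes, failures) observed for each design.
a_success, a_failure = 120, 880
b_success, b_failure = 150, 850

# Under a uniform Beta(1, 1) prior, the posterior for each design's
# success rate is Beta(1 + successes, 1 + failures).
samples_a = rng.beta(1 + a_success, 1 + a_failure, size=100_000)
samples_b = rng.beta(1 + b_success, 1 + b_failure, size=100_000)

print("P(B beats A):", (samples_b > samples_a).mean())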
Let's look back at the A/B testing framework. A/B testing consists of two phases: a pure exploration phase, in which incoming users are randomly assigned to Design A or Design B with equal probability, followed by a pure exploitation phase, in which every future user is sent to the winning design.
So what are the limitations of A/B testing? During the exploration phase, we assign users to each design with equal probability, so we knowingly spend a large share of impressions on the inferior design; and the jump from pure exploration to pure exploitation is abrupt, with no middle ground in between.
Given these drawbacks, bandit algorithms focus our resources during exploration on the more successful options, instead of over-exploring the inferior ones. By recording all past performance over time, these algorithms gradually concentrate on the best available option, and a good bandit algorithm will ultimately converge to it.
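To make that gradual concentration concrete, here is a minimal Thompson sampling sketch, one of the bandit algorithms we'll cover (assuming NumPy; the click probabilities are again invented). Each round we draw a plausible rate for every ad from its Beta posterior and show the ad with the highest draw, so impressions drift toward the best option as evidence accumulates:

```python
import numpy as np

rng = np.random.default_rng(3)
true_ctrs = np.array([0.05, 0.08, 0.12])  # hidden click probabilities
successes = np.ones(3)  # Beta posterior parameters, starting at Beta(1, 1)
failures = np.ones(3)
shows = np.zeros(3)

for _ in range(10_000):
    draws = rng.beta(successes, failures)  # one plausible CTR per ad
    ad = int(np.argmax(draws))             # show the ad that looks best now
    clicked = rng.random() < true_ctrs[ad]
    successes[ad] += clicked
    failures[ad] += 1 - clicked
    shows[ad] += 1

print("impressions per ad:", shows)  # concentrates on the best ad over time
```

Notice there is no separate "experiment over" moment: exploration fades out on its own as the posteriors sharpen.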
Taking A/B testing and bandit problems together, the framework for maneuvering between exploration and exploitation will be useful whichever method you choose, because bandit algorithms subsume A/B testing as a special case (namely, the extreme case in which you jump from pure exploration to pure exploitation). I hope you now have a sense that understanding bandit algorithms will empower you to operate in the much larger and more flexible space between those two extremes, and will widen your lens for addressing the explore-exploit tradeoff in real-life problems. To see how bandit algorithms achieve that balance, let's get started!