Introduction

This KNN tutorial walks through the technical side of KNN: 1) training and test error as a function of the choice of K, 2) cross-validation in KNN, 3) the validation process.

This time, I will demonstrate how to 1) generate i.i.d. Gaussian data (synthetic data) and set random seeds, 2) make visualizations, 3) test KNN for different values of K, and 4) build train and test sets and perform cross-validation.

Step 1: Generating i.i.d. Gaussian Data

For classification purposes, we generate two classes of data from Gaussian (normal) distributions with different means. We first generate training data, which we will use to choose K for K-nearest neighbors. We also generate test data, which we are not allowed to look at during training. Both the training and test sets should be handled with care.

# install.packages("mvtnorm") for multivariate normal distributions
# install.packages("flexclust"), a package originally for flexible cluster algorithms
library(mvtnorm)
## Warning: package 'mvtnorm' was built under R version 3.4.3
library(flexclust)
## Warning: package 'flexclust' was built under R version 3.4.4
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
library(class)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.2
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.3
## Warning: package 'purrr' was built under R version 3.4.2
## Warning: package 'dplyr' was built under R version 3.4.2
## Warning: package 'stringr' was built under R version 3.4.3
## Warning: package 'forcats' was built under R version 3.4.3
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(stringr)

Now we generate the synthetic data for both the training and test sets. We use a Gaussian mixture model to generate the data: each of the two classes is drawn from its own mixture, with a different mean.

# Use the same seed for the mixture means in both the training and test sets
mean_seed <- 238

# Draw train and test data
# Gaussian mixture model
GMM2d_distribution <- function(n_neg, n_pos, mean_seed=NA, data_seed=NA){
    # 2 class gaussian mixture model distribution
    # grand class means
    grand_mu_pos <- c(1, 0)
    grand_mu_neg <- c(0, 1)
    
    # If mean_seed is provided, sample the means from the mean seed for reproducibility
    if(!is.na(mean_seed)){
        set.seed(mean_seed)
        
        # Sample class means
        mu_pos <- rmvnorm(n=10, mean=grand_mu_pos, sigma=diag(2))
        mu_neg <- rmvnorm(n=10, mean=grand_mu_neg, sigma=diag(2))
        
        # If the data seed is not set then remove the seed
        if(is.na(data_seed)){
            rm(.Random.seed, envir=globalenv())
        }
    } else{ # If mean_seed is not provided
        # Sample class means
        mu_pos <- rmvnorm(n=10, mean=grand_mu_pos, sigma=diag(2))
        mu_neg <- rmvnorm(n=10, mean=grand_mu_neg, sigma=diag(2))
    }
    
    # If the data seed is provided, then set the data seed
    if(!is.na(data_seed)){
        set.seed(data_seed)
    }
    
    # Use sample function to pick which means to sample from
    m_index_pos <- sample.int(10, n_pos, replace = TRUE)
    m_index_neg <- sample.int(10, n_neg, replace = TRUE)
    
    # Sample data from each class
    X_pos <- map(1:n_pos,function(i) rmvnorm(n=1, mean=mu_pos[m_index_pos[i], ], sigma=diag(2)/5)) %>% 
        unlist %>% 
        matrix(ncol=n_pos) %>% 
        t %>%
        as_tibble() %>%
        mutate(y=1) 
    
    
    X_neg <- map(1:n_neg,function(i) rmvnorm(n=1, mean=mu_neg[m_index_neg[i], ], sigma=diag(2)/5)) %>% 
        unlist %>% 
        matrix(ncol=n_neg) %>% 
        t %>%
        as_tibble() %>%
        mutate(y=-1) 
    
    # Set column names
    colnames(X_pos) <- gsub("V", 'x', colnames(X_pos))
    colnames(X_neg) <- gsub("V", 'x', colnames(X_neg))
    
    # Put data into one data frame
    data <- rbind(X_pos, X_neg)%>% 
        mutate(y =factor(y)) # Finally, class label should be a factor
    
    data
    
}
# We create a small training set: 51 observations from the positive class and 50 from the negative class
data <- GMM2d_distribution(n_neg=50, n_pos=51, mean_seed=mean_seed, data_seed=1232)
# We also create a larger test set for validation
test_data <- GMM2d_distribution(n_neg=100, n_pos=100, mean_seed=mean_seed, data_seed=52345)
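As a quick sanity check (an added illustration, not part of the original output), we can confirm the sizes and class balance of the two data sets:

# Quick sanity check: dimensions and class balance
dim(data)        # expect 101 rows (51 positive + 50 negative)
table(data$y)
dim(test_data)   # expect 200 rows (100 per class)
table(test_data$y)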

Step 2: Make Visualizations

# Function that generates two-class Gaussian data for visualization
two_class_gaussian_plot <- function(n_pos, n_neg, mu_pos, mu_neg, sigma_pos, sigma_neg, seed=NA){
    
    if(!is.na(seed)){
        set.seed(seed)
    }
    
    # Generate data from negative class
    class_neg <- rmvnorm(n=n_neg, mean=mu_neg, sigma=sigma_neg) %>% # rmvnorm comes from mvtnorm
        as_tibble() %>%
        mutate(y=-1) %>%
        rename(x1=V1, x2=V2)
    
    # Generate data from positive class
    class_pos <- rmvnorm(n=n_pos, mean=mu_pos, sigma=sigma_pos) %>% 
        as_tibble() %>%
        mutate(y=1) %>%
        rename(x1=V1, x2=V2)
    
    # Put data into one data frame
    data <- rbind(class_pos, class_neg)%>% 
        mutate(y =factor(y)) # class label should be a factor
    
    data
}

data1 <- two_class_gaussian_plot(n_pos=50, n_neg=51,
                                             mu_pos=c(1,0), mu_neg=c(-1,0),
                                             sigma_pos=diag(2), sigma_neg=diag(2),
                                             seed=100)
 
test_data1 <- two_class_gaussian_plot(n_pos=100, n_neg=100,
                                             mu_pos=c(1,0), mu_neg=c(-1,0),
                                             sigma_pos=diag(2), sigma_neg=diag(2),
                                             seed=3240)

# Plot the training data
ggplot()+
    geom_point(data=data1, aes(x=x1, y=x2, color=y, shape=y)) +
    theme(panel.background = element_blank()) +
    ggtitle('training data')
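For comparison, the mixture training data from Step 1 (the data actually used for KNN below) can be visualized the same way. This plot is an added illustration and was not part of the original output.

# Plot the Gaussian-mixture training data generated in Step 1
ggplot()+
    geom_point(data=data, aes(x=x1, y=x2, color=y, shape=y)) +
    theme(panel.background = element_blank()) +
    ggtitle('GMM training data')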

Step 3: Test KNN for Different K

We typically start with K = 1, then K = 5, and so on.
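The helper get_knn_error_rates() used later in this section is not defined here. The sketch below shows one way it could be written with class::knn(), assuming the column layout (x1, x2, y) produced by GMM2d_distribution() above; the original helper may differ in its details.

# Sketch of a helper that returns the train/test error rates for a given k
get_knn_error_rates <- function(train_data, test_data, k){
    train_X <- train_data %>% select(-y)
    test_X <- test_data %>% select(-y)
    
    # Predict the training data with itself as the reference set
    train_pred <- knn(train=train_X, test=train_X, cl=train_data$y, k=k)
    # Predict the held-out test data
    test_pred <- knn(train=train_X, test=test_X, cl=train_data$y, k=k)
    
    # Misclassification rates
    list(tr=mean(train_pred != train_data$y),
         tst=mean(test_pred != test_data$y))
}

# Example: error rates for K = 5
get_knn_error_rates(data, test_data, k=5)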

According to the code above, the training error rate for K = 5 is about 0.0495 and the test error is 0.04. In this particular case the test error happens to be slightly lower, but it is almost always true that a statistical algorithm (KNN here, once the classifier is trained) performs better on the data it was trained on than on an independent test set; that optimism is exactly the problem of overfitting.

KNN for different values of K

Now let’s look at the predictions resulting from KNN for different values of K. First we show what the predictions will be at every point in the plane (or really every point in our test grid).

k_values <- c(1, 3, 5, 9, 15, 31, 51, 101)
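The code that produced the prediction plots is not shown in this section. The sketch below (an added assumption, using expand.grid() and class::knn()) illustrates one way to color every point of a grid by its predicted class.

# Build a grid of points covering the training data
test_grid <- expand.grid(
    x1=seq(min(data$x1) - 0.5, max(data$x1) + 0.5, length.out=100),
    x2=seq(min(data$x2) - 0.5, max(data$x2) + 0.5, length.out=100)
) %>%
    as_tibble()

# Predict the class of every grid point and plot the decision regions
plot_knn_predictions <- function(train_data, grid, k){
    grid_pred <- knn(train=select(train_data, x1, x2),
                     test=grid,
                     cl=train_data$y, k=k)
    
    ggplot() +
        geom_point(data=mutate(grid, y=grid_pred),
                   aes(x=x1, y=x2, color=y), alpha=0.2, size=0.5) +
        geom_point(data=train_data, aes(x=x1, y=x2, color=y, shape=y)) +
        theme(panel.background = element_blank()) +
        ggtitle(paste('KNN predictions, k =', k))
}

# Example: decision regions for k = 1
plot_knn_predictions(data, test_grid, k=1)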

Some observations to note:

\(\bullet\) The predictions seem to get simpler in some sense as k gets larger (for example, the first plot, k = 1, has the most complicated-looking predictions).

\(\bullet\) The last plot is the simplest: every point is classified to the same class. For this plot, k = 101 = total number of training points.

\(\bullet\) The training error goes up as k goes up.

\(\bullet\) The test error goes down then up as k goes up. Let’s dig into the test/training error a little more.

Test/train error as a function of K

Let’s compute the training and test error for a range of values of k.

# Specify the values of K to use (note that 9 appears twice: once explicitly and once in the sequence)
k_values <- c(3, 7, 9, seq(from=1, to=101, by=4))
# Number of k values to check
num_k <- length(k_values)
# Initialize data frame to save error rates in
error_df <- tibble(k=rep(0, num_k),
                    tr=rep(0, num_k),
                    tst=rep(0, num_k))
# Evaluate knn for a bunch of values of k
for(i in 1:num_k){
    
    # Fix k for this loop iteration
    k <- k_values[i]
    # Compute the train/test errors for knn
    errs <- get_knn_error_rates(data, test_data, k)
    # Store values in the data frame
    error_df[i, 'k'] <- k
    error_df[i, 'tr'] <- errs[['tr']]
    error_df[i, 'tst'] <- errs[['tst']]
}

error_df
## # A tibble: 29 x 3
##        k     tr    tst
##    <dbl>  <dbl>  <dbl>
##  1    3. 0.0594 0.0450
##  2    7. 0.0495 0.0400
##  3    9. 0.0693 0.0200
##  4    1. 0.     0.0650
##  5    5. 0.0495 0.0400
##  6    9. 0.0693 0.0200
##  7   13. 0.0792 0.0350
##  8   17. 0.0891 0.0350
##  9   21. 0.0693 0.0450
## 10   25. 0.0891 0.0450
## # ... with 19 more rows

The tibble above shows how the training and test errors change as K varies. Next, we plot the train/test error as a function of k.

The y axis shows the error rate (percentage of misclassified points) and the x-axis is k (number of neighbors). The error rates for both the training data and the test data are shown.

# Note the use of gather!
error_df %>% 
    gather(key='type', value='error', tr, tst) %>% 
    ggplot() +
    geom_point(aes(x=k, y=error, color=type, shape=type)) +
    geom_line(aes(x=k, y=error, color=type, linetype=type))
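In newer versions of tidyr (1.0.0 and later), gather() has been superseded by pivot_longer(); an equivalent call would be:

error_df %>% 
    pivot_longer(cols=c(tr, tst), names_to='type', values_to='error') %>% 
    ggplot() +
    geom_point(aes(x=k, y=error, color=type, shape=type)) +
    geom_line(aes(x=k, y=error, color=type, linetype=type))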

A few observations to make about this plot:

\(\bullet\) 1) The training error is an increasing function of k (although not perfectly monotone).

\(\bullet\) 2) The test error is roughly U-shaped: it goes down and then back up as k grows.

\(\bullet\) 3) k = 1 gives 0 training error.

\(\bullet\) 4) k = 101 (= number of training points) has about a 50% error rate.

The main takeaway, however, is that the k with the best training error is not the k with the best test error. Looking at the graph, can you tell which k minimizes the training error and which minimizes the test error? If not, the code below finds them:

# Minimum training error (k = 1, as expected from the discussion above)
error_df %>% 
    filter(tr==min(tr))
## # A tibble: 1 x 3
##       k    tr    tst
##   <dbl> <dbl>  <dbl>
## 1    1.    0. 0.0650
# Minimum test error (two identical rows appear because k = 9 occurs twice in k_values)
error_df %>% 
    filter(tst==min(tst))
## # A tibble: 2 x 3
##       k     tr    tst
##   <dbl>  <dbl>  <dbl>
## 1    9. 0.0693 0.0200
## 2    9. 0.0693 0.0200

Some Notes:

\(\bullet\) When the training error keeps improving but the test error gets worse, we are overfitting. In reality, for predictive modeling we don’t know the true distribution and don’t have access to the “test” data we really care about (at least not at the time we train the model).

\(\bullet\) Think about it: we do not actually know when we are overfitting the KNN model and losing performance. If we naively look only at the training error, we will fool ourselves and pick a sub-optimal value of k. So what can we do to remedy this situation?

Step 4: Validation Set

We finally come to validation, another key technical step in KNN and, more generally, in all machine learning approaches. In real life we are given a fixed training data set and have to pick the optimal value of k from it alone.

We can’t see the test data set while training the model, so what do we do? Take some time to think it through. The answer can take several forms, but below I will give my code and my explanation; comparing and contrasting approaches is a very good way to learn machine learning from different angles.
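As a preview, here is a minimal sketch of one common answer (an added illustration, not necessarily the exact approach used later): estimate the error for each candidate k with k-fold cross-validation on the training data, then pick the k with the lowest cross-validated error.

# Sketch: 5-fold cross-validation error for a given k
cv_knn_error <- function(train_data, k, n_folds=5, seed=NA){
    if(!is.na(seed)){
        set.seed(seed)
    }
    # Randomly assign each training observation to a fold
    folds <- sample(rep(1:n_folds, length.out=nrow(train_data)))
    
    fold_errors <- map_dbl(1:n_folds, function(f){
        cv_train <- train_data[folds != f, ]
        cv_val   <- train_data[folds == f, ]
        pred <- knn(train=select(cv_train, -y),
                    test=select(cv_val, -y),
                    cl=cv_train$y, k=k)
        mean(pred != cv_val$y)
    })
    
    mean(fold_errors)
}

# Only consider k values that do not exceed the training size within a fold
cv_k_values <- k_values[k_values <= floor(nrow(data) * (1 - 1/5))]
cv_errors <- map_dbl(cv_k_values, function(k) cv_knn_error(data, k, seed=3443))

# The k with the smallest cross-validated error
cv_k_values[which.min(cv_errors)]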