My Office Hours:

My office hours are 16:30 - 18:00 on Tuesday and 13:30 - 15:00 on Friday, at . You may check the campus map to find my office. I am prepared for your questions, so please feel free to come to my office hours. During the fall break, I will hold extra office hours 14:00 - 16:00 on Monday and 16:30 - 18:00 on Tuesday before the midterm.

Calculus Review:

\(\bullet\) After Exam 1, the requirements for integration will be lowered. However, to compute the Maximum Likelihood Estimator (MLE), you may encounter difficulty with partial differentiation. If you have studied MATH 215 or an equivalent class before, you may review those notes. If you do not know how to perform partial differentiation, you may refer to the following websites for reference:

\(\bullet\) http://tutorial.math.lamar.edu/Classes/CalcIII/PartialDerivsIntro.aspx

\(\bullet\) https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/introduction-to-partial-derivatives

These are great resources to prepare you with the essential calculus skills and knowledge of distributions for the subsequent homework and Exam 2.

Future Homework Grading Policy:

Please include the final answer for each homework question. If the final answer is not included, you risk losing 0.5 points for each missing part.

Key Points during Lecture 11:

Poisson Distribution Review:

As emphasized in the lecture, the Poisson distribution is an important and special distribution because its mean and variance are the same. That is, for a Poisson random variable \(X \sim Poisson(\lambda)\):

\[\begin{equation} \boxed{E(X) = Var(X) = \lambda} \end{equation}\] \[\begin{equation} \boxed{E(\bar X) = E(X) = \lambda, Var(\bar X) = \frac{\lambda}{n}} \end{equation}\]
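As a quick sanity check, here is a minimal R sketch (with a made-up rate \(\lambda = 4\)) that simulates Poisson draws and confirms that the sample mean and sample variance are both close to \(\lambda\):

set.seed(1)
x <- rpois(10000, lambda = 4)   # simulate 10,000 draws from Poisson(4)
mean(x)                         # close to 4
var(x)                          # also close to 4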

Linear Functions of Normal Random Variables:

\(\bullet\) If \(X_1 \sim N(\mu_1, \sigma^2_1), X_2 \sim N(\mu_2, \sigma^2_2), ......, X_n \sim N(\mu_n, \sigma^2_n)\) are independent random variables, and \(c_1, c_2, ......, c_n\) are constants, at least one of which is nonzero, then we have

\[c_1X_1 + c_2X_2 + ...... + c_nX_n \sim N\Big( \underbrace{c_1\mu_1 + c_2\mu_2 + ...... + c_n\mu_n}_{\text{=mean}},\underbrace{c_1^2 \sigma_1^2 + c_2^2\sigma^2_2 + ...... + c_n^2 \sigma^2_n}_{\text{=variance}}\Big)\]

Also, for example, if some of the terms are subtracted, we have

\[c_1X_1 - c_2X_2 + c_3 X_3 - c_4 X_4 \sim N\Big( \underbrace{c_1\mu_1 - c_2\mu_2 + c_3\mu_3 -c_4 \mu_4 }_{\text{=mean}},\underbrace{c_1^2 \sigma_1^2 + c_2^2\sigma^2_2 + c_3^2 \sigma^2_3 + c_4^2 \sigma^2_4}_{\text{=variance}}\Big)\]
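As a small numerical illustration (with made-up parameters): if \(X_1 \sim N(1, 4)\) and \(X_2 \sim N(2, 9)\) are independent, then

\[2X_1 - 3X_2 \sim N\Big(2 \cdot 1 - 3 \cdot 2, \; 2^2 \cdot 4 + 3^2 \cdot 9\Big) = N(-4,\, 97).\]

Note that the variances are always added, even when the random variables themselves are subtracted.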

Distribution of Sample mean in normal distribution (Review):

\(\bullet\) Since \(\bar X = \frac{1}{n}\sum_{i=1}^n X_i\), where \(X_1, ......, X_n\) are independent \(N(\mu, \sigma^2)\) random variables, we have

\[E(\bar X) = E\Big(\frac{1}{n}[X_1 + ...... + X_n]\Big) = \frac{1}{n} \cdot (n \mu) = \mu\]

\[Var(\bar X) = Var\Big(\frac{1}{n}[X_1 + ...... + X_n]\Big) = \frac{1}{n^2} \cdot (n \sigma^2) = \frac{1}{n}\sigma^2\]

Therefore, \(\bar X \sim N(\mu, \frac{\sigma^2}{n})\). This result is important because it articulates the fundamental concept in statistical inference that as the sample size increases, the variability of the sample mean decreases (the estimate becomes more accurate).
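Here is a minimal simulation sketch in R (with made-up values \(\mu = 0\) and \(\sigma = 2\)) illustrating that the standard deviation of \(\bar X\) shrinks like \(\sigma / \sqrt{n}\) as \(n\) grows:

set.seed(1)
sd(replicate(5000, mean(rnorm(10, mean = 0, sd = 2))))    # about 2/sqrt(10) = 0.632
sd(replicate(5000, mean(rnorm(100, mean = 0, sd = 2))))   # about 2/sqrt(100) = 0.2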

p-value (Preview):

There are multiple different definitions for the p-value in different probability and statistics books. In Casella and Berger’s book, the p-value is defined as the probability, under the null hypothesis (\(H_0\), the hypothesis that we would like to find evidence to reject), of obtaining a result equal to or more extreme than what was actually observed.

Maximum Likelihood Estimate (MLE):

As taught in the lecture, in the Poisson case, from the independence of \(X_1, X_2,...... X_{30}\), we use the multiplicative rule to set up the likelihood function. The likelihood function has the form:

\[\begin{equation} \boxed{p(X_1 = x_1) \cdot p(X_2 = x_2) \cdot\,......\,\cdot p(X_{30} = x_{30}) = \prod_{i=1}^{30} \frac{e^{-\lambda} \cdot \lambda^{x_i}}{x_i!} = \frac{e^{-30 \lambda} \lambda^{\sum_{i=1}^{30}x_i}}{\prod_{i=1}^{30}x_i!}} \end{equation}\]

The MLE can be obtained by finding the value of \(\lambda\) which maximizes this likelihood function. As taught in the lecture, we may take the derivative of the likelihood function with respect to \(\lambda\).

However, this is not the whole story. In many cases, it is difficult to take the derivative of a complicated likelihood function due to the complexity of the probability distribution. One remedy is to take the log of the likelihood function; this function is widely known as the log-likelihood. Since the log is a monotone increasing function, we can take the derivative of the log-likelihood function and obtain the same MLE result.
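For instance, applying this idea to the Poisson likelihood above (a short sketch completing the lecture example), taking the log gives

\[\ln L(\lambda) = -30\lambda + \Big(\sum_{i=1}^{30} x_i\Big) \ln\lambda - \sum_{i=1}^{30} \ln(x_i!),\]

and setting the derivative with respect to \(\lambda\) equal to 0 gives

\[\frac{d \ln L(\lambda)}{d\lambda} = -30 + \frac{\sum_{i=1}^{30} x_i}{\lambda} = 0 \rightarrow \hat\lambda = \frac{\sum_{i=1}^{30} x_i}{30} = \bar x.\]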

\(\bullet\) Based on past experience, the Maximum Likelihood Estimate (MLE) tends to be one of the harder topics. Please ask questions during the office hours whenever you encounter trouble doing the homework or revising the notes.

\(\bullet\) I will include an extra question about the MLE before the exam for revision. Again, I hope that students can help each other and post possible solutions to the extra problems, and I will definitely check them.

The probability plots:

Here are some comments about the probability plots. First, if the observations fall perfectly on the \(y=x\) line, it means that the observations (samples) fit the population values perfectly (the population may follow a variety of distributions other than the normal distribution, but the normal distribution is the most prominent case). Then we can confidently conclude that the observed sample follows roughly the same distribution as the population.

Points shift up or shift down:

It is not possible for all of the points to lie above the line. By the same logic, it is not possible for all of the points to lie below the line either. It may be the case that more points lie above or below the line, but there should be at least one point on the opposite side of the line (you may also think about in which case the data are right-skewed and in which case they are left-skewed; remember the definitions).

Cauchy distribution (Optional):

Data from a Cauchy distribution have longer tails (or heavier tails) than the normal distribution. In other words, the Cauchy distribution shows clear systematic departures from the normal distribution (the line). Look at the third plot on page , which looks approximately like a Cauchy distribution.
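If you want to see this pattern yourself, here is a minimal R sketch (with simulated data) that draws normal probability plots for a normal sample and a Cauchy sample; the heavy tails of the Cauchy sample bend away from the line at both ends:

set.seed(1)
x_norm <- rnorm(100)                  # normal sample
x_cauchy <- rcauchy(100)              # Cauchy sample (heavy tails)
qqnorm(x_norm); qqline(x_norm)        # points roughly follow the line
qqnorm(x_cauchy); qqline(x_cauchy)    # both ends bend sharply away from the line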

Shapiro-Wilk test (Optional):

The Shapiro-Wilk test is a test for normality based on the correlation between the data and the corresponding z-scores. The R code to perform the Shapiro-Wilk test for one variable is shapiro.test(). Reference website: http://www.sthda.com/english/wiki/normality-test-in-r
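For example, a minimal usage sketch (on simulated data here; replace x with your own numeric vector):

set.seed(1)
x <- rnorm(50)    # simulated data for illustration
shapiro.test(x)   # a large p-value means no evidence against normality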

Anderson-Darling test (Optional):

Another test for normality. The R code to perform the Anderson-Darling test for one variable is as follows:

install.packages("nortest")
library(nortest)
ad.test()

The output from the Anderson-Darling test reports two results: A and a p-value. The A value is the Anderson-Darling test statistic, which measures how far the tested sample departs from a normal distribution (larger values indicate a stronger departure). Then, assuming that the null hypothesis (the data come from a normally distributed process) is true, the p-value tells us the probability of obtaining a result at least as extreme as the one observed.

Reference website: https://rexplorations.wordpress.com/2015/08/11/normality-tests-in-r/

Key Points during Lecture 12:

Estimator and Estimate:

\(\bullet\) In statistics, we should be conscious of the difference between an estimator and an estimate. Generally, an estimator is a mathematical function (a rule) applied to the sample to produce an estimate, while an estimate is the value computed by applying that rule to observed data. Since the sample may be modeled as random variables, the estimator is itself a random variable (a function of the sample), while the estimate is a fixed, realized number.

\(\bullet\) In the book Introduction to Econometrics, the authors James H. Stock and Mark W. Watson state that an estimator is a function of a sample of data to be drawn randomly from a population, and it gives an educated guess of a value from the population. An estimate is the numerical value of the estimator when it is calculated using data from a specific sample.

For more explanation, you are encouraged to look at the following posts:

  1. https://stats.stackexchange.com/questions/7581/what-is-the-relation-between-estimator-and-estimate

  2. https://en.wikipedia.org/wiki/Estimator

\(\bullet\) In some cases, we may have different estimators producing the same value of the estimate.

\(\bullet\) During the class, we were given some examples: \(\bar X\) is an estimator for \(\mu\), while \(\bar x\) is the estimate because it is computed from an observed sample. The estimate is the realized, observed value from the data, while the estimator is a function of the sample.
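In R terms (a toy illustration with made-up numbers), the function mean() plays the role of the estimator, while the number it returns for a particular data set is the estimate:

x <- c(3.1, 2.7, 3.5, 2.9)   # one observed sample (made-up values)
mean(x)                      # the estimate of mu produced by the estimator "sample mean"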

MLE for normal distribution:

If \(X_1, X_2,......, X_n\) are independent normally distributed random variables with mean \(\mu\) and variance \(\sigma^2\), the joint PDF (likelihood function) is \(f(x_1, x_2,......, x_n;\mu,\sigma^2) = (2\pi)^{-\frac{n}{2}} (\sigma^2)^{-\frac{n}{2}} \cdot \exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2\Big)\). Taking the log, we obtain the log-likelihood function:

\[\begin{equation} \boxed{\ln f(x_1,x_2,......,x_n; \mu,\sigma^2) = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n(x_i - \mu)^2} \end{equation}\]

To find the MLE estimator for the mean, we take the first derivative with respect to \(\mu\) (I skipped the verification step, which requires you to take the second derivative with respect to \(\mu\) and conclude that the second derivative of the log-likelihood function, \(-\frac{n}{\sigma^2}\), is negative), and set the derivative equal to 0 (if you are having trouble with this step, please also revise the chain rule from calculus):

\[\begin{equation} \boxed{\frac{\partial \ln f(x_1,x_2,......,x_n; \mu, \sigma^2)}{\partial \mu} = - \frac{1}{2\sigma^2} \cdot 2 \cdot (-1) \cdot \sum_{i=1}^n(x_i - \mu) \overset{\text{set}}{=} 0 \rightarrow \sum_{i=1}^n x_i - n\mu = 0 \rightarrow \hat \mu = \frac{\sum_{i=1}^n x_i}{n} = \bar X} \end{equation}\]

To find the MLE estimator for the variance, we take the first derivative with respect to \(\sigma^2\) (again, I skipped the verification step, which requires you to take the second derivative with respect to \(\sigma^2\) and find that, evaluated at the MLE, \(\frac{\partial^2 \ln f}{\partial (\sigma^2)^2} = -\frac{n}{2(\hat\sigma^2)^2}\), so the second derivative of the log-likelihood function is negative there), and set the derivative equal to 0 (if you are having trouble with this step, please also revise the chain rule from calculus):

\[\begin{equation} \boxed{\frac{\partial \ln f(x_1,x_2,......,x_n; \mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2} \cdot \frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2} \cdot \sum_{i=1}^n(x_i - \mu)^2 \overset{\text{set}}{=} 0 \rightarrow \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu)^2 = n \rightarrow \hat \sigma^2 = \frac{\sum_{i=1}^n(x_i-\mu)^2}{n}} \end{equation}\]

I added a hat on top of \(\mu\) and \(\sigma^2\) to emphasize that they are, in fact, estimators. In addition, for the variance, if we substitute \(\hat \mu = \bar X\) for \(\mu\) and consider the sample variance \(S^2 =\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2\), the MLE of the variance can be rewritten as \(\hat \sigma^2 = \frac{n-1}{n}S^2\).
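As a numerical sanity check, here is a minimal R sketch (with simulated data and made-up true values) that maximizes the normal log-likelihood with optim() and compares the result to the closed-form MLEs \(\bar x\) and \(\frac{n-1}{n}S^2\):

set.seed(123)
x <- rnorm(100, mean = 5, sd = 2)    # simulated sample for illustration

negloglik <- function(par) {         # negative log-likelihood in (mu, sigma^2)
  mu <- par[1]; s2 <- par[2]
  if (s2 <= 0) return(Inf)           # keep the variance positive
  n <- length(x)
  0.5 * n * log(2 * pi) + 0.5 * n * log(s2) + sum((x - mu)^2) / (2 * s2)
}

fit <- optim(c(mean(x), 1), negloglik)            # numerical maximization (minimizing the negative)
fit$par                                           # numerical MLEs of (mu, sigma^2)
c(mean(x), (length(x) - 1) / length(x) * var(x))  # closed-form MLEs for comparison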

Properties of MLE:

\(\bullet\) Since the MLE does not necessarily exist, and need not be unique when it does, the derivation of the MLE may not always work.

\(\bullet\) Invariance Principle: the MLE of a function of a parameter is that function of the MLE. For example, \(\hat \sigma^2 = \frac{n-1}{n}S^2\) implies \(\hat \sigma = \sqrt{\frac{n-1}{n}S^2}\).

\(\bullet\) Since the MLE only approximates the minimum variance unbiased estimator (MVUE), it does not necessarily achieve the minimum variance. Students do not need to revise whether the MLE achieves the minimum variance for Exam 2.

Gamma distribution and its Family:

The probability density function of gamma(\(\alpha,\beta\)) distribution is given by:

\[\begin{equation} \boxed{f(x;\alpha,\beta) = \begin{cases} \frac{x^{\alpha-1}e^{-\frac{x}{\beta}}}{\beta^{\alpha}\Gamma(\alpha)},~\alpha>0,~\beta>0,~0<x<\infty\\ 0, \ otherwise \end{cases}} \end{equation}\]

Then, its expectation and variance are given by:

\[\begin{equation} \boxed{E(X)=\alpha\cdot\beta} \end{equation}\] \[\begin{equation} \boxed{V(X)=\alpha\cdot\beta^2} \end{equation}\]

From the lecture, we may have trouble performing integration by parts directly for

\[\begin{equation} \boxed{f(x) = \begin{cases} \frac{9}{4} x e^{-\frac{3x}{2}}, 0<x<\infty\\ 0, \ otherwise \end{cases}} \end{equation}\]

In this case, by comparing the coefficients, we find that \(\alpha -1 = 1 \rightarrow \alpha = 2\) and \(\frac{1}{\beta} = \frac{3}{2} \rightarrow \beta = \frac{2}{3}\). Therefore, we conclude that it is a \(Gamma(2,\frac{2}{3})\) distribution. Hence, the expectation and variance in this case are:

\[\begin{equation} \boxed{E(X)=\alpha\cdot\beta = 2 \cdot \frac{2}{3} = \frac{4}{3}} \end{equation}\] \[\begin{equation} \boxed{V(X)=\alpha\cdot\beta^2 = 2 \cdot \Big(\frac{2}{3}\Big)^2 = \frac{8}{9}} \end{equation}\]
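If you would like to double-check these values without doing the integration by hand, here is a minimal R sketch that integrates the given density numerically:

f <- function(x) (9/4) * x * exp(-3 * x / 2)             # the density from the example
integrate(function(x) x * f(x), 0, Inf)$value            # E(X), approximately 4/3
integrate(function(x) (x - 4/3)^2 * f(x), 0, Inf)$value  # Var(X), approximately 8/9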

Key Points during Lecture 13:

Large Sample Size:

Regarding the notion of a large sample size, in our class we use \(n \ge 30\) as the criterion.

Continuity correction:

For a sample from a binomial distribution, when we apply the normal approximation to the binomial, we employ a continuity correction to derive the probabilities.

\(\bullet\) For example, if we roll 100 dice, what is the probability that 20 or more show a 6? Here, since the sample size is \(n = 100 \ge 30\), we have a large sample, so we can use the normal approximation to the binomial and calculate the mean as \(\mu = np = 100 \cdot \Big( \frac{1}{6}\Big) = \frac{50}{3}\) and the standard deviation as \(\sigma = \sqrt{np(1-p)} = \sqrt{100 \cdot \frac{1}{6}\cdot\frac{5}{6}} = \sqrt{13.89} = 3.727\). Here we have to be particularly careful with the continuity correction: \(P(X\ge 20) \approx P(X>19.5) = P\Big(Z> \frac{19.5 - \frac{50}{3}}{3.727}\Big) = P(Z>0.76) \approx 1 - 0.7764 = 0.2236\). In general, \(P(X \ge n) \rightarrow P(X > n- 0.5)\), and we also express \(P(X=n) \rightarrow P(n-0.5 < X < n+0.5)\).

Reference: https://www.statisticshowto.datasciencecentral.com/what-is-the-continuity-correction-factor/
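To double-check this approximation (using the numbers from the example above), you can compare the exact binomial probability with the continuity-corrected normal probability in R:

1 - pbinom(19, size = 100, prob = 1/6)                          # exact P(X >= 20)
1 - pnorm(19.5, mean = 100/6, sd = sqrt(100 * (1/6) * (5/6)))   # normal approximation with continuity correction (about 0.224)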

p-value:

There are multiple different definitions for the p-value in different probability and statistics books. In Casella and Berger’s book, the p-value is defined as the probability, under the null hypothesis (\(H_0\), the hypothesis that we would like to find evidence to reject), of obtaining a result equal to or more extreme than what was actually observed.

MLE for normal distribution (Review):

If \(X_1, X_2,......, X_n\) are independent normally distributed random variables with mean \(\mu\) and variance \(\sigma^2\), the joint PDF (likelihood function) is \(f(x_1, x_2,......, x_n;\mu,\sigma^2) = (2\pi)^{-\frac{n}{2}} (\sigma^2)^{-\frac{n}{2}} \cdot \exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2\Big)\). Taking the log, we obtain the log-likelihood function:

\[\begin{equation} \boxed{\ln f(x_1,x_2,......,x_n; \mu,\sigma^2) = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n(x_i - \mu)^2} \end{equation}\]

To find the MLE estimator for the mean, we take the first derivative with respect to \(\mu\) (I skipped the verification step, which requires you to take the second derivative with respect to \(\mu\) and conclude that the second derivative of the log-likelihood function, \(-\frac{n}{\sigma^2}\), is negative), and set the derivative equal to 0 (if you are having trouble with this step, please also revise the chain rule from calculus):

\[\begin{equation} \boxed{\frac{\partial \ln f(x_1,x_2,......,x_n; \mu, \sigma^2)}{\partial \mu} = - \frac{1}{2\sigma^2} \cdot 2 \cdot (-1) \cdot \sum_{i=1}^n(x_i - \mu) \overset{\text{set}}{=} 0 \rightarrow \sum_{i=1}^n x_i - n\mu = 0 \rightarrow \hat \mu = \frac{\sum_{i=1}^n x_i}{n} = \bar X} \end{equation}\]

To find the MLE estimator for the variance, we take the first derivative with respect to \(\sigma^2\) (again, I skipped the verification step, which requires you to take the second derivative with respect to \(\sigma^2\) and find that, evaluated at the MLE, \(\frac{\partial^2 \ln f}{\partial (\sigma^2)^2} = -\frac{n}{2(\hat\sigma^2)^2}\), so the second derivative of the log-likelihood function is negative there), and set the derivative equal to 0 (if you are having trouble with this step, please also revise the chain rule from calculus):

\[\begin{equation} \boxed{\frac{\partial \ln f(x_1,x_2,......,x_n; \mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2} \cdot \frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2} \cdot \sum_{i=1}^n(x_i - \mu)^2 \overset{\text{set}}{=} 0 \rightarrow \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu)^2 = n \rightarrow \hat \sigma^2 = \frac{\sum_{i=1}^n(x_i-\mu)^2}{n}} \end{equation}\]

I added a hat on top of \(\mu\) and \(\sigma^2\) to emphasize that they are, in fact, estimators. In addition, for the variance, if we substitute \(\hat \mu = \bar X\) for \(\mu\) and consider the sample variance \(S^2 =\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2\), the MLE of the variance can be rewritten as \(\hat \sigma^2 = \frac{n-1}{n}S^2\).

Properties of MLE (Review):

\(\bullet\) Since the MLE does not necessarily exist, and need not be unique when it does, the derivation of the MLE may not always work.

\(\bullet\) Invariance Principle: the MLE of a function of a parameter is that function of the MLE. For example, \(\hat \sigma^2 = \frac{n-1}{n}S^2\) implies \(\hat \sigma = \sqrt{\frac{n-1}{n}S^2}\).

\(\bullet\) Since the MLE only approximates the minimum variance unbiased estimator (MVUE), it does not necessarily achieve the minimum variance. Students do not need to revise whether the MLE achieves the minimum variance for Exam 2.

Last Comment:

Please let me know about any typos or grammatical mistakes so that I can fix them. It is great writing practice for me, and I appreciate your help!