Office hours this week are 16:30 - 18:00 on Tuesday and 13:30 - 15:00 on Friday. You may check the campus map to find my office. I am prepared for your questions, so please feel free to come to office hours.
\(\bullet\) I will upload the questions about MLE and MSE before Wednesday, so you may ask about these review questions in addition to the homework 9 problems.
\(\bullet\) I will also give out hypothesis testing and confidence interval revision problems on Wednesday. Good luck with your revision. I will write out the solutions on Friday, but before seeing the solutions, I STRONGLY ENCOURAGE you to discuss the problems with friends or work through them on your own.
Please include the final answer for each homework question. If a final answer is missing, you risk losing 0.5 points for each missing part.
A statistic is a numerical value computed from a sample. Therefore, a statistic reflects just a fraction of the population. We typically use statistics to estimate parameters.
So what is a parameter? A parameter is a fixed numerical value, the true value for the population; it reflects the aggregate of all population members under consideration. The difference between the two is described in detail on the following website.
Reference: https://keydifferences.com/difference-between-statistic-and-parameter.html
\(\bullet\) The first step is to define the parameter correctly! The parameter should be the population/true mean, or the difference between the population/true means of A and B. If you write "sample mean" or just "mean," you will lose points.
\(\bullet\) To check the conditions, first check whether the sample is random. If the problem says so, great! Just move on! If not, either explain with your own reasoning why you think the collected sample is or is not random, or draw a scatterplot if the sample data are given (if you have a good sense of the data, you can explain clearly in plain words). This is important because we need a random sample to perform either a t-test or a z-test.
Then, we check
\(\bullet\) 1. Whether the underlying population distribution is normal: we check if we are told that the distribution of the population from which the measurements are taken is normal.
\(\bullet\) 2. If not 1, then we check whether the sample size is large enough (\(n\ge 30\)) to employ the central limit theorem, which says that the sampling distribution of the sample mean of the measurements is approximately normal.
\(\bullet\) 3. If not 1 and not 2, we may use a QQ-plot (or perform a normality test) to look at the data and see if the population seems to be normally distributed.
\(\bullet\) 4. If the data are not given or the sample size is small (\(n<30\)), then we rely on the robustness of the t procedures against violations of normality. In short, we should check (not PROVE) randomness and normality.
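Step 3 above (QQ-plot or normality test) can be sketched in Python, assuming scipy is available; the sample below is simulated purely for illustration:

```python
# A minimal sketch of the normality checks in step 3, on made-up data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=20)  # hypothetical small sample

# Shapiro-Wilk normality test: a large p-value means no evidence
# against normality (it does NOT prove the population is normal).
stat, p = stats.shapiro(sample)
print(f"Shapiro-Wilk statistic = {stat:.3f}, p-value = {p:.3f}")

# QQ-plot coordinates: theoretical normal quantiles vs ordered sample
# values; an approximately straight line suggests normality.
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print(f"QQ-plot correlation r = {r:.3f}")
```

Plot `osm` against `osr` (e.g. with matplotlib) to inspect the QQ-plot visually.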
Then you can construct confidence interval as follows:
\(\bullet\) If \(\sigma\) is given, skip this part. Otherwise, compute \(s\), the sample standard deviation, which is the square root of the sample variance.
\(\bullet\) Find the t-score or z-score from the table, given the specified significance level. For two tails (keywords: between, \(\pm\)), we find \(t_{\frac{\alpha}{2}}\) or \(z_{\frac{\alpha}{2}}\); for one tail (keywords: no greater than, no less than), we find the confidence upper bound (CUB) or confidence lower bound (CLB) using \(t_{\alpha}\) or \(z_{\alpha}\).
\(\bullet\) We are approximately (95%/99%) confident that the population mean of …what the question says… is (between/no greater than/no smaller than) …the confidence interval….
\(\bullet\) With a p-value greater than the significance level (such as 0.05 or 0.01), we fail to reject \(H_0\), the null hypothesis, and we say: there is not sufficient evidence to suggest that the population mean of …what the question says… is (different from/greater than/smaller than) …the number….
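The one-sample confidence interval steps above can be sketched in Python; the data and confidence level below are hypothetical:

```python
# A minimal sketch of the one-sample t confidence interval recipe,
# on hypothetical measurements x.
import math
from scipy import stats

x = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]
n = len(x)
xbar = sum(x) / n
# Sample standard deviation s: square root of the sample variance.
s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))

alpha = 0.05                                    # 95% confidence
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # two-tailed t_{alpha/2}
margin = t_crit * s / math.sqrt(n)

print(f"{1 - alpha:.0%} CI for the population mean: "
      f"({xbar - margin:.3f}, {xbar + margin:.3f})")
```

For a one-tailed bound, use `stats.t.ppf(1 - alpha, df=n - 1)` instead and keep only the upper (CUB) or lower (CLB) endpoint.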
Before reading this, be careful not to confuse the two-sample procedures below with the one-sample hypothesis testing and confidence interval covered in class notes 15 and 16.
Now,
\(\bullet\) Our parameter of interest is \(\mu_X - \mu_Y\), the difference between the population means of X and Y.
\(\bullet\) Hypothesis: In most cases we construct the hypotheses as follows: \(H_0: \mu_X - \mu_Y = 0\) vs. \(H_1: \mu_X - \mu_Y \ne 0\) (two-tailed); you may check your notes for the one-tailed situation.
\(\bullet\) If the two samples are stated to be random, you are good to go! If not, we need to assume independence within the two samples and proceed.
\(\bullet\) If the underlying distributions are given as normal, you are good to go! If not, we first check the sample sizes. If the sample sizes are large, then we can use the CLT to say that the sampling distribution of the difference between sample means is approximately normal. If the sample sizes are small, we cannot use the CLT and need to rely on the robustness of the t procedures and perform the two-sample t-test.
\(\bullet\) Sample statistics: First we compute the degrees of freedom for the two-sample t-test.
\(\bullet\) With a p-value lower than the significance level (such as 0.05 or 0.01), we reject \(H_0\), the null hypothesis, and we say: there is sufficient evidence to suggest that the population mean of A is (different from/greater than/smaller than) the population mean of B.
\(\bullet\) With a p-value greater than the significance level (such as 0.05 or 0.01), we fail to reject \(H_0\), the null hypothesis, and we say: there is not sufficient evidence to suggest that the population mean of A is (different from/greater than/smaller than) the population mean of B.
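The two-sample test above can be sketched with scipy's built-in Welch option (`equal_var=False`, matching the unequal-variance procedure our class uses); the data are hypothetical:

```python
# A minimal sketch of the two-sample (Welch) t-test on made-up data.
from scipy import stats

a = [23.1, 24.5, 22.8, 25.0, 23.9, 24.2]   # hypothetical sample A
b = [21.0, 22.3, 20.8, 21.9, 22.1, 21.5]   # hypothetical sample B

# equal_var=False gives the Welch test (no equal-variance assumption).
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)  # two-tailed p

alpha = 0.05
if p_value < alpha:
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}: reject H0 -- "
          "sufficient evidence the population means differ.")
else:
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}: fail to reject H0.")
```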
During class 17 (Monday) there were concerns about when to use the z-score versus the t-score. The professor's explanation is that, after checking randomness:
\(\bullet\) If one of the sample sizes is small (\(n<30\)), use the t-test. You could also plot a normal Q-Q plot to see whether both samples are approximately normal; if yes, you are good to use either the z-test or the t-test. In short, the t-test is the safe choice.
\(\bullet\) If the population standard deviation \(\sigma_X\) or \(\sigma_Y\) (for either sample) is unknown, use the t-test.
\(\bullet\) If the sample size is large, the underlying distribution is given as normal, and the population standard deviation is known, use the z-test.
\(\bullet\) If the sample size is large (\(n \ge 30\)), the underlying distribution is unknown, and the population standard deviation is known, we still need to employ the Central Limit Theorem, and use the z-test with approximated probability \(P(\bar X>t) \approx P\left(Z>\frac{t-\mu}{\sigma/\sqrt{n}}\right)\).
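The z approximation in the last bullet can be sketched numerically; the null mean, \(\sigma\), and observed sample mean below are hypothetical:

```python
# A minimal sketch of the CLT-based z approximation
# P(Xbar > t) ≈ P(Z > (t - mu) / (sigma / sqrt(n))), on made-up numbers.
import math
from scipy import stats

mu, sigma, n = 100.0, 15.0, 36    # null mean, known sigma, sample size
xbar = 104.0                      # hypothetical observed sample mean

# The standard error of the sample mean is sigma / sqrt(n).
z = (xbar - mu) / (sigma / math.sqrt(n))
p_value = 1 - stats.norm.cdf(z)   # one-tailed: P(Z > z)
print(f"z = {z:.2f}, one-tailed p-value = {p_value:.4f}")
```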
\(\bullet\) The first step is to define the parameter correctly! The parameter should be the difference between the population/true means of A and B. If you write "sample mean" or just "mean," you will lose points.
\(\bullet\) To check the conditions, first check whether the two samples are random. If the problem says so, great! Just move on! If not, either explain with your own reasoning why you think the two samples are or are not random, or draw a scatterplot if the sample data are given (if you have a good sense of the data, you can explain clearly in plain words). This is important because we need random samples to perform either a t-test or a z-test.
Then, we check
\(\bullet\) 1. Whether the underlying population distributions of the two samples are normal: we check if we are told that the distributions of the two populations from which the measurements are taken are normal.
\(\bullet\) 2. If not 1, then we check whether the sample sizes are large enough (\(n\ge 30\)) to employ the central limit theorem, which says that the sampling distributions of the sample means of the measurements are approximately normal.
\(\bullet\) 3. If not 1 and not 2, we may use QQ-plots (or perform normality tests) to look at the data and see if the populations seem to be normally distributed. (This is good practice for the homework.)
\(\bullet\) 4. If the data are not given or the sample sizes are small (\(n<30\)), then we rely on the robustness of the t procedures against violations of normality. In short, we should check (not PROVE) randomness and normality.
Then you can construct confidence interval as follows:
\(\bullet\) We compute the degrees of freedom \(v\) using the same formula as for the hypothesis test.
\(\bullet\) If \(\sigma_X\) and \(\sigma_Y\) are given, compute \(\sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}\). Otherwise, compute \(s_X\) and \(s_Y\), which are the square roots of the sample variances. Then, compute \(\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}\).
\(\bullet\) Find the t-score or z-score from the table, given the specified significance level. For two tails (keywords: between, \(\pm\)), we find \(t_{v, \frac{\alpha}{2}}\) or \(z_{\frac{\alpha}{2}}\), and the confidence interval is (most often)
\[\begin{equation} \boxed{(\bar x - \bar y) \pm t_{v, \frac{\alpha}{2}} \cdot \sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}} \end{equation}\]\(\bullet\) For one-tail (keyword: no greater than, no less than), we find the confidence upper bound (CUB) or confidence lower bound (CLB) by finding \(t_{v, \alpha}\) or \(z_{\alpha}\). The confidence upper bound is \((\bar x - \bar y) + t_{v, \alpha} \cdot \sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}\) and the confidence lower bound is \((\bar x - \bar y) - t_{v, \alpha} \cdot \sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}\)
\(\bullet\) We are approximately (95%/99%) confident that the difference between the population means of …what the question says… is (between/no greater than/no smaller than) …the result….
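The two-sample confidence interval steps above can be sketched end to end in Python; the samples are hypothetical:

```python
# A minimal sketch of the two-sample t confidence interval with the
# Welch-Satterthwaite degrees of freedom, on made-up data.
import math
from scipy import stats

x = [5.1, 4.8, 5.4, 5.0, 4.9, 5.2]   # hypothetical sample X
y = [4.2, 4.5, 4.1, 4.4, 4.3, 4.6]   # hypothetical sample Y

nx, ny = len(x), len(y)
xbar, ybar = sum(x) / nx, sum(y) / ny
sx2 = sum((v - xbar) ** 2 for v in x) / (nx - 1)   # sample variances
sy2 = sum((v - ybar) ** 2 for v in y) / (ny - 1)

se = math.sqrt(sx2 / nx + sy2 / ny)
# Welch-Satterthwaite degrees of freedom (same formula as the test),
# rounded down to the nearest integer as the notes describe.
v = math.floor(se**4 / ((sx2/nx)**2 / (nx - 1) + (sy2/ny)**2 / (ny - 1)))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=v)
lo = (xbar - ybar) - t_crit * se
hi = (xbar - ybar) + t_crit * se
print(f"95% CI for mu_X - mu_Y: ({lo:.3f}, {hi:.3f})")
```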
Please let me know of any typos or grammatical mistakes so I can fix them. It is great writing practice for me, and I appreciate your help!
Comments on degrees of freedom for the two-sample t-test:
Whether or not we assume equal variances significantly changes the resulting degrees of freedom. Many other probability books assume equal variances, but our class does not.
In our class, we find the degrees of freedom by:
\[\begin{equation} \boxed{v = \frac{\Big(\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}\Big)^2}{\frac{1}{n_X-1}\cdot (\frac{s_X^2}{n_X})^2 + \frac{1}{n_Y -1} \cdot (\frac{s_Y^2}{n_Y})^2}} \end{equation}\]We typically round the degrees of freedom down to the nearest integer. To see how this difference is addressed, you are welcome to browse the following website:
Reference: https://www.statsdirect.co.uk/help/parametric_methods/utt.htm
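The boxed formula above translates directly into a few lines of arithmetic; the sample variances and sizes below are hypothetical:

```python
# A minimal numeric sketch of the Welch-Satterthwaite degrees of
# freedom, using made-up sample statistics.
sx2, nx = 4.0, 10    # hypothetical sample variance and size for X
sy2, ny = 9.0, 12    # hypothetical sample variance and size for Y

num = (sx2 / nx + sy2 / ny) ** 2
den = (sx2 / nx) ** 2 / (nx - 1) + (sy2 / ny) ** 2 / (ny - 1)
v = num / den
print(f"v = {v:.2f}, rounded down to {int(v)}")
```

Note the result is usually not an integer, which is why the notes round it down before consulting the t-table.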
\(\bullet\) Sample statistics: We compute the t-statistic by:
\[\begin{equation} \boxed{t = \frac{(\bar x - \bar y) - 0}{\sqrt{\frac{s_X^2}{n_X}+ \frac{s_Y^2}{n_Y}}}} \end{equation}\]Upon finding the t-value we refer back to the t-table and find the range of the corresponding p-value.
\(\bullet\) If the hypothesis test is one-tailed, we use that p-value directly; if the hypothesis test is two-tailed, we need to multiply the p-value by 2.
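The t-statistic and the one- vs two-tailed p-value rule above can be sketched in Python, using scipy's t distribution in place of the table; the sample statistics are hypothetical:

```python
# A minimal sketch of the two-sample t-statistic and p-value rule,
# on made-up sample statistics.
import math
from scipy import stats

xbar, ybar = 52.0, 48.5    # hypothetical sample means
sx2, nx = 25.0, 30         # hypothetical sample variance and size, X
sy2, ny = 36.0, 32         # hypothetical sample variance and size, Y

se = math.sqrt(sx2 / nx + sy2 / ny)
t = ((xbar - ybar) - 0) / se         # t-statistic under H0: mu_X = mu_Y

# Welch degrees of freedom, rounded down:
v = math.floor(se**4 / ((sx2/nx)**2 / (nx - 1) + (sy2/ny)**2 / (ny - 1)))

p_one = 1 - stats.t.cdf(abs(t), df=v)   # one-tailed p-value
p_two = 2 * p_one                       # two-tailed: multiply by 2
print(f"t = {t:.3f}, df = {v}, one-tail p = {p_one:.4f}, "
      f"two-tail p = {p_two:.4f}")
```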