This is “Estimation and Prediction”, section 10.7 from the book Beginning Statistics (v. 1.0).
Consider the following pairs of problems, in the context of Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line", the automobile age and value example.
The method of solution and answer to the first question in each pair, (1a) and (2a), are the same. When we set x equal to 4 in the least squares regression equation $\widehat{y}=-2.05x+32.83$ that was computed in part (c) of Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line", the number returned,
$$\widehat{y}=-2.05\left(4\right)+32.83=24.63$$
which corresponds to value $24,630, is an estimate of precisely the number sought in question (1a): the mean $E\left(y\right)$ of all y values when x = 4. Since nothing is known about the first four-year-old automobile of this make and model that Shylock will encounter, our best guess as to its value is the mean value $E\left(y\right)$ of all such automobiles, the number 24.63 or $24,630, computed in the same way.
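The substitution above can be checked with a few lines of Python. This is only a minimal sketch: the names `SLOPE`, `INTERCEPT`, and `fitted_value` are illustrative choices, and the coefficients are simply the rounded values quoted in the text, not values recomputed from the raw data.

```python
# Coefficients of the least squares line quoted in the text:
# y-hat = -2.05 x + 32.83, with y measured in thousands of dollars.
SLOPE = -2.05
INTERCEPT = 32.83


def fitted_value(age_years: float) -> float:
    """Return y-hat (thousands of dollars) for an automobile of the given age."""
    return SLOPE * age_years + INTERCEPT


estimate = fitted_value(4)
print(f"{estimate:.2f}")            # prints 24.63 (thousands of dollars)
print(f"${estimate * 1000:,.0f}")   # prints $24,630
```

The same number serves as the point estimate in both (1a) and (2a); the two questions only part ways once an interval is built around it.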
The answers to the second part of each question differ. In question (1b) we are trying to estimate a population parameter: the mean of all the y-values in the sub-population picked out by the value x = 4, that is, the average value of all four-year-old automobiles. In question (2b), however, we are not trying to capture a fixed parameter, but the value of the random variable y in one trial of an experiment: examine the first four-year-old car Shylock encounters. In the first case we seek to construct a confidence interval in the same sense that we have done before. In the second case the situation is different, and the interval constructed has a different name, prediction interval. In the second case we are trying to “predict” where the value of a random variable will fall.
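The $100\left(1-\alpha\right)\%$ confidence interval for the mean value of y at $x={x}_{p}$ is
$$\widehat{y}_{p}\pm t_{\alpha/2}\,s_{\epsilon}\sqrt{\frac{1}{n}+\frac{\left(x_{p}-\bar{x}\right)^{2}}{SS_{xx}}}$$
where ${x}_{p}$ is the particular value of x of interest, $\widehat{y}_{p}$ is the number given by the least squares regression equation when $x={x}_{p}$, and $t_{\alpha/2}$ has $n-2$ degrees of freedom.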
The assumptions listed in Section 10.3 "Modelling Linear Relationships with Randomness Present" must hold.
The formula for the prediction interval is identical except for the presence of the number 1 underneath the square root sign. This means that the prediction interval is always wider than the confidence interval at the same confidence level and value of x. In practice the presence of the number 1 tends to make it much wider.
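The $100\left(1-\alpha\right)\%$ prediction interval for an individual new value of y at $x={x}_{p}$ is
$$\widehat{y}_{p}\pm t_{\alpha/2}\,s_{\epsilon}\sqrt{1+\frac{1}{n}+\frac{\left(x_{p}-\bar{x}\right)^{2}}{SS_{xx}}}$$
where ${x}_{p}$ is the particular value of x of interest, $\widehat{y}_{p}$ is the number given by the least squares regression equation when $x={x}_{p}$, and $t_{\alpha/2}$ has $n-2$ degrees of freedom.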
The assumptions listed in Section 10.3 "Modelling Linear Relationships with Randomness Present" must hold.
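One way to see why the extra 1 always widens the interval is to compare the two half-widths directly. The shorthand $h$ below is introduced here only for convenience and is not notation from the text; the two square-root expressions are the ones used in the worked examples of this section.

$$h=\frac{1}{n}+\frac{\left(x_{p}-\bar{x}\right)^{2}}{SS_{xx}},\qquad \frac{t_{\alpha/2}\,s_{\epsilon}\sqrt{1+h}}{t_{\alpha/2}\,s_{\epsilon}\sqrt{h}}=\sqrt{1+\frac{1}{h}}>1$$

Since $h$ is positive, the ratio exceeds 1 for every sample and every ${x}_{p}$. The smaller $h$ is (large n, ${x}_{p}$ close to $\bar{x}$), the larger the ratio becomes.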
Using the sample data of Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line", recorded in Table 10.3 "Data on Age and Value of Used Automobiles of a Specific Make and Model", construct a 95% confidence interval for the average value of all three-and-one-half-year-old automobiles of this make and model.
Solution:
Solving this problem is merely a matter of finding the values of $\widehat{y}_{p}$, $\alpha$ and $t_{\alpha/2}$, $s_{\epsilon}$, $\bar{x}$, and $SS_{xx}$ and inserting them into the confidence interval formula given just above. Most of these quantities are already known. From Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line", $SS_{xx}=14$ and $\bar{x}=4.$ From Note 10.31 "Example 7" in Section 10.5 "Statistical Inferences About ", $s_{\epsilon}=1.902169814.$
From the statement of the problem $x_{p}=3.5$, the value of x of interest. The value of $\widehat{y}_{p}$ is the number given by the regression equation, which by Note 10.19 "Example 3" is $\widehat{y}=-2.05x+32.83$, when $x=x_{p}$, that is, when x = 3.5. Thus here $\widehat{y}_{p}=-2.05\left(3.5\right)+32.83=25.655.$
Lastly, confidence level 95% means that $\alpha=1-0.95=0.05$ so $\alpha/2=0.025.$ Since the sample size is n = 10, there are $n-2=8$ degrees of freedom. By Figure 12.3 "Critical Values of ", $t_{0.025}=2.306.$ Thus
$$\begin{aligned}\widehat{y}_{p}\pm t_{\alpha/2}\,s_{\epsilon}\sqrt{\frac{1}{n}+\frac{\left(x_{p}-\bar{x}\right)^{2}}{SS_{xx}}} &= 25.655\pm\left(2.306\right)\left(1.902169814\right)\sqrt{\frac{1}{10}+\frac{\left(3.5-4\right)^{2}}{14}}\\ &= 25.655\pm 4.386403591\sqrt{0.1178571429}\\ &= 25.655\pm 1.506\end{aligned}$$
which gives the interval $\left(24.149,\,27.161\right).$
We are 95% confident that the average value of all three-and-one-half-year-old vehicles of this make and model is between $24,149 and $27,161.
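The arithmetic of this example can be reproduced as follows. This is a sketch under stated assumptions: every input is a value quoted in the solution above (nothing is recomputed from Table 10.3), the critical value 2.306 is hard-coded from the table rather than computed, and the function and variable names are illustrative.

```python
import math

# Inputs quoted in the worked example (assumed, not recomputed from raw data)
n = 10                 # sample size
x_bar = 4.0            # sample mean of x
ss_xx = 14.0           # SS_xx
s_eps = 1.902169814    # sample standard deviation of errors
t_crit = 2.306         # t_{0.025} with n - 2 = 8 degrees of freedom (table value)
x_p = 3.5              # age of interest, in years


def confidence_half_width(t, s, n_obs, xp, xbar, ssxx):
    """Half-width of the confidence interval for the mean of y at x = xp."""
    return t * s * math.sqrt(1.0 / n_obs + (xp - xbar) ** 2 / ssxx)


y_hat_p = -2.05 * x_p + 32.83   # fitted value from the regression equation
half = confidence_half_width(t_crit, s_eps, n, x_p, x_bar, ss_xx)
lower, upper = y_hat_p - half, y_hat_p + half

print(f"{y_hat_p:.3f} +/- {half:.3f}")   # prints 25.655 +/- 1.506
print(f"({lower:.3f}, {upper:.3f})")     # prints (24.149, 27.161)
```

Keeping the inputs as named variables makes it easy to rerun the calculation for a different age inside the range of the sample data.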
Using the sample data of Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line", recorded in Table 10.3 "Data on Age and Value of Used Automobiles of a Specific Make and Model", construct a 95% prediction interval for the value of a randomly selected three-and-one-half-year-old automobile of this make and model.
Solution:
The computations for this example are identical to those of the previous example, except that now there is the extra number 1 beneath the square root sign. Since we were careful to record the intermediate results of that computation, we have immediately that the 95% prediction interval is
$$\widehat{y}_{p}\pm t_{\alpha/2}\,s_{\epsilon}\sqrt{1+\frac{1}{n}+\frac{\left(x_{p}-\bar{x}\right)^{2}}{SS_{xx}}}=25.655\pm 4.386403591\sqrt{1.1178571429}=25.655\pm 4.638$$
which gives the interval $\left(21.017,\,30.293\right).$
We are 95% confident that the value of a randomly selected three-and-one-half-year-old vehicle of this make and model is between $21,017 and $30,293.
Note what an enormous difference the presence of the extra number 1 under the square root sign made. The prediction interval is about three times as wide as the confidence interval at the same level of confidence (a half-width of 4.638 compared with 1.506).
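The comparison can be verified from the intermediate results recorded in the two examples. The sketch below assumes those quoted numbers as inputs; the helper `half_width` and its `prediction` flag are hypothetical names chosen for illustration.

```python
import math

# Intermediate results quoted in the two worked examples
y_hat_p = 25.655           # fitted value at x_p = 3.5
t_times_s = 4.386403591    # t_{0.025} * s_epsilon
h = 0.1178571429           # 1/n + (x_p - x_bar)^2 / SS_xx


def half_width(scale: float, h_term: float, prediction: bool) -> float:
    """Half-width of the interval; the prediction interval adds 1 under the root."""
    extra = 1.0 if prediction else 0.0
    return scale * math.sqrt(extra + h_term)


ci_half = half_width(t_times_s, h, prediction=False)
pi_half = half_width(t_times_s, h, prediction=True)
pi_lower, pi_upper = y_hat_p - pi_half, y_hat_p + pi_half
ratio = pi_half / ci_half

print(f"({pi_lower:.3f}, {pi_upper:.3f})")   # prints (21.017, 30.293)
print(f"{ratio:.2f}")                        # prints 3.08
```

The ratio of the half-widths, 4.638 to 1.506, comes out near 3.08, which is the same as the ratio of the full interval widths.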
For the Basic and Application exercises in this section use the computations that were done for the exercises with the same number in previous sections.
For the sample data set of Exercise 1 of Section 10.2 "The Linear Correlation Coefficient"
For the sample data set of Exercise 2 of Section 10.2 "The Linear Correlation Coefficient"
For the sample data set of Exercise 3 of Section 10.2 "The Linear Correlation Coefficient"
For the sample data set of Exercise 4 of Section 10.2 "The Linear Correlation Coefficient"
For the sample data set of Exercise 5 of Section 10.2 "The Linear Correlation Coefficient"
For the sample data set of Exercise 6 of Section 10.2 "The Linear Correlation Coefficient"
For the sample data set of Exercise 7 of Section 10.2 "The Linear Correlation Coefficient"
For the sample data set of Exercise 8 of Section 10.2 "The Linear Correlation Coefficient"
For the sample data set of Exercise 9 of Section 10.2 "The Linear Correlation Coefficient"
For the sample data set of Exercise 10 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 11 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 12 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 13 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 14 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 15 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 16 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 17 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 18 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 19 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 20 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 21 of Section 10.2 "The Linear Correlation Coefficient"
For the data in Exercise 22 of Section 10.2 "The Linear Correlation Coefficient"
Large Data Set 1 lists the SAT scores and GPAs of 1,000 students.
http://www.flatworldknowledge.com/sites/all/files/data1.xls
Large Data Set 12 lists the golf scores on one round of golf for 75 golfers first using their own original clubs, then using clubs of a new, experimental design (after two months of familiarization with the new clubs).
http://www.flatworldknowledge.com/sites/all/files/data12.xls
Large Data Set 13 records the number of bidders and sales price of a particular type of antique grandfather clock at 60 auctions.
http://www.flatworldknowledge.com/sites/all/files/data13.xls