Assignment #5
Political Science 328
This assignment is due before 1:30 pm, Thursday, February 14, both in hard copy in the department dropbox (outside 745 SWKT) and uploaded on Learning Suite. Upload each part of the assignment separately on Learning Suite, and turn in the paper copy as four separate documents in the Political Science dropbox. Remember that no late assignments will be accepted.
Type your answers in a regular font (e.g. Times Roman 12). (As noted later, Stata .do files and .log files are displayed in Courier 8.)
This assignment is divided into four parts. You must submit your answers to each part separately, as we will have a different TA grade each part. Make sure that your name, section number, problem set number, and part number (e.g. Assignment 4, Part 1) are clearly listed on each part. Students who fail to do so may be penalized on the assignment.
If necessary, re-read the section in the syllabus on group work in Academic Honesty and Plagiarism (here) to make sure you are giving proper credit to those you work with and/or the text(s) you use for each problem. As a reminder, you are in violation of this course's policies as well as the Honor Code if you are sharing electronic portions of your assignment with other people. That includes emailing other people code (even snippets of code), .do files, Word files, or anything else related to a problem set. Your assignment must represent your own work. Please work together: we encourage you to do so! But remember that when working together you should produce your own independent work product.
Solve the following problems. Show all of your work, but keep your answers concise. Include a copy of your input and output: your .do file and your .log file. However, do not include unnecessary output (i.e. no data dumps), and format any output so that it is easily readable. Convert Stata output (logs and do-files) to Courier 8 with single-spacing. Explanations should be both statistical and substantive (explain so that a statistical layperson can understand it, and so that a statistical analyst will see your erudition). Highlight your answer.
- {30 points} [adapted from Stock]
This problem gives you an opportunity to do some calculations on the relation
between smoking and lung cancer, using a (very) small sample of five countries.
The purpose of this exercise is to illustrate the mechanics of ordinary least squares
(OLS) regression. First you will calculate the regression "by hand" using formulas
from class and the textbook, then (in the next problems) you will use Stata to confirm the calculation.
For the “by hand” calculations, you may relive history and use long multiplication,
long division, and tables of square roots; or you may use an electronic
calculator or a spreadsheet.
The data are summarized in the following table. The variables are per capita
cigarette consumption in 1930 (the independent variable, “X”) and the death
rate from lung cancer in 1950 (the dependent variable, “Y”). The cancer rates
are shown for a later time period because it takes time for lung cancer to develop and
be diagnosed.
| Observation # | Country | Cigarettes consumed per capita in 1930 (X) | Lung cancer deaths per million people in 1950 (Y) |
|---|---|---|---|
| 1 | Switzerland | 530 | 250 |
| 2 | Finland | 1115 | 350 |
| 3 | Great Britain | 1145 | 465 |
| 4 | Canada | 510 | 150 |
| 5 | Denmark | 380 | 165 |
Source: Edward R. Tufte, Data Analysis for Politics and Policy, Table 3.3.
Compute the following (parts (a) through (i)). You may use a calculator, Excel or other spreadsheet programs (using no more than SUM and AVERAGE commands). Refer to the textbook for the necessary formulas; various textbook formulas are also posted on Learning Suite. (Note: Remember to show your work. Here, that means stating the formula, showing your plugged-in values in that formula, and then stating your answer. In addition, if you use a spreadsheet, attach a printout. If you calculate by hand, round to 3 digits in reporting answers, but keep all of the digits as you use calculations from a previous part in a subsequent part.)
- The sample means of X and Y.
- The standard deviations of X and Y.
- The correlation coefficient, r, between X and Y.
- b1, the OLS estimated slope coefficient from the regression $Y_{i}= \beta_{0} + \beta_{1} X_{i} + u_{i}.$
- b0, the OLS estimated intercept term from the same regression.
- $\hat{Y}_{i}, i=1,...,n$, the predicted values for each country from the regression.
- $\hat{u}_{i}$, the OLS residual for each country.
- The $R^{2}$.
- The SER.
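The mechanics behind parts (a) through (i) can be sketched as follows. This is only an illustration in Python on made-up numbers (the lists `x` and `y` are hypothetical and are not the table above); it uses nothing beyond sums and averages, and it assumes the usual textbook conventions of dividing by n - 1 for the sample standard deviation and by n - 2 for the SER. It is not a substitute for stating each formula and showing your plugged-in values.

```python
import math

# Hypothetical toy data (NOT the assignment's data), chosen to keep the arithmetic simple.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

# (a) Sample means
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squared deviations and of cross-products around the means
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# (b) Sample standard deviations (assumed convention: divide by n - 1)
sd_x = math.sqrt(s_xx / (n - 1))
sd_y = math.sqrt(s_yy / (n - 1))

# (c) Correlation coefficient
r = s_xy / math.sqrt(s_xx * s_yy)

# (d), (e) OLS slope and intercept
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# (f), (g) Predicted values and residuals
y_hat = [b0 + b1 * xi for xi in x]
u_hat = [yi - yh for yi, yh in zip(y, y_hat)]

# (h), (i) R-squared and standard error of the regression (assumed: divide SSR by n - 2)
ssr = sum(u ** 2 for u in u_hat)
r_squared = 1 - ssr / s_yy
ser = math.sqrt(ssr / (n - 2))

print(f"means: x = {x_bar:.3f}, y = {y_bar:.3f}")
print(f"sd: x = {sd_x:.3f}, y = {sd_y:.3f}, r = {r:.3f}")
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"R^2 = {r_squared:.3f}, SER = {ser:.3f}")
```

Keeping the intermediate sums (`s_xx`, `s_yy`, `s_xy`) at full precision and rounding only when printing mirrors the instruction to round reported answers while carrying all digits forward.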
- {24} [adapted from Stock and Watson] Using the data of the previous problem, input the data into Stata. Throughout this problem, be aware of units: which values have units and which values do not.
- Calculate the same statistics as above using Stata. (Do not use the "robust" option in your regression since you only have 5 observations.) Present your results in standard equation format (e.g. standard errors in parentheses under coefficients).
- Using Stata, create a beautiful graph of the scatterplot of the five
data points and the regression line. Be sure to label the axes, the data points,
the residuals, and the slope and intercept of the regression line. (It is OK to write in some of these by hand.)
- Interpret what the coefficient values, b0 and b1, mean.
- Interpret what $\hat{Y}_{i}$ and $\hat{u}_{i}$ are, using Finland as an example.
- Interpret the SER (including its units) and $R^{2}$ (including its units).
- Will the regression give reliable predictions for a country that consumes 2000 cigarettes per capita? Why or why not?
- Compute and interpret the estimated change in deaths for a country which reduces its cigarette consumption by 500 cigarettes per capita.
- Are the three assumptions in Key Concept 4.3 satisfied? Explain (each one).
- 3.1 {7} [adapted from Stock and Watson] Suppose that a researcher, using data on class size (CS) and average test scores from 100 third-grade classes, estimates the OLS regression:
$$
\begin{array}{rrrl}
\widehat{TestScore}= & 410.2 & -6.32 & \times CS, \quad R^{2}=0.12, \quad SER=14.2. \\
& (16.8) & (2.43) &
\end{array}
$$
- Construct a 95% confidence interval for $\beta_{1}$, the regression slope coefficient.
- Calculate the p-value for the two-sided test of the null hypothesis $H_{0}: \beta_{1}=0$. Do you reject the null hypothesis at the 5% level? At the 1% level?
- Calculate the p-value for the two-sided test of the null hypothesis $H_{0}: \beta_{1}=-6.0$. Without doing any additional calculations, determine whether $-6.0$ is contained in the 95% confidence interval for $\beta_{1}$.
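The calculations in this problem follow a common pattern: form a t-statistic from the estimate, the hypothesized value, and the standard error, then use the standard normal distribution for the p-value and the critical value. A minimal sketch in Python, assuming the large-sample normal approximation and using hypothetical numbers (`beta_hat`, `se_beta`, and `beta_null` below are invented for illustration and are not the values in the regression above):

```python
from statistics import NormalDist

# Hypothetical estimates (NOT the values from the problem above)
beta_hat = -5.0    # estimated slope
se_beta = 2.0      # standard error of the slope
beta_null = 0.0    # slope value under the null hypothesis

z = NormalDist()   # standard normal, used as the large-sample approximation


def two_sided_p_value(estimate, null_value, std_error):
    """Two-sided p-value for H0: coefficient == null_value (normal approximation)."""
    t_stat = (estimate - null_value) / std_error
    return 2 * (1 - z.cdf(abs(t_stat)))


# 95% confidence interval: estimate +/- critical value * standard error
crit = z.inv_cdf(0.975)  # approximately 1.96
ci_95 = (beta_hat - crit * se_beta, beta_hat + crit * se_beta)

p_value = two_sided_p_value(beta_hat, beta_null, se_beta)

print(f"critical value = {crit:.3f}")
print(f"95% CI = [{ci_95[0]:.3f}, {ci_95[1]:.3f}]")
print(f"p-value = {p_value:.4f}")
```

Changing `beta_null` reuses the same function for any other hypothesized slope value.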
- 3.2 {10 points} Do Problem 5.2 in Stock and Watson. Add the following part:
- Explain why this regression is likely to suffer from omitted variable bias. Name a possible omitted variable, and then determine whether the regression is likely to over- or under-estimate the effect of gender on wages (i.e. what the direction of bias is for the included coefficient). You may use the formula we discussed in class to assess positive/negative bias, or Stock & Watson Equation 6.1.
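For reference, the omitted variable bias formula is usually written in the form below. It is reproduced here on the assumption that it matches the Stock & Watson equation cited above, so check it against your copy of the text. Here $\rho_{Xu}$ is the correlation between the included regressor and the error term (which absorbs the omitted variable), and $\sigma_{u}$ and $\sigma_{X}$ are their standard deviations:

$$
\hat{\beta}_{1} \xrightarrow{\;p\;} \beta_{1} + \rho_{Xu}\,\frac{\sigma_{u}}{\sigma_{X}}
$$

Because $\sigma_{u}$ and $\sigma_{X}$ are positive, the sign of the bias term is the sign of $\rho_{Xu}$.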
- {28} [adapted from Stock and Watson] The Excel data file Growth (found on the assignment page in Learning Suite) contains data on average growth rates from 1960 through 1995 for 65 countries along with variables that are potentially related to growth. (A detailed description is also found on the assignment page in Learning Suite.) In this problem, you will investigate the relationship between growth and trade. Import the data into Stata. [Note: Make sure you can import the data by yourself for this problem, or you will have difficulties on the exams.] Throughout this problem, be aware of units: which values have units and which values do not.
- Construct a scatterplot of average annual growth rate (Growth) on the average trade share (TradeShare). Does there appear to be a relationship between the variables?
- One country, Malta, has a trade share much larger than the other countries. Find Malta on the scatterplot. Does Malta look like an outlier?
- Using all observations, run a regression of Growth on TradeShare, using the "robust" option. Present your results in standard equation format. What is the estimated intercept? What is the estimated slope? Interpret the estimated slope. Is the slope coefficient statistically significantly different from zero at the 5% significance level? What is the p-value? Show how you reach this conclusion.
- Estimate the same regression, excluding the data from Malta. Present your results in standard equation format. Answer the same questions as in (c).
- Plot the estimated regression functions from (c) and (d). Using the scatterplot in (a), explain why the regression function that includes Malta differs from the one that excludes Malta.
- Why is the Malta trade share so large? Explain why Malta should be included in or excluded from the analysis.
- Using all observations, report and interpret the 95% confidence interval for the slope of the population regression line.
- Using all observations, use the regression to predict the growth rate for a country with a trade share of 0.5. Similarly, using all observations, use the regression to predict the growth rate for a country with a trade share equal to 1.0. Calculate the confidence interval for the predicted effect of increasing trade share by 0.5. (Hint: See Stock & Watson Equation 5.13.)
- Using all observations, what is the $R^{2}$ of this regression? What does this mean?
- Using all observations, compute the correlation coefficient between Growth and TradeShare, and compare its square to the $R^{2}$. How are the correlation coefficient and the $R^{2}$ related?
- Using all observations, what is the value of the standard error of the regression? What does this mean?
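For the predicted-effect item in the problem above, the usual large-sample 95% confidence interval for the effect of changing the regressor by $\Delta x$ is sketched below. This is the standard textbook form, stated on the assumption that it is what the hint's equation refers to; verify it against your copy of the text. The predicted change is $\Delta\hat{Y} = \hat{\beta}_{1}\,\Delta x$, and the interval scales the endpoints of the slope's confidence interval by $\Delta x$:

$$
\left[\left(\hat{\beta}_{1} - 1.96\,SE(\hat{\beta}_{1})\right)\Delta x,\;\; \left(\hat{\beta}_{1} + 1.96\,SE(\hat{\beta}_{1})\right)\Delta x\right]
$$

Note that the interval inherits the units of the dependent variable, since $\hat{\beta}_{1}$ is measured in units of Y per unit of X and $\Delta x$ is in units of X.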
- {1} Complete the Time Spent Survey. State your survey completion code at the top of your Part 4 packet (next to your name, section, etc.).