Assignment #11

PPol 603
Due: Thursday, 6 December 2012

Type up your answers. Give proper credit to those you work with and/or the text(s).

Solve the following problems. Show all of your work, but keep your answers concise. Highlight your final answer to distinguish it from your other numbers and text. Include a copy of your input (e.g. do file) or output (e.g. log file) when it is an appropriate way to show your work. However, do not include unnecessary output (i.e. no data dumps), and format any output so that it is easily readable. An appropriate time to include output is when you put your results in a table: if your results are wrong, graders cannot see how you reached your conclusions (and so cannot give partial credit) unless you provide some output. Your explanations should be both statistical and substantive: explain so that a statistical layperson can understand, and so that a statistical analyst will see your erudition.

  1. {50 points} [modified from Stock and Watson 2007, E11.1 and E11.2] "It has been conjectured that workplace smoking bans induce smokers to quit by reducing their opportunities to smoke. In this assignment you will estimate the effect of workplace smoking bans on smoking using data on a sample of 10,000 U.S. indoor workers from 1991-1993, available on the textbook Web site www.aw-bc.com/stock_watson in the file Smoking. The data set contains information on whether individuals were or were not subject to a workplace smoking ban, whether the individuals smoked, and other individual characteristics. A detailed description is given in Smoking_Description, available on the Web site." Report the results in a table similar to Problem 3 of Assignment 10, where you report the coefficient and standard error for smkban, and not for other variables (but you do report whether those variables are included or not, as well as model statistics).
    a. Estimate a probit model with smoker as the dependent variable and smkban as a regressor. How does a workplace smoking ban affect smoking? Is smkban statistically significant?
    b. Estimate a probit model with smoker as the dependent variable and the following regressors: smkban, female, age, age2, hsdrop, hsgrad, colsome, colgrad, black, and hispanic. How does a workplace smoking ban affect smoking? Is smkban statistically significant? Compare the estimated effect of a smoking ban from this regression with your answer from (a). Suggest a reason, based on the substance of this regression, explaining the change in the estimated effect of a smoking ban between (a) and (b).
    c. Test the hypothesis that the probability of smoking does not depend on the level of education in the probit model of (b). Does the probability of smoking increase or decrease with the level of education?
    d. Discuss the fit of the two models generally using chi2, pseudo-R2, percentage correctly predicted, and proportional reduction of error.
    e. Mr. A is white, non-Hispanic, 20 years old, and a high school dropout. Using the probit regression from (b), and assuming that Mr. A is not subject to a workplace smoking ban, calculate the probability that Mr. A smokes. Carry out the calculation again assuming that he is subject to a workplace smoking ban. What is the effect of the smoking ban on the probability of smoking?
    f. Repeat (e) for Ms. B, a 40-year-old black female college graduate.
    g. Repeat (e) and (f) using a linear probability model (with the same independent variables as in (b)).
    h. Based on the answers to (e)-(g), fill in the table provided here. Note that the table uses full variable names, something you should use in professional reports. Do the probit and linear probability model results differ? If they do, which results make more sense? Are the estimated effects large in a real-world sense?
    i. Based on the regression in (b), is there a nonlinear relationship between age and the probability of smoking? Plot the relationship between the probability of smoking and age for a white, non-Hispanic male college graduate with no workplace smoking ban.
    j. Are there important remaining threats to internal validity?
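The mechanics behind parts (e)-(g) can be sketched in a few lines. In a probit model the predicted probability is the standard normal CDF evaluated at the linear index, and the effect of the ban is the difference between two such probabilities. This is a minimal illustration only: the coefficient values and the dictionary layout below are hypothetical, chosen for the example, and are not estimates from the Smoking data.

```python
from statistics import NormalDist


def linear_index(coefs, x):
    """Return x'b: the sum of coefficient * value over the regressors in x."""
    return sum(coefs[name] * value for name, value in x.items())


def probit_probability(coefs, x):
    """Probit: P(y = 1 | x) is the standard normal CDF evaluated at x'b."""
    return NormalDist().cdf(linear_index(coefs, x))


# HYPOTHETICAL coefficients for illustration only; not estimates from the data.
coefs = {"const": -0.50, "smkban": -0.20, "hsdrop": 0.30}

# A hypothetical high school dropout, first without and then with a ban.
no_ban = {"const": 1.0, "smkban": 0.0, "hsdrop": 1.0}
ban = {"const": 1.0, "smkban": 1.0, "hsdrop": 1.0}

p_no_ban = probit_probability(coefs, no_ban)
p_ban = probit_probability(coefs, ban)
effect = p_ban - p_no_ban  # change in predicted probability due to the ban

print(f"no ban: {p_no_ban:.3f}  ban: {p_ban:.3f}  effect: {effect:+.3f}")
```

In a linear probability model the predicted probability is the linear index itself, so the same two profiles would be compared using `linear_index` directly rather than `probit_probability`.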
  2. {50 points} [from Hillygus and Shields 2005 via Glynn 2007] For this problem, we will use the dataset that is posted here. The dataset is from a "nationally representative and randomly sampled" post-2004-election survey, and includes the following variables:
    a. Begin by reporting sample means for bush and ideology in the dataset in a table. You should report the mean after breaking the data into four groups: African-American men, African-American women, non-African-American men, non-African-American women. Include the sample size for each group as well. Where are there systematic differences in these two covariates across these four groups (and where not)?
    b. For each of the four groups in part a., calculate the probability that a voter in that group voted for Bush in 2004. Convert those probabilities to odds and report the odds. Calculate the ratio of odds for each other group to the odds for non-African American men. In other words, fill in the following table (on your own paper):
    Vote for Bush                        non-AA men   AA men   non-AA women   AA women
    Probability of vote for Bush
    Odds of vote for Bush
    Odds ratio relative to non-AA men    NA
    Interpret the quantities in this table.
    c. Fit a logistic regression model using afram, female, and afram×female (and an intercept). Report the results from your regression in a table with three columns: coefficient, robust standard error, and odds ratio. Compare the estimates in the odds ratio column to the estimates that you generated in part b.; what does each of these quantities represent (i.e. interpret)? Can you reject the null hypothesis that the odds of voting for Bush are the same for African-American and non-African-American men?
    d. Obtain the predicted probabilities of voting for Bush for the four categories in the model. Compare the predictions to the probabilities that you generated in part b.
    e. Add ideology to the model that you fit in part c. and estimate the new model. Report the results of the two models in one table with one column for each model. Report the odds ratio on ideology and a 95% confidence interval for this odds ratio. Substantively interpret this odds ratio. How does the effect of ideology on the probability of voting for Bush depend on the other characteristics of the respondent? Intuitively, why does this occur?
    f. Discuss the fit of the two models generally using chi2, pseudo-R2, percentage correctly predicted, and proportional reduction of error.
    g. Create a plot showing the predicted probability of voting for Bush, with its uncertainty, as a function of ideology for non-African-American men. Do the same for African-American men. Then create a plot showing the predicted probabilities (but not the uncertainty) for these two categories on the same graph. Explain what you observe in these three plots.
    h. Using the model including ideology, test for multicollinearity (using vif) and show your results.
    i. Using the model including ideology, test for misspecification (functional form or omitted variables using linktest) and show your results.
    j. Using the model including ideology, test for outliers (using deviance residuals). What are the characteristics of the person(s) that fit the model worst?
    k. Using the model including ideology, test for influential observations (using dbeta). What are the characteristics of the person(s) that are most influential in the model?
    l. Using the model including ideology, discuss the strengths and weaknesses of this model (i.e. discuss internal and external validity).
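The conversions asked for in part b. of Problem 2 (probability to odds, then odds to an odds ratio against a reference group) follow directly from the definitions. The sketch below shows the arithmetic under stated assumptions: the function names are illustrative, and the two probabilities are made-up values, not results from the survey data.

```python
def odds(p):
    """Convert a probability p in [0, 1) to odds p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


def odds_ratio(p_group, p_reference):
    """Odds for a group divided by the odds for the reference group."""
    reference_odds = odds(p_reference)
    if reference_odds == 0.0:
        raise ValueError("reference odds must be positive")
    return odds(p_group) / reference_odds


def probability_from_odds(o):
    """Invert the conversion: odds o >= 0 back to a probability."""
    return o / (1.0 + o)


# HYPOTHETICAL probabilities for illustration only; not survey results.
p_reference = 0.50  # stand-in for the reference group
p_group = 0.20      # stand-in for a comparison group

print(f"reference odds: {odds(p_reference):.3f}")
print(f"group odds:     {odds(p_group):.3f}")
print(f"odds ratio:     {odds_ratio(p_group, p_reference):.3f}")
```

An odds ratio of 1 means the two groups have the same odds; values below 1 mean lower odds than the reference group, and values above 1 mean higher odds.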
