Assignment #2

PPol 604
Due: Thursday, 24 January 2013

Type up your answers. Give proper credit to those you work with and/or the text(s).

Solve the following problems. Show all of your work, but keep your answers concise. Highlight your final answer to distinguish it from your other numbers and text. Include a copy of your input (e.g. do file) or output (e.g. log file) when it is an appropriate way to show your work. However, do not include unnecessary output (i.e. no data dumps), and format any output so that it is easily readable. An appropriate time to include output is when you put your results in a table: if your results are wrong, the grader cannot tell how you reached your conclusions (and so cannot give partial credit) unless you provide some output. Explanations should be both statistical and substantive: explain so that a statistical layperson can understand, and so that a statistical analyst will see your erudition.

  1. {40} [from Golder 2010] For this problem use the data set found here. Use describe and summarize to learn about the variables in the data set. Present the models below in a table.
    a. Graph a distribution of the number of presidential veto overrides (nover). What are some characteristics of this distribution? From summary statistics, does there appear to be overdispersion?
    b. Estimate a Poisson model of presidential veto overrides (nover), using hmargin, smargin, congexpr, govexpr, reelect, and popvote as covariates. Interpret the coefficients of hmargin, smargin, congexpr.
    c. Compute and report the squared correlation between the actual values and fitted/predicted values of nover as an R-squared measure.
    d. Test for model misspecification. What do you conclude?
    e. Calculate the expected number of veto overrides for the following situation: hmargin = -9; smargin = 2; congexpr = 0; govexpr = 0; reelect = 0; popvote = 52. (Why were these values chosen?) Provide confidence intervals and interpret.
    f. Calculate the expected number of veto overrides for the same situation but where govexpr = 1. Calculate the change in expected counts between these two scenarios. Provide confidence intervals and interpret.
    g. Graph the relationship between the predicted number of veto overrides and the President's popular vote percentage in the last election, setting the other variables to their medians except for congexpr = 0. Interpret the graph.
    h. Graph the relationship between the predicted number of veto overrides and the President's popular vote percentage in the last election for Presidents with and without gubernatorial experience (i.e. two lines), setting the other variables to their medians except for congexpr = 0. Interpret the graph.
    i. Graph the relationship between the predicted probability of zero veto overrides and the President's popular vote percentage in the last election for Presidents with and without gubernatorial experience (i.e. two lines), setting the other variables to their medians except for congexpr = 0. Interpret the graph.
    j. Interpret the incidence rate ratio of govexpr.
    k. The number of presidential veto overrides that is possible is limited by the number of presidential vetoes (nveto) that there have been. Estimate the same model as before but take account of exposure. (You will have to drop an observation. Why?) Report the estimates in the table. Explain why we would want to consider exposure. Compare the results to the first model: What is the same and what is different? Based on these results, did we need to take account of exposure or not? How do you know?
    l. Create a variable that is the natural log of the number of presidential vetoes and include this new variable in the model (instead of using the exposure or offset option). Put the results in the third column of the table. Test to see whether we needed to take account of exposure. Interpret the coefficient on this new variable.
    m. Estimate the original specification with a Negative Binomial (NB2) model and put the results in the table. Do we have overdispersion? How do you know?
    n. Which is the best model? Why? Should we worry about sample size? Why or why not?
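
For reference on parts (a) and (c) of problem 1, the sketch below shows one way the two summary quantities could be computed by hand: the variance-to-mean ratio of a count variable (a ratio well above 1 is a rough, unconditional hint of overdispersion) and the squared correlation between actual and fitted values. The function names and the toy numbers are illustrative assumptions only; they are not the assignment data, and the sketch is not a substitute for your do file.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean of a count variable.

    Under a Poisson distribution the two are equal, so a ratio well
    above 1 is a rough (unconditional) sign of overdispersion.
    """
    return variance(counts) / mean(counts)


def squared_correlation(actual, fitted):
    """Squared Pearson correlation between actual and fitted values."""
    mean_a, mean_f = mean(actual), mean(fitted)
    cov = sum((a - mean_a) * (f - mean_f) for a, f in zip(actual, fitted))
    ss_a = sum((a - mean_a) ** 2 for a in actual)
    ss_f = sum((f - mean_f) ** 2 for f in fitted)
    return cov ** 2 / (ss_a * ss_f)


# Hypothetical toy values, NOT the veto-override data.
toy_counts = [0, 0, 1, 0, 2, 5, 0, 1, 12, 3]
toy_fitted = [0.5, 0.8, 1.2, 0.6, 2.0, 3.5, 0.4, 1.1, 7.0, 2.9]

if __name__ == "__main__":
    print("variance/mean ratio:", round(dispersion_ratio(toy_counts), 2))
    print("squared correlation:", round(squared_correlation(toy_counts, toy_fitted), 3))
```

The ratio is only a descriptive check on the raw counts; the formal question in part (m) concerns dispersion conditional on the covariates.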
  2. {15} [from Wooldridge 2009] The data set is found here.
    a. "Using OLS on the full sample, estimate a model for log(wage) using explanatory variables educ, abil, exper, nc, west, south, and urban. Report the estimated return to education and its standard error" and SER.
    b. "Now estimate the equation from part (a) using only people with educ < 16. What percentage of the sample is lost? Now what is the estimated return to a year of schooling? How does it (and its standard error and SER) compare with part (a)?"
    c. "Now drop all observations with wage ≥ 20, so that everyone remaining in the sample earns less than $20 an hour. Run the regression from part (a) and comment on the coefficient (and its standard error) on educ and SER. (Because the normal truncated regression model assumes that y is continuous, it does not matter in theory whether we drop observations with wage ≥ 20 or wage > 20. In practice, including this application, it can matter slightly because there are some people who earn exactly $20 per hour.)"
    d. "Using the sample in part (c), apply truncated regression [with the upper truncation point being log(20)]. Does truncated regression appear to recover the return to education (and its standard error) and SER from the full population, assuming the estimate from (a) is consistent? Explain. Interpret the coefficient on education."
  3. {30} [from Wooldridge 2009] The data set is found here. "These are telephone survey data attempting to elicit the demand for a (fictional) "ecologically friendly" apple. Each family was (randomly) presented with a set of prices for regular apples and the eco-labeled apples. They were asked how many pounds of each kind of apple they would buy." [Public Policy 611 note: This is a type of contingent valuation survey.]
    a. "Of the 660 families in the sample, how many report wanting none of the eco-labeled apples at the set price?"
    b. "Does the variable ecolbs seem to have a continuous distribution over strictly positive values? What implications does your answer have for the suitability of a Tobit model for ecolbs?"
    c. "Estimate a Tobit model for ecolbs with ecoprc, regprc, faminc, and hhsize as explanatory variables. Which variables are statistically significant?"
    d. "Are faminc and hhsize jointly significant?"
    e. "Are the signs on the coefficients on the price variables from part (c) what you expect? Explain."
    f. Test the hypothesis that the coefficient on ecoprc is the negative of the coefficient on regprc. Report your results.
    g. "Obtain the estimates of E(ecolbs|x) for all observations in the sample (i.e. fitted values). What are the smallest and largest fitted values?"
    h. "Compute and report the squared correlation between the actual values and fitted values of ecolbs. This is one possible R-squared measure."
    i. Predict the fitted value [E(ecolbs|x)] if all independent variables are set to their medians. Predict the probability [Pr(ecolbs > 0|x)] a family buys any ecologically friendly apples if all independent variables are set to their medians. Predict how many ecologically friendly apples a family buys, if they choose to buy such apples [E(ecolbs|ecolbs > 0, x)], if all independent variables are set to their medians. Interpret these predictions.
    j. Now, estimate a linear model for ecolbs using the same explanatory variables from part (c). How do the OLS estimates compare to the Tobit estimates? Using the OLS model, predict how many ecologically friendly apples a family will buy if all independent variables are set to their medians. Compare this result to the results in part (i).
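
For reference on part (i) of problem 3, the three predictions can be written out for the standard Tobit model censored at zero. The notation is an assumption of this note rather than part of the original problem: $\mathbf{x}\boldsymbol\beta$ is the linear index, $\sigma$ the standard deviation of the latent error, and $\phi$ and $\Phi$ the standard normal density and distribution function.

```latex
\begin{align*}
\Pr(\mathit{ecolbs} > 0 \mid \mathbf{x})
  &= \Phi\!\left(\frac{\mathbf{x}\boldsymbol\beta}{\sigma}\right), \\
E(\mathit{ecolbs} \mid \mathit{ecolbs} > 0, \mathbf{x})
  &= \mathbf{x}\boldsymbol\beta
     + \sigma\,\lambda\!\left(\frac{\mathbf{x}\boldsymbol\beta}{\sigma}\right),
  \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}, \\
E(\mathit{ecolbs} \mid \mathbf{x})
  &= \Phi\!\left(\frac{\mathbf{x}\boldsymbol\beta}{\sigma}\right)\mathbf{x}\boldsymbol\beta
     + \sigma\,\phi\!\left(\frac{\mathbf{x}\boldsymbol\beta}{\sigma}\right).
\end{align*}
```

The third line is the product of the first two, which is a useful check on your predictions.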
  4. {15} [roughly from Golder 2006] Use the data set here. These are data from the General Social Survey, from 1975 through 1989. We will model a respondent’s liberal-conservative self-identification, i.e. ideology. However, we might be worried that selection issues are at play since not everyone answers this question.
    a. Estimate a Heckman model (by MLE). For the outcome equation, regress conservative on partyid, age, education, male, and income. Use these same variables in the selection equation and add white. Why is that a good identifying variable (or not)? Using the results, is a selection model necessary or will OLS work as well? What variables affect ideology? What variables affect whether a respondent answers the question about ideology?
    b. What is the predicted ideology (conditional on a respondent stating it) when all independent variables are set to the median?
    c. What are the marginal (or substantive) effects of income and education on ideology (conditional on a respondent stating it)? What are the marginal (or substantive) effects of income and education on the probability of stating one's ideology? (For both of these, start with all independent variables set to the median.) Interpret your results.
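
For reference on problem 4, the standard Heckman selection model can be written as follows. The notation is an assumption of this note: $\mathbf{x}\boldsymbol\beta$ is the outcome index, $\mathbf{z}\boldsymbol\gamma$ the selection index, $s = 1$ indicates that ideology is observed, $\rho$ is the correlation between the two errors, $\sigma$ is the standard deviation of the outcome error, and the selection error has unit variance.

```latex
\begin{align*}
\Pr(s = 1 \mid \mathbf{z}) &= \Phi(\mathbf{z}\boldsymbol\gamma), \\
E(y \mid \mathbf{x}, \mathbf{z}, s = 1)
  &= \mathbf{x}\boldsymbol\beta + \rho\,\sigma\,\lambda(\mathbf{z}\boldsymbol\gamma),
  \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}.
\end{align*}
```

When $\rho = 0$ the correction term vanishes, so the conditional mean in the selected sample reduces to $\mathbf{x}\boldsymbol\beta$.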
