Assignment #1
Political Science 328
This assignment is due before 1:30 pm, Thursday, January 17, both in hard copy in the department dropbox (outside 745 SWKT) and uploaded to Learning Suite. Submit each part separately on Learning Suite, and submit the paper copy as four separate documents in the Political Science dropbox. Remember that no late assignments will be accepted.
Type your answers in a regular font (e.g. Times Roman 12). Display Stata .do files and .log files in Courier 8.
This assignment is divided into four parts. You must submit your answers to each part separately, as we will have a different TA grade each part. Make sure that your name, section number as well as the problem set and part number (e.g. Assignment 1, Part 1) are clearly listed on each part. Students who fail to do so may be penalized on the assignment.
If necessary, re-read the section in the syllabus on group work in Academic Honesty and Plagiarism (here) to make sure you are giving proper credit to those you work with and/or the text(s) you use for each problem. For each problem, state with whom you worked. If you did not work with anyone, state so explicitly. As a reminder, you are in violation of this course's policies as well as the Honor Code if you share electronic portions of your assignment with other people. That includes emailing other people code (even snippets of code), .do files, Word files, or anything else related to a problem set. Your assignment must represent your own work. Please work together: we encourage you to do so! But remember that when working together you should produce your own independent work product.
Solve the following problems. Show all of your work, but keep your answers concise. Include a copy of your input and output: your .do file and your .log file. However, do not include unnecessary output (i.e. no data dumps), and format any output so that it is easily readable. Convert Stata output (.log and .do files) to Courier 8 with single-spacing. For the third part, showing your work would include steps such as stating the equation you are using, where you got the equation, and the steps you took to obtain your answer. Explanation includes statistical and substantive explanation (explain so that a statistical layperson can understand it, and so that a statistical analyst will see your erudition). Highlight your answer.
- {25 points} Stata basics.
- What is a .do file used for? How do you create it? What does it mean to "run" a .do file? Create a .do file.
- Explain what the different colors of text mean in a .do file.
- What is a log file used for? How do you begin and end a log file? Create a .log file (in the .do file).
- What is the most important difference between a .do file and a log file?
- With a log file open, open the data editor and manually enter data into the first two columns. Enter the numbers 1-9 in the cells of column 1 and the numbers 10-18 in the cells of column 2. (Note: We can see you enter these by looking at your log file.)
- Note the difference between the data editor and the browser and how to toggle back and forth. What are the two different modes?
- Clear the data from your Stata center. (Put the command you use into your .do file.)
- Online, go to http://wps.pearsoned.com/aw_stock_ie_3/178/45691/11696965.cw/index.html. On the left of the page, click on "Data for Empirical Exercises and Test Bank". Click on "Data for Empirical Exercises and Test Bank (Updated Edition)". Download the Guns Data (Stata Dataset); that is, save the file to your computer. Open this dataset in Stata. Using that dataset, answer the following questions.
- How many observations are in this dataset? (Again, put the command you use into your .do file. Continue to do this in the subsequent questions.)
- Explain what the variable incarc_rate means. There are two ways to get this information. State the command that gives you this information (and put it in your .do file as well), and describe the other location where you can find it.
- In the data editor, reorder the variables so that stateid and year are the first and second variables, respectively, and the rest of the variables follow in any order. There are two ways to accomplish this. Write the command you can use (and put it in your .do file). Also describe how to manually order the variables.
- Sort the data by stateid and year (and put the command in your .do file). Open your data browser: how has the data changed? How is it now ordered?
- Describe how to find the mean of violent crime rates using the command and the drop-down menus. What is the command you would use? (Put this in your .do file.) What are the menus, in the order that you would click them, to find this information?
- What is the average violent crime rate in state 1? (Include the command you used in your .do file.) [Hint: Remember to use a condition in your command.]
- The variable shall is coded 1 if the state has a law allowing citizens to carry concealed weapons in that year and 0 if the state does not have this law. In 1993, what percentage of states had a shall carry law? (Once again, place the command in your .do file. And remember to think of conditional commands.)
- The variable pw1064 is the percent of the state population that is white, ages 10 to 64. Place a new name and label on this variable so any person viewing your dataset will know what this variable is. There are two ways of renaming and relabeling. Please provide the two commands below, and explain the process by which you would manually change this information.
- Give proper credit below to those you worked with and/or any texts. If you worked alone, state that explicitly.
- Remember to close your log (and place the command that closes your log in the .do file). Remember to include your .do file and .log file along with your answers. This is often done by cutting and pasting the file into a Word document (and changing the font to Courier 8).
- {25}
- 2.1 Stata Visualizations: Download the College Distance (Stata Dataset) from the Stock and Watson site, under Additional Empirical Exercises. Re-order the dataset, if you choose. Open a new .do and a new .log file. Remember to give proper credit to those you worked with and/or any texts. If you worked alone, state that explicitly.
- Create a histogram of test scores (bytest) for all students. Remember to measure by frequency. Paste the histogram into your answers (and write the command you used in your .do file). Cut the number of bins in half and run it again (and write the command in your .do file and include the histogram graphic in your answers). Which do you like better? Why?
- Create a kernel density plot of test scores (bytest) for all students. Include the graphic in your answers (and write the command you used in your .do file). Cut the bandwidth in half and run it again, writing the command below. Which do you like better? Why?
- Using the plots from the previous two questions, what can you conclude about the shape of the distribution? What is the range of test scores that occurred most frequently?
- Create two box plots of test scores, one for each gender. (Again, paste in the two graphics in your answers and write the commands you used in your .do file.) Are there any differences? Create another set of box plots of test scores, for black and non-black. (Paste; and write.) Are there any differences?
- Create a crosstab of black vs. urban. Copy and paste the crosstab output into your answers (use Courier 8) and include the command in your .do file. Is there any relationship? How might this affect the relationship in the previous part?
- Using the help function in Stata, provide 3 commands that you could use that involve the drop command. Please describe what each of the commands does.
- Run a scatterplot of test score and education. Which is the independent (explanatory) variable and which is the dependent (response) variable? [Hint: Check the documentation of the data set on the web site.] (Paste; and write.) Now jitter the data: Does it help? What does the relationship look like? (Paste; and write.)
- Run a lowess of test score and education. What does the relationship look like? (Paste; and write.) Cut the bandwidth in half and run it again. Does it help? (Paste; and write.)
- Run a graph matrix of test scores, county unemployment rate (cue80), and distance to college (in 10s of miles). What relationships do you see? (Paste; and write.)
- Run three separate lowess graphs on the three variables above, two at a time. For each graph, explain why you chose that variable as the dependent (response) variable. (That is, give the causal logic.) What relationships do you observe between the variables? (Paste three graphics; write commands in the .do file.)
- Run your entire .do file. To do this, clear your dataset from the Stata command center, then highlight all the text in the .do file and press Ctrl-d, or click the Execute button on the .do file toolbar (the button with the paper and the arrow, located on the far right). You will know that your syntax is correct if your .do file runs without error. Always check that your .do file executes properly so you have a good record of all your commands. A well-kept .do file will save you a lot of time.
- 2.2 Equations: Practice writing equations correctly by rewriting the following equations using the equation editor on Microsoft Word:
a. $\mu = \frac{\sum x_i}{N}$
b. $s^{2} = \dfrac{\sum_{i=1}^{N} (x_i - \bar{x})^{2}}{N-1}$
c. $Y_{i}= \beta_{0} + \beta_{1} X_{i} + u_{i}.$
d. $D_{i} = \dfrac{\sum (\hat{y}_i - \hat{y}_{j(i)})^{2}}{(k+1) {\hat{\sigma}}^2}$
- {30}
- 3.1 [from Stock & Watson] Frequencies given: We are doing this problem so that given a table of data, we can calculate the difference in the unemployment rate of college graduates and non-college graduates, or the difference in the fatality rate of bicycle accidents whether the bicyclist was wearing a helmet or not (and calculate how many bicyclists are wearing helmets), or the difference in how frequently democracies go to war compared to non-democracies. It is also to get us thinking about how one variable depends on another variable.
You are employed by a think-tank that wants to argue that college is irrelevant (they think too many people are going to college). To help answer the question, your organization collected the data in the Table below. This table presents the joint probability distribution by employment status and college graduation for a recent year. Based on the information in the table, can you find anything to support the idea that college is irrelevant to employment status? Follow the steps below to help figure this out. Remember to show your work for any calculations.
| | Unemployed (Y=0) | Employed (Y=1) | Total |
|---|---|---|---|
| Non-college grads (X=0) | 0.026 | 0.576 | 0.602 |
| College grads (X=1) | 0.009 | 0.389 | 0.398 |
| Total | 0.035 | 0.965 | 1.000 |
a. Compute E(Y).
b. The unemployment rate is the fraction of the labor force that is unemployed. Show that the unemployment rate is given by 1-E(Y).
c. Calculate E(Y|X=1) and E(Y|X=0).
d. Calculate the unemployment rate for college graduates and non-college graduates.
e. A randomly selected member of this population reports being unemployed. What is the probability that this worker is a college graduate? A non-college graduate?
f. Are educational achievement and unemployment status independent? That is, is college irrelevant to employment status? Use an equation, and explain.
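As a reminder, the standard definitions that parts a. through f. draw on can be written as follows. This is a general summary for discrete random variables, not a worked solution; the lowercase $x$ and $y$ simply denote the values that $X$ and $Y$ can take.

```latex
\begin{align*}
E(Y) &= \sum_{y} y \,\Pr(Y = y) \\
\Pr(Y = y \mid X = x) &= \frac{\Pr(X = x,\, Y = y)}{\Pr(X = x)} \\
E(Y \mid X = x) &= \sum_{y} y \,\Pr(Y = y \mid X = x) \\
X \text{ and } Y \text{ are independent} &\iff \Pr(X = x,\, Y = y) = \Pr(X = x)\,\Pr(Y = y) \text{ for all } x, y
\end{align*}
```

Because $Y$ takes only the values 0 and 1 here, $E(Y)$ reduces to $\Pr(Y = 1)$.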
- 3.2 Calculated frequencies: The following problem is often encountered in the case of a rare disease, say AIDS, when determining the probability of actually having the disease after testing positive for HIV. (This is often known as the accuracy of the test given that you have the disease.) Let us set up the problem as follows: Y = 0 if you tested negative using the ELISA test for HIV, Y = 1 if you tested positive; X = 1 if you have HIV, X = 0 if you do not have HIV. We will do this for the state of Utah: According to AidsVu, "In 2015, 116 of every 100,000 people were living with diagnosed HIV." In other words, 0.116 percent of the Utah population has HIV. (Note: NOT 11.6%, but 0.116%.) The ELISA test is 99.7 percent accurate when you have HIV, and 98.5 percent accurate when you do not have HIV. (The complements of these are the false negative and false positive rates, respectively.) For parts a.-d., round to the nearest integer. For part e., calculate to the 6th decimal place.
a. We will start with frequencies, rather than probabilities. Assume the population of Utah is 3,000,000 (which is pretty close to the Census estimate). Use this table as a guide, but make your own table. Calculate and fill in the totals in the far right column (the number of people with and without HIV).
| | Test Positive (Y=1) | Test Negative (Y=0) | Total |
|---|---|---|---|
| HIV (X=1) | | | |
| No HIV (X=0) | | | |
| Total | | | 3,000,000 |
b. Use the conditional probabilities to calculate and fill in the joint frequencies (middle four cells).
c. Calculate and fill in the marginal frequencies for testing positive and negative (bottom row, middle cells). Calculate the conditional probability of having HIV when you have tested positive. Explain this result.
d. Calculate the conditional probability of having HIV when you have tested negative.
e. Create a similar table using joint and marginal probabilities instead of frequencies. Which would you show to a statistical layperson?
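The frequency approach in parts a. through d. can be sketched in Python with made-up inputs. Every number below (population, prevalence, and the two accuracy rates) is a hypothetical value chosen for illustration and is deliberately not the Utah data; the variable names are likewise assumptions. It is a minimal sketch of the bookkeeping, not a solution to the problem.

```python
# Hypothetical inputs for illustration only (NOT the assignment's values).
population = 100_000
prevalence = 0.01    # share of the population with the condition
sensitivity = 0.90   # Pr(test positive | condition)
specificity = 0.95   # Pr(test negative | no condition)

# Row totals: how many people do and do not have the condition.
with_condition = round(population * prevalence)
without_condition = population - with_condition

# Joint frequencies: split each row using the conditional probabilities.
true_pos = round(with_condition * sensitivity)
false_neg = with_condition - true_pos
true_neg = round(without_condition * specificity)
false_pos = without_condition - true_neg

# Column totals: everyone who tested positive or negative.
total_pos = true_pos + false_pos
total_neg = false_neg + true_neg

# Conditional probabilities of having the condition given the test result.
p_condition_given_pos = true_pos / total_pos
p_condition_given_neg = false_neg / total_neg

print(f"Pr(condition | positive) = {p_condition_given_pos:.6f}")
print(f"Pr(condition | negative) = {p_condition_given_neg:.6f}")
```

Working in whole-number frequencies first, and dividing only at the end, keeps each cell of the table easy to check against its row and column totals.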
- {20} Data Analysis: Yale University political scientists Daniel Butler and David Broockman have recently published an article, "Do Politicians Racially Discriminate against Constituents? A Field Experiment on State Legislators." Their research consisted of sending a fictitious e-mail message to approximately 4,800 state legislators with a request for assistance in registering to vote. Some of the messages were sent using an apparently white name (Jake Mueller) and some under a stereotypical black name (DeShawn Jackson). They then waited to see whether or not legislators responded to these emails. They were interested in not only whether or not the different names (Jake vs. DeShawn) affected whether or not legislators responded, but also whether or not the party and ethnicity of the legislator also had an effect.
Download the replication dataset from Learning Suite (Butler_Broockman.dta). Record your answers to the following questions in a .log and .do file in Stata. (Remember to turn in both the .log file and the .do file for this portion of the assignment.)
a. Look at your data! After reading the rest of the problem, visualize your data for variables used below with tools learned in class and lab (and the earlier .do file). Include the graphics and/or tables you use in your answer.
b. Using Stata, how many legislators received the DeShawn treatment condition?
c. How many legislators received the Jake treatment condition?
d. What is the overall probability that a legislator responded to an email?
e. What is the probability that a legislator responded to an email, conditional on that legislator receiving an email from DeShawn?
f. What is the probability that a legislator responded to an email, conditional on that legislator receiving an email from Jake?
g. What is the probability that a legislator responded to an email, conditional on that legislator being a Democrat and receiving an email from DeShawn?
h. What is the probability that a legislator responded to an email, conditional on that legislator being a Republican and receiving an email from DeShawn?
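The idea behind parts d. through h. is that a conditional probability estimated from data is the share of responses within the subset that satisfies the condition. The Python sketch below illustrates this on a handful of invented records. The field names (`treatment`, `party`, `responded`), their values, and the helper `conditional_rate` are all hypothetical and are not taken from Butler_Broockman.dta.

```python
# Hypothetical toy records; field names and values are invented for illustration.
records = [
    {"treatment": "A", "party": "D", "responded": 1},
    {"treatment": "A", "party": "D", "responded": 0},
    {"treatment": "A", "party": "R", "responded": 1},
    {"treatment": "A", "party": "R", "responded": 1},
    {"treatment": "B", "party": "D", "responded": 0},
    {"treatment": "B", "party": "D", "responded": 1},
    {"treatment": "B", "party": "R", "responded": 0},
    {"treatment": "B", "party": "R", "responded": 0},
]


def conditional_rate(rows, **conditions):
    """Share of rows with responded == 1 among rows matching all conditions."""
    subset = [r for r in rows if all(r[k] == v for k, v in conditions.items())]
    if not subset:
        raise ValueError("no records match the conditions")
    return sum(r["responded"] for r in subset) / len(subset)


print(conditional_rate(records))                            # overall rate
print(conditional_rate(records, treatment="A"))             # one condition
print(conditional_rate(records, treatment="A", party="D"))  # two conditions
```

Each added condition narrows the subset before the mean of the 0/1 response indicator is taken, which is the same logic as restricting a command to the observations that meet a condition.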
- {1} Complete the Time Spent Survey. State your survey completion code at the top of your Part 4 packet (next to your name, section, etc.).