Understanding Spatial Regression Models from a Weighting Perspective in an Observational Study of the Impact of Superfund Cleanups on Birth Weight
This repository contains the code for the simulation and data application sections of the paper. The workflow is:
funcs.Rcontains all utility functions for the simulation and data application.simulationcontains the code for the simulation section of the paper. The simulation is fully reproducible.applicationcontains the code for the data application section of the paper.
We link publicly available U.S. Environmental Protection Agency information on 1,429 Superfund sites with high-resolution 1990 Census covariates and Medicaid birth outcomes from 2016–2018. We define a binary exposure indicating whether a site was remediated between 1991 and 2015. For each site and its 2-km buffer, we assemble eleven sociodemographic covariates, five indicators of contaminant types, and site scores. Exposure and confounders are publicly available on Harvard Dataverse at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EKSCCU. Birth outcomes (the percentages of small vulnerable newborns and congenital anomalies) are aggregated from zip code tabulation area-level Medicaid claims, yielding a final analytic sample of 1,079 Superfund sites. Medicaid claims data is stored on the Harvard Regulated Data Environment and is not publicly available.
- U.S. Census Bureau (1991), 1990 Decennial Census, Summary Tape Files 1 and 3A (STF1 and 3A).
- Jonathan Schroeder, David Van Riper, Steven Manson, Katherine Knowles, Tracy Kugler, Finn Roberts, and Steven Ruggles. IPUMS National Historical Geographic Information System: Version 20.0 [dataset]. Minneapolis, MN: IPUMS. 2025. http://doi.org/10.18128/D050.V20.0
- U.S. Environmental Protection Agency (2025). Superfund National Priorities List (NPL) Where you Live Map. ArcGIS Feature Service. https://epa.maps.arcgis.com/apps/webappviewer/index.html?id=33cebcdfdd1b4c3a8b51d416956c41f1
- U.S. Environmental Protection Agency (2025). Contaminant of Concern Data for Decision Documents by Media, FYs 1981-2024 (Final NPL, Deleted NPL, and Superfund Alternative Approach Sites), https://www.epa.gov/superfund/superfund-data-and-reports.
- U.S. Environmental Protection Agency, Office of Mission Support (2025). NPL Superfund Site Boundaries (EPA Public), https://www.arcgis.com/home/item.html?id=d6e1591d9a424f1fa6d95a02095a06d7.
- Centers for Medicare & Medicaid Services (2020). Medicaid and CHIP T-MSIS Analytic Inpatient file. Restricted data containing protected health information; not publicly accessible.
We are grateful to Shreya Nalluri, Amruta Nori-Sarma, Mary Willis, and Flannery Black-Ingersoll for their thoughtful discussions on Medicaid birth outcome data. The simulations in this paper were run on the FASRC Cannon cluster, supported by the FAS Division of Science Research Computing Group at Harvard University. The outcome analysis was run on Regulated Data (ReD) Environment, operated by Harvard University Information Technology and supported by the Office of the Vice Provost for Research at Harvard University.
The paper associated with this repository is available on arXiv: "Understanding Spatial Regression Models from a Weighting Perspective in an Observational Study of Superfund Remediation" (Woodward, Dominici, Zubizarreta 2025).
To reproduce the results in the paper, please follow these steps.
- Download publicly available data:
- Download 1990 census tract shapefiles and the following variables from the 1990 Decennial Census via NHGIS: total population, population of individuals identifying as Hispanic, Black, Indigenous, or Asian, median household income, median house value, percentage below the poverty line, tenure status, population with a high school diploma, and the median year of housing construction. See codebooks in
data/nhgis0002_csvfor specific variable names. Save all files in thedata/folder.
- Download 1990 census tract shapefiles and the following variables from the 1990 Decennial Census via NHGIS: total population, population of individuals identifying as Hispanic, Black, Indigenous, or Asian, median household income, median house value, percentage below the poverty line, tenure status, population with a high school diploma, and the median year of housing construction. See codebooks in
- Run
application/01_chemical_medium_scrape.Rto process chemical and medium information from the contaminants of concern file. - Run
application/02_preprocessing.Rto web scrape remediation dates and site scores from EPA Superfund site profiles, extract Superfund site boundary information from EPA ArcGIS, and merge with 1990 Census tract covariates and the chemical/medium information from step 1. After area weighting, this script produces the Superfund site-level dataset of binary exposure and covariates calledbufferswhich can be accessed by executingload('data/preprocessed_superfunds.RData'). - Run
application/03_diagnostics.Rto produce Figures 1, 2, 7, and 8. - Run
application/04_balanace_gamma_diagnostics.Rto produce Figure 6. - Run
application/05_random_fixed_effects.Rto produce Figure 5. - Run
application/06_outcome_analysis.Rto produce Figures 3, 4, and 12 as well as Tables 1, 3, 5, 6, and 7. - Run
simulation/01_simulation_preprocessing.Rto preprocess the data needed for the simulation study and generate Figure 10. - Execute
sbatch submit_jobs.shto run the simulation analyses on a computing cluster. This will generate results as CSV files in theresults_Oct30/folder. - Run
simulation/03_analysis.Rto produce Table 2 and Figure 9.
If you have any questions, please contact us at swoodward@g.harvard.edu.