CDCgov/covid_case_privacy_review

COVID-19 Case Privacy Review

This project contains the procedures used by the Data, Analytics, Visualization Task Force to review and verify that data sets include privacy protection controls and meet the k-anonymity and l-diversity thresholds established by the COVID-19 response.

Issues, questions, problems, suggestions

If you have any of the above, please submit an issue on GitHub.

Requires

Detailed dependencies are in renv.lock; the main dependencies are:

  • R version >= 4.0.3 (versions >= 3.5 may work but have not been tested)
  • sdcMicro version >= 5.5.1
  • arrow version >= 2.0.0
  • renv >= 0.12.3
  • optional for profiling
  • optional for rmarkdown

Install & run procedures

renv::restore() to install necessary packages

Then place or symlink a data file into the data/raw folder, update the review script with the name of the data file, and run the script from the R folder of this project. The script prints output to the console and creates a privacy report in the reports folder.

renv::snapshot() to update renv.lock with any changes made to dependencies

Description

These scripts are part of a two-step process for implementing statistical disclosure control. They do not perform suppression or modify any data. They are a separate step that validates that the data file produced by the data generation pipeline (implemented in HHSProtect as a Palantir code repository) meets all the privacy protection requirements.

Data files are not contained in this repo and must be fetched independently and placed into the data/raw folder for review. The public use files can be retrieved from Data.CDC.gov.

K-anonymity is a technique for releasing person-specific data such that the ability to link to other information using the quasi-identifiers is limited. Each person contained in the released data cannot be distinguished from at least k-1 other persons who share the same quasi-identifiers. For example, if a dataset is 5-anonymous, the smallest group of records that share the same quasi-identifier values contains at least 5 records.
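The definition above can be illustrated with a short sketch. This is a language-neutral illustration in Python, not the project's R code (the review scripts rely on sdcMicro). The function name `k_anonymity_violations` and the sample records are hypothetical, and the sketch treats every value as an ordinary category, which may differ from how sdcMicro handles NA values.

```python
from collections import Counter

def k_anonymity_violations(records, quasi_identifiers, k):
    """Return each quasi-identifier combination shared by fewer than k records."""
    # Count how many records fall into each combination of quasi-identifier values.
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical sample: one group of 5 records and one group of 2 records.
records = (
    [{"age_group": "0 - 9 Years", "sex": "Female"}] * 5
    + [{"age_group": "80+ Years", "sex": "Male"}] * 2
)

violations = k_anonymity_violations(records, ["age_group", "sex"], k=5)
print(violations)  # {('80+ Years', 'Male'): 2}
```

An empty result means the sample is k-anonymous for the chosen quasi-identifiers. Here the group of 2 records falls below k=5, so it is reported.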

While k-anonymity reduces the risk of reidentification, l-diversity protects privacy by limiting the ability to find confidential information about an individual. This technique extends k-anonymity to protect confidential information within a release so that confidential values cannot be attributed to groups of individuals that share quasi-identifiers. For example, if a dataset has 2-diversity, every group of records that share the same quasi-identifier values contains at least 2 distinct values of the confidential variable, so no group reveals a single confidential value.
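A minimal sketch of the distinct l-diversity idea described above, again in Python for illustration only and not taken from the project's R scripts. The function name `l_diversity_violations`, the parameter `l_threshold`, and the sample records are hypothetical. sdcMicro offers other l-diversity variants that this sketch does not cover.

```python
from collections import defaultdict

def l_diversity_violations(records, quasi_identifiers, confidential, l_threshold):
    """Return groups whose confidential variable has fewer than l_threshold distinct values."""
    # Collect the set of distinct confidential values seen in each quasi-identifier group.
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[confidential])
    return {combo: len(vals) for combo, vals in groups.items() if len(vals) < l_threshold}

# Hypothetical sample: group A has two distinct dates, group B has only one.
records = [
    {"age_group": "A", "sex": "Female", "pos_spec_dt": "2020-04-01"},
    {"age_group": "A", "sex": "Female", "pos_spec_dt": "2020-04-02"},
    {"age_group": "B", "sex": "Male", "pos_spec_dt": "2020-05-10"},
    {"age_group": "B", "sex": "Male", "pos_spec_dt": "2020-05-10"},
]

violations = l_diversity_violations(
    records, ["age_group", "sex"], "pos_spec_dt", l_threshold=2
)
print(violations)  # {('B', 'Male'): 1}
```

Group B is reported because anyone known to be in that group can be assigned the single shared date.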

Thresholds are established during the privacy review procedures by subject matter experts, statisticians, informaticians, and public health scientists familiar with the case surveillance data. We evaluate the dataset directly as well as other data sets to which these data could potentially be linked. Our goal is to reduce the risk of reidentification while providing useful data for researchers and the public to use to protect America's health.

These thresholds are applied to the quasi-identifiers and confidential variables and do not affect non-confidential variables. Records are never deleted; instead, individual fields are changed from their value to NA. For quasi-identifiers, changed fields can be distinguished from missing values: missing values have been recoded to the string literal "Missing", so the only NA values in those fields are changed ones. For confidential variables, both missing values and changed fields are NA, which protects the confidential values.
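The distinction between changed and missing quasi-identifier values can be sketched as follows. This is an illustrative Python example, not project code. It assumes NA is represented as `None` after loading, and the names `classify` and `field_summary` are hypothetical.

```python
from collections import Counter

def classify(value):
    """Classify one quasi-identifier value by the convention described above."""
    if value is None:           # NA: the field was changed by the privacy procedures
        return "suppressed"
    if value == "Missing":      # string literal used for originally missing values
        return "missing"
    return "reported"

def field_summary(records, field):
    """Count suppressed, missing, and reported values for one quasi-identifier field."""
    return Counter(classify(r[field]) for r in records)

# Hypothetical sample records for a single quasi-identifier.
records = [{"sex": "Female"}, {"sex": "Missing"}, {"sex": None}, {"sex": None}]

summary = field_summary(records, "sex")
print(summary["suppressed"], summary["missing"], summary["reported"])  # 2 1 1
```

The same classification does not apply to confidential variables, where NA covers both cases.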

We are working to improve these privacy procedures over time and welcome feedback and improvements submitted to this project as issues or pull requests.

Data file characteristics - "COVID-19 Case Surveillance Public Use Data"

Checks its 12 variables for...

Quasi-identifiers (3)

Checked for k=5

  • age_group
  • sex
  • race_ethnicity_combined

Confidential attributes (1)

  • pos_spec_dt

Data file characteristics - "COVID-19 Case Surveillance Public Use Data with Geography"

Checks its 19 variables for...

Quasi-identifiers (8)

Checked for k=1000

  • res_state
  • res_county

Checked for k=11

  • case_month
  • res_state
  • res_county
  • age_group
  • sex
  • race
  • ethnicity
  • death_yn

Population level and geography specific checks

  • Checking county population: res_county should never be populated if the county population (by FIPS code) is under 20k.
  • Checking that sex, race, and ethnicity demographic values are never populated if the county subpopulation for those demographics is under 220 (k*20, with k=11).
  • Checking that the case count by sex, race, and ethnicity in a county is never higher than 50% of the county subpopulation for those demographics.
  • Checking that there is never a situation where only a single county within a state is suppressed, allowing the state to be deduced by process of elimination.
  • Checks that if res_state is suppressed, then res_county should also be suppressed.
  • FIPS code fields are associated with quasi-identifiers so they are checked to make sure that they are always suppressed when corresponding fields are suppressed.
    • state_fips_code, suppressed when res_state is suppressed
    • county_fips_code, suppressed when res_state is suppressed
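Several of the checks in this list can be sketched in a few lines. The following Python is a hedged illustration of the rules as stated above, not the logic of review_public_geo.R. All function names, the dictionary layouts, the FIPS-like codes, and the population figures are hypothetical, and NA is assumed to be represented as `None`.

```python
LOW_POPULATION = 20_000  # county population floor from the first check
K = 11                   # k used for the full quasi-identifier set

def low_population_violations(records, county_population):
    """Records whose res_county is populated although the county has under 20k residents."""
    # Counties absent from the lookup are treated as violations in this sketch.
    return [
        r for r in records
        if r["res_county"] is not None
        and county_population.get(r["county_fips_code"], 0) < LOW_POPULATION
    ]

def linked_suppression_violations(records):
    """Records where res_state is suppressed but a dependent field is still populated."""
    dependents = ("res_county", "state_fips_code", "county_fips_code")
    return [
        r for r in records
        if r["res_state"] is None and any(r[d] is not None for d in dependents)
    ]

def subpopulation_violations(case_counts, subpopulation):
    """Both dicts are keyed by (county_fips_code, sex, race, ethnicity)."""
    # Subpopulation under k*20 (220) should not have populated demographic values.
    too_small = [key for key in case_counts if subpopulation[key] < K * 20]
    # Case count must never exceed 50% of the matching subpopulation.
    too_many = [key for key, n in case_counts.items() if n > 0.5 * subpopulation[key]]
    return too_small, too_many

# Hypothetical records and populations.
records = [
    {"res_state": "AA", "res_county": "BIG", "state_fips_code": "00", "county_fips_code": "00001"},
    {"res_state": "AA", "res_county": "SMALL", "state_fips_code": "00", "county_fips_code": "00002"},
    {"res_state": None, "res_county": None, "state_fips_code": None, "county_fips_code": "00001"},
]
county_population = {"00001": 1_000_000, "00002": 1_500}

low_pop = low_population_violations(records, county_population)
linked = linked_suppression_violations(records)

group_a = ("00001", "Female", "White", "Non-Hispanic/Latino")
group_b = ("00002", "Female", "White", "Non-Hispanic/Latino")
too_small, too_many = subpopulation_violations(
    {group_a: 150, group_b: 120}, {group_a: 5_000, group_b: 200}
)
print(len(low_pop), len(linked), too_small == [group_b], too_many == [group_b])
```

Each function returns the offending records or groups, so an empty result corresponds to zero violations for that check.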

Confidential attributes

No confidential attributes are in this dataset.

Interpreting output

This script uses the sdcMicro package, so much of the output is generated by that package. The specific lines to look for are linked variable violations ( 0 ), k-anon violations ( 0 ), and < 0 > l-diversity violations. If any violations are found, the file is not ready for publication; notify the data team so they can fix the data pipeline.

For the geography checks, there are multiple steps, so output should be reviewed to confirm that all steps have completed without identifying any violations that require correction prior to publishing:

  • linked variable violations ( 0 )
  • k-anon violations ( 0 ) for k=( 1000 ) and quasi-identifiers ( res_state res_county )
  • k-anon violations ( 0 ) for k=( 11 ) and quasi-identifiers ( case_month res_state res_county age_group sex race ethnicity death_yn )
  • Low population county violations ( 0 )
  • Subpopulation county violations, part 1 checking subpopulation for counties ( 0 )
  • Subpopulation county violations, part 2, checking to make sure there aren't any res_county that aren't NA but have subpops ( 0 )
  • Subpopulation population too small for cases ( 0 )
  • County/state complementary violations ( 0 )
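Reviewing these summary lines by eye can be supplemented with a small scan for non-zero counts. The sketch below is illustrative Python, not part of the project. It assumes the violation count is the first bracketed integer on each summary line, as in the examples above, and the names `violation_count` and `failing_lines` are hypothetical.

```python
import re

COUNT_IN_PARENS = re.compile(r"\(\s*(\d+)\s*\)")  # matches "( 0 )"
COUNT_IN_ANGLES = re.compile(r"<\s*(\d+)\s*>")    # matches "< 0 >"

def violation_count(line):
    """Return the first bracketed integer on a summary line, or None if there is none."""
    match = COUNT_IN_ANGLES.search(line) or COUNT_IN_PARENS.search(line)
    return int(match.group(1)) if match else None

def failing_lines(output_lines):
    """Return the summary lines that report a non-zero violation count."""
    return [line for line in output_lines if violation_count(line) not in (None, 0)]

# Hypothetical console output with one failing check.
output = [
    "linked variable violations ( 0 )",
    "k-anon violations ( 0 ) for k=( 1000 ) and quasi-identifiers ( res_state res_county )",
    "Low population county violations ( 3 )",
    "< 0 > l-diversity violations",
]

failures = failing_lines(output)
print(failures)  # ['Low population county violations ( 3 )']
```

Lines without a bracketed integer are ignored, so the scan only flags summary lines and does not replace reading the full output.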

For convenience, a portion of this output is stored in reports/log.md to compare results on previous versions of the dataset.

Analysis and Visualization files

Helper files

  • county_pop_demo_for_verify.csv, a utility dataset generated from the 2019 census estimates that contains population counts for each county by sex, race, and ethnicity. It is based on a shared utility dataset made by HHSProtect and formatted to be easier to use by the verification script.
  • profile_data.R that uses the DataExplorer package to create a profile report that is helpful for understanding and debugging the dataset. If you run it, it will output a new profile to the reports folder.
  • parquet2csv.R that uses the Arrow package to read the dataset in parquet format and output an equivalent CSV file.

Structure

These files and folders are meant to help organize and make it easier for others to understand and contribute.

├── analysis                                <- analysis and visualization files
│   ├── utility_summary_public.Rmd          <- utility summary analysis for public dataset
│   ├── utility_summary_public.pdf          <- generated report from utility summary analysis
│   ├── utility_summary_public_geo.Rmd      <- utility summary analysis for public geo dataset
│   └── utility_summary_public_geo.pdf      <- generated report from utility summary analysis
├── R                                       <- R scripts
│   ├── functions.R                         <- functions that are reused in other scripts
│   ├── parquet2csv.R                       <- converts a parquet file to csv
│   ├── profile_data.R                      <- creates a data profile report for exploratory data analysis
│   ├── renv.lock                           <- dependency file (generated by renv::snapshot())
│   ├── review_public.R                     <- script to review public12 data file
│   └── review_public_geo.R                 <- script to review public19 data file
├── data                                    <- data files used by project
│   └── raw                                 <- raw files, original, immutable data dump
│       └── county_pop_demo_for_verify.csv  <- census county populations by demo
├── output                                  <- output files
├── readme.md                               <- Description of project, instructions for how to run
└── reports                                 <- Generated reports and visualizations
    └── log.md                              <- logged results from reviewed files

TODOs

  • Investigate using dlookr instead of DataExplorer.

Public Domain Standard Notice

This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.

License Standard Notice

The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.

This source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.

This source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.

You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html

The source code forked from other open source projects will inherit its license.

Privacy Standard Notice

This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.

Contributing Standard Notice

Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.

All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.

Records Management Standard Notice

This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.

Additional Standard Notices

Please refer to CDC's Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.