Why De-identification Matters
De-identified health information is not PHI and therefore is not subject to HIPAA's use and disclosure restrictions. Organizations can use and share de-identified data for research, analytics, quality improvement, public health reporting, and other purposes without individual authorization, without minimum necessary limitations, and without most other HIPAA constraints. This makes proper de-identification a powerful tool for unlocking the value of health data while protecting individual privacy.
HIPAA provides two methods for achieving de-identification. Both must be applied rigorously: partial de-identification that fails to meet either standard does not produce de-identified data under HIPAA.
Method 1: The Safe Harbor Method
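The Safe Harbor method requires removal of all 18 categories of identifiers specified in the HIPAA Privacy Rule, whether they relate to the individual or to the individual's relatives, employers, or household members: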
- Names
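- Geographic subdivisions smaller than a state, including street address, city, county, precinct, and zip code (the first three digits of a zip code may be kept if the area formed by all zip codes sharing those three digits contains more than 20,000 people; otherwise they are changed to 000)
- All elements of dates (except year) directly related to an individual, such as birth, admission, discharge, and death dates; and all ages over 89, along with any date elements (including year) indicative of such an age, although these ages may be grouped into a single category of 90 or older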
- Telephone numbers
- Fax numbers
- Email addresses
- Social Security numbers
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate/license numbers
- Vehicle identifiers and serial numbers, including license plate numbers
- Device identifiers and serial numbers
- Web URLs
- IP addresses
- Biometric identifiers
- Full-face photos and comparable images
- Any other unique identifying number, characteristic, or code
Additionally, the covered entity must have no actual knowledge that the remaining information could be used alone or in combination with other information to identify the individual. The Safe Harbor method is straightforward to apply but may remove more data than necessary, potentially limiting the analytical value of the resulting dataset.
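Several of the Safe Harbor rules above translate into simple field-level transformations. The following is a minimal Python sketch, not a complete or validated implementation. It assumes a hypothetical record layout (field names such as `zip`, `age`, and `admit_date` are invented for illustration) and uses a placeholder set of low-population three-digit zip prefixes, which in practice would be derived from current Census data. It covers only a handful of the 18 categories.

```python
from datetime import date

# Placeholder values: 3-digit zip prefixes assumed to cover 20,000 or fewer
# people. This is NOT an authoritative list.
LOW_POPULATION_ZIP3 = {"036", "059", "102"}

# Hypothetical column names for direct identifiers that are dropped outright.
DIRECT_IDENTIFIER_FIELDS = {"name", "phone", "email", "ssn", "mrn"}


def generalize_zip(zip_code: str) -> str:
    """Keep the first three digits, or return '000' for low-population prefixes."""
    prefix = zip_code[:3]
    return "000" if prefix in LOW_POPULATION_ZIP3 else prefix


def generalize_age(age: int) -> str:
    """Collapse ages of 90 and above into a single '90+' category."""
    return "90+" if age >= 90 else str(age)


def safe_harbor_record(record: dict) -> dict:
    """Return a copy of the record with the illustrative rules applied."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIER_FIELDS}
    out["zip"] = generalize_zip(record["zip"])
    out["age"] = generalize_age(record["age"])
    # Only the year of the admission date is retained.
    out["admit_year"] = out.pop("admit_date").year
    return out


# Fictional example record.
example = {
    "name": "Jane Example",
    "ssn": "000-00-0000",
    "zip": "03601",
    "age": 93,
    "admit_date": date(2021, 3, 14),
    "diagnosis": "J45",
}
cleaned = safe_harbor_record(example)
```

In this sketch the name and Social Security number are dropped, the zip code is reduced to a three-digit value, the age is top-coded, and the admission date is reduced to its year. A real pipeline would also need to handle free-text fields, the remaining identifier categories, and the actual-knowledge requirement, none of which can be satisfied by column rules alone.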
Method 2: Expert Determination
The Expert Determination method allows a qualified statistical or scientific expert to certify that the risk of re-identifying any individual in the dataset is very small. The expert must apply generally accepted statistical and scientific principles to analyze the data, and must document the methods and results of the analysis. The covered entity must retain the expert's documentation.
Expert determination can produce more analytically useful datasets than safe harbor because it allows retention of information — such as less-common zip codes or specific date ranges — that safe harbor would require removing, as long as the expert certifies that re-identification risk remains very small. The trade-off is the cost and complexity of engaging a qualified expert and the need to re-analyze the data if its composition changes.
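HIPAA does not prescribe a specific statistical technique for expert determination; the choice of method belongs to the expert. As one illustration of the kind of measurement involved, the sketch below computes k-anonymity, a commonly used metric: the size of the smallest group of records that share the same values for a chosen set of quasi-identifiers. The field names and sample records are hypothetical, and a low or high k here is not in itself a HIPAA determination.

```python
from collections import Counter


def class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of indistinguishable records."""
    return min(class_sizes(records, quasi_identifiers).values())


# Fictional sample data.
records = [
    {"zip3": "021", "birth_year": 1950, "sex": "F"},
    {"zip3": "021", "birth_year": 1950, "sex": "F"},
    {"zip3": "021", "birth_year": 1982, "sex": "M"},
    {"zip3": "021", "birth_year": 1982, "sex": "M"},
    {"zip3": "945", "birth_year": 1982, "sex": "M"},
]

# The lone "945" record forms a group of one, so k is 1 with all three fields.
k_with_zip = k_anonymity(records, ["zip3", "birth_year", "sex"])
# Dropping the zip prefix merges that record into a larger group.
k_without_zip = k_anonymity(records, ["birth_year", "sex"])
```

Comparing the two values shows the trade-off described above: retaining a field can make a dataset more useful while shrinking the smallest group, which is the kind of effect an expert would need to quantify and document.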
Re-identification Risk
De-identification does not guarantee permanent anonymity. Research has repeatedly demonstrated that de-identified datasets can sometimes be re-identified by linking them with other publicly available data sources. Latanya Sweeney's research estimated that 87% of Americans could be uniquely identified using just three data points: five-digit zip code, full date of birth, and sex. Safe Harbor breaks this combination by truncating zip codes to three digits and reducing dates to the year; sex itself is not one of the 18 identifiers and may remain in the dataset.
Organizations using de-identified data should implement policies against re-identification, restrict access to both the original dataset and the de-identified dataset, and re-evaluate re-identification risk whenever the composition of the dataset or the availability of external linking data changes. Information that is re-identified becomes PHI again. A covered entity that re-identifies de-identified data, or that allows others to do so, may also find that the information was never truly de-identified under HIPAA if re-identification was reasonably foreseeable.
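The linkage risk described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reproduction of any published study: all records are fictional, the field names are hypothetical, and the "public" list stands in for any external source that pairs names with zip code, birthdate, and sex.

```python
from collections import defaultdict


def link_records(health_rows, public_rows, keys):
    """Map each health row index to a name when exactly one public row matches."""
    index = defaultdict(list)
    for row in public_rows:
        index[tuple(row[k] for k in keys)].append(row["name"])
    matches = {}
    for i, row in enumerate(health_rows):
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches[i] = candidates[0]
    return matches


# Fictional "de-identified" health data that still carries full quasi-identifiers.
health = [
    {"zip": "02138", "birthdate": "1950-07-31", "sex": "F", "diagnosis": "I10"},
    {"zip": "02139", "birthdate": "1982-01-15", "sex": "M", "diagnosis": "E11"},
]

# Fictional public list with names attached.
public = [
    {"name": "Resident A", "zip": "02138", "birthdate": "1950-07-31", "sex": "F"},
    {"name": "Resident B", "zip": "02139", "birthdate": "1982-01-15", "sex": "M"},
    {"name": "Resident C", "zip": "02139", "birthdate": "1982-01-15", "sex": "M"},
]

linked = link_records(health, public, ["zip", "birthdate", "sex"])
```

Here the first health record matches exactly one named person and is re-identified, while the second matches two people and stays ambiguous. Coarsening the join fields (for example, to a three-digit zip prefix and birth year) enlarges the candidate groups, which is why generalization reduces this kind of risk.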
Limited Data Sets
Between fully identified PHI and fully de-identified data lies the "limited data set": a dataset from which specified direct identifiers (such as names, street addresses, contact details, and Social Security numbers) have been removed but which may still contain some geographic information (city, state, zip code) and dates. Limited data sets may be used for research, public health, and healthcare operations purposes under a data use agreement (DUA), which restricts use and prohibits re-identification. Limited data sets are still PHI and still require a DUA. Individual authorization is not required for these uses, although other HIPAA obligations continue to apply.