Data De-identification

De-identification is a tool that organizations can use to remove personal information from data that they collect, use, archive, and share with other organizations.

De-identification is not a single technique, but a collection of approaches, algorithms, and tools that can be applied to different kinds of data with differing levels of effectiveness. In general, privacy protection improves as more aggressive de-identification techniques are employed, but less utility remains in the resulting dataset.

De-identification is especially important for government agencies, businesses, and other organizations that seek to make data available to outsiders. For example, significant medical research resulting in societal benefit is made possible by the sharing of de-identified patient information under the framework established by the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, the primary US regulation providing for privacy of medical records.

As long as any utility remains in the data derived from personal information, there also exists the possibility, however remote, that some information might be linked back to the original individuals on whom the data are based. When de-identified data can be re-identified, the privacy protection provided by de-identification is lost. The decision of whether and how to de-identify data should thus be made together with decisions about how the de-identified data will be used, shared, or released, since the risk of re-identification can be difficult to estimate.

Risks to individuals can remain in de-identified data. These risks include allowing inferences about individuals in the data without re-identification, and impacts on groups represented in the data.

The HIPAA Privacy Rule states that once data have been de-identified, covered entities can use or disclose them without limitation. The information is no longer considered protected health information (PHI) and does not fall under the same regulations and restrictions as PHI. It is important to note that UMass Chan is NOT a covered entity and therefore cannot disclose de-identified data.

To be categorized as de-identified, a data set cannot include the following direct identifiers of the individual or of the individual's relatives, employers, or household members:

Direct Identifiers:

  • Names
  • All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census:
    (1) The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and
    (2) The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000
  • All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
  • Telephone numbers
  • Vehicle identifiers and serial numbers, including license plate numbers
  • Fax numbers
  • Device identifiers and serial numbers
  • Email addresses
  • Web Universal Resource Locators (URLs)
  • Social security numbers
  • Internet Protocol (IP) addresses
  • Medical record numbers
  • Biometric identifiers, including finger and voice prints
  • Health plan beneficiary numbers
  • Full-face photographs and any comparable images
  • Account numbers
  • *Any other unique identifying number, characteristic, or code
  • Certificate/license numbers

In addition to removing the identifiers above, the covered entity must not have actual knowledge that the remaining information could be used alone or in combination with other information to identify an individual who is a subject of the information.
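
The ZIP code, date, and age rules in the list above can be illustrated with a short example. The following is a minimal Python sketch, not a compliance tool. All function names are hypothetical, and `LOW_POPULATION_ZIP3` contains placeholder values standing in for the three-digit prefixes that current Census Bureau data show to cover 20,000 or fewer people.

```python
from datetime import date

# Placeholder values only (an assumption for illustration). A real list must be
# built from the current publicly available Census Bureau data.
LOW_POPULATION_ZIP3 = {"036", "059"}


def generalize_zip(zip_code, low_population_zip3=LOW_POPULATION_ZIP3):
    """Keep only the first three ZIP digits, or "000" for low-population areas."""
    zip3 = zip_code.strip()[:3]
    return "000" if zip3 in low_population_zip3 else zip3


def generalize_event_date(event_date):
    """Reduce a date related to an individual (admission, discharge, ...) to its year."""
    return event_date.year


def generalize_age(age):
    """Aggregate all ages over 89 into a single category."""
    return "90 or older" if age > 89 else age


def age_on(birth_date, as_of):
    """Age in completed years on the given day."""
    years = as_of.year - birth_date.year
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def generalize_birth_date(birth_date, as_of):
    """Return the birth year, or None when the year itself would indicate an age over 89."""
    return None if age_on(birth_date, as_of) > 89 else birth_date.year
```

For example, `generalize_zip("01655")` keeps only `"016"`, while any ZIP code whose prefix appears in the low-population set is reported as `"000"`. Suppressing the birth year for the oldest individuals is one conservative reading of the date rule; aggregating them into a "90 or older" category is the alternative the rule explicitly allows.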

* Any other unique identifying number, characteristic, or code

Identifying Number:
There are many potential identifying numbers. For example, the preamble to the Privacy Rule at 65 FR 82462, 82712 (Dec. 28, 2000) noted that “Clinical trial record numbers are included in the general category of ‘any other unique identifying number, characteristic, or code.’”

Identifying Code:
A code corresponds to a value that is derived from a non-secure encoding mechanism. For instance, a code derived from a secure hash function without a secret key (e.g., “salt”) would be considered an identifying element. This is because the resulting value would be susceptible to compromise by the recipient of such data. As another example, an increasing number of electronic medical record and electronic prescribing systems assign and embed barcodes into patient records and their medications. These barcodes are often designed to be unique for each patient, or event in a patient’s record, and thus can be easily applied for tracking purposes. See the discussion of re-identification.
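
To show why an unkeyed hash is susceptible to compromise, the sketch below contrasts it with a keyed hash (HMAC) using Python's standard `hashlib` and `hmac` modules. The record-number format, the function names, and the brute-force helper are assumptions made for illustration. The sketch does not establish that a keyed hash is sufficient for any regulatory purpose.

```python
import hashlib
import hmac
import secrets


def unkeyed_code(record_number):
    """SHA-256 of the record number with no secret: anyone can recompute it."""
    return hashlib.sha256(record_number.encode("utf-8")).hexdigest()


def keyed_code(record_number, key):
    """HMAC-SHA-256 of the record number under a secret key held by the data owner."""
    return hmac.new(key, record_number.encode("utf-8"), hashlib.sha256).hexdigest()


def guess_record_number(code, candidates):
    """What a data recipient could do: hash every plausible value and compare."""
    for candidate in candidates:
        if unkeyed_code(candidate) == code:
            return candidate
    return None


# Hypothetical 7-digit record numbers; the space of candidates is small enough to enumerate.
candidates = [f"{n:07d}" for n in range(10_000)]
leaked = unkeyed_code("0004217")
recovered = guess_record_number(leaked, candidates)  # recovers "0004217"

# With a secret key the recipient cannot recompute the codes without that key.
secret_key = secrets.token_bytes(32)
protected = keyed_code("0004217", secret_key)
```

The unkeyed code falls to simple enumeration because identifiers such as record numbers come from a small, guessable space. The keyed variant only helps while the key stays secret and is never shared with the data recipient.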

Identifying Characteristic:
A characteristic may be anything that distinguishes an individual and allows for identification. For example, a unique identifying characteristic could be the occupation of a patient, if it was listed in a record as “current President of State University.”


Indirect Identifiers

In addition to the direct identifiers listed above, indirect identifiers may still remain. Indirect identifiers are data elements that may make it possible to identify an individual deductively. Examples of indirect identifiers include:

  • Gender
  • Race
  • Ethnicity
  • Religion
  • Age
  • Marital Status
  • Household Composition
  • Number of Children
  • Place of Birth
  • Education
  • Major
  • Income
  • Job Title
  • Place of Work
  • Medical Condition
  • Dates (of graduation, arrest, marriage)
  • Uncommon Characteristics
  • Direct Identifiers of Household Members

It is important to note that the more indirect identifiers investigators collect, the higher the risk of re-identification. In addition, the investigator must keep in mind that publicly available information about the subjects, unconnected to the data collected for the specific research in question, may cumulatively be used to re-identify them.
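
One rough way to see how indirect identifiers accumulate risk is to count how many records become unique as more of them are combined. The sketch below uses made-up records and a hypothetical helper, `count_unique_records`. It illustrates the idea only and is not a formal re-identification risk assessment.

```python
from collections import Counter


def count_unique_records(records, indirect_identifiers):
    """Count records whose combination of the given fields appears exactly once."""
    combos = [tuple(r[field] for field in indirect_identifiers) for r in records]
    frequencies = Counter(combos)
    return sum(1 for combo in combos if frequencies[combo] == 1)


# Fabricated example data, for illustration only.
records = [
    {"gender": "F", "age": 34, "job_title": "Nurse"},
    {"gender": "F", "age": 34, "job_title": "Teacher"},
    {"gender": "M", "age": 34, "job_title": "Nurse"},
    {"gender": "M", "age": 34, "job_title": "Nurse"},
    {"gender": "F", "age": 51, "job_title": "Nurse"},
    {"gender": "F", "age": 51, "job_title": "Nurse"},
]

unique_by_gender = count_unique_records(records, ["gender"])                          # 0
unique_by_gender_age = count_unique_records(records, ["gender", "age"])               # 0
unique_by_all = count_unique_records(records, ["gender", "age", "job_title"])         # 2
```

In this toy data set no record is unique on gender alone or on gender and age, but adding job title singles out two individuals. A record that is unique within the data set is also easier to match against outside sources.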