Solving healthcare's big data analytics security conundrum

Brian Eastwood | Oct. 8, 2013
HIPAA understandably makes it hard for organizations to obtain personal health information and even harder to use that information for data analysis. Empowering patients to own and share their own data -- and then assuring them that it's being properly de-identified -- can ease this process.

Put that information in a patient's hands, Harlow says, and those who are willing are empowered to share it. Go beyond just the Blue Button, as patient advocates suggest, and patients could be more specific about who gets what information: a Green Button for anonymized data used for research purposes, or a White Button for encrypted PHI.

"Can there be a parallel universe of sharing, [of] providing information that can be analyzed ... and also shared on an individual basis?" Harlow asks. "Let's create this critical mass.

Anonymizing Health Data for HIPAA-Compliant Analysis
Of course, once an entity has the data, it needs to be de-identified before it can be analyzed.

HIPAA's de-identification standard, spelled out in the HIPAA Privacy Rule, gives an organization two options:

  • Expert determination applies "generally accepted statistical and scientific principles for rendering information not individually identifiable" in such a way that "the risk is very small" that a person could be re-identified.
  • Safe harbor removes 18 specific identifiers that range from name, address and phone number to license plate number and IP address, as sketched below.
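
A minimal sketch of that kind of direct-identifier removal follows. The field names are assumptions for illustration and cover only the identifiers mentioned above, not the full safe-harbor list of 18.

```python
# Hypothetical field names for illustration; safe harbor actually enumerates
# 18 categories of identifiers -- only the ones named in the article appear here.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "license_plate", "ip_address"}

def strip_direct_identifiers(record: dict) -> dict:
    """Return a copy of a claim record with direct identifiers removed."""
    return {field: value for field, value in record.items()
            if field not in DIRECT_IDENTIFIERS}

claim = {
    "name": "Jane Doe",
    "ip_address": "203.0.113.7",
    "diagnosis_code": "E11.9",
    "visit_year": 2013,
}
print(strip_direct_identifiers(claim))
# {'diagnosis_code': 'E11.9', 'visit_year': 2013}
```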

Most of the data covered under safe harbor consists of direct, or unique, identifiers, says Khaled El Emam, CEO of data anonymization system vendor Privacy Analytics, and would therefore be removed from a data set prior to analysis anyway. (El Emam and Harlow spoke at the recent Strata Rx conference in Boston.) What needs de-identification, then, are the quasi-identifiers: the bits of information that can't identify a person on their own but can when combined with other data.
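
To make the quasi-identifier risk concrete, here is a minimal sketch using made-up records; the fields (ZIP code, birth year, sex) are assumptions for this example. Counting how many records share each combination shows which combinations single out one person.

```python
from collections import Counter

# Made-up records with direct identifiers already removed; the quasi-identifier
# fields (zip, birth_year, sex) are assumptions for this example.
records = [
    {"zip": "02139", "birth_year": 1948, "sex": "F"},
    {"zip": "02139", "birth_year": 1948, "sex": "F"},
    {"zip": "02139", "birth_year": 1975, "sex": "M"},
]

# Count how many records share each quasi-identifier combination.
group_sizes = Counter((r["zip"], r["birth_year"], r["sex"]) for r in records)

# Any combination that appears only once could single out one person
# when joined against another data set -- a re-identification risk.
risky = [combo for combo, count in group_sizes.items() if count == 1]
print(risky)  # [('02139', 1975, 'M')]
```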

This can get tricky. To illustrate how it's done, El Emam describes how the State of Louisiana's CajunCodeFest de-identified the 6.7 million Medicaid claims and 4 million immunization records it used in its recent hackathon.

Say you're looking at a large patient population. The vast majority will visit a hospital only once or twice a year, but that data set will include a small minority in the long tail who made many visits. The same is true for claims data: Most patients will file but a handful of claims annually, but those in the long tail could file hundreds. Here, you likely need to "truncate the tail" so those individuals don't stand out, El Emam says.
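
One way to truncate the tail is sketched below, assuming a pandas DataFrame of claims with hypothetical patient_id and claim_date columns and an assumed cap of four claims per patient; a real release would derive the cutoff from the actual distribution.

```python
import pandas as pd

# Hypothetical claims table: one row per claim, several rows per patient.
claims = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 2, 2, 2, 3],
    "claim_date": pd.to_datetime([
        "2013-01-05", "2013-06-10",
        "2013-01-02", "2013-02-11", "2013-03-20", "2013-04-28", "2013-05-30",
        "2013-09-15",
    ]),
})

CAP = 4  # assumed cutoff for illustration only

# Keep at most CAP claims per patient so long-tail patients no longer stand out.
truncated = (claims.sort_values("claim_date")
                   .groupby("patient_id")
                   .head(CAP))

print(truncated.groupby("patient_id").size())
# patient 2 drops from 5 claims to 4; patients 1 and 3 are unchanged
```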

Pay attention to the dates on which a patient's claims were filed, too, El Emam says. Randomizing each date independently could shift the order and suggest, for example, that a person was admitted the fifth time before he was admitted the fourth time. A fixed shift applied to a set of dates keeps the intervals intact, but it's hard to make the case that that actually de-identifies the data. In this case, randomized generalization, which converts all dates to intervals from a first "anchor date" and then randomizes those intervals within a seven-day range, will add noise but maintain order, El Emam says.
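
A minimal sketch of randomized generalization along those lines, assuming plain Python dates for one patient's claims: each gap between consecutive dates is generalized to its seven-day bucket and then replaced with a random value from that bucket, so the exact intervals get noisy while the order of events survives. The function name and bucket handling are assumptions for illustration, not Privacy Analytics' actual implementation.

```python
import random
from datetime import date, timedelta

def randomize_claim_dates(dates, bucket_days=7, seed=None):
    """Illustrative randomized generalization of one patient's claim dates.

    Each gap between consecutive dates is mapped to its 7-day bucket and
    replaced by a random value from that bucket, so exact intervals are
    noisy but the chronological order of claims is preserved.
    """
    rng = random.Random(seed)
    ordered = sorted(dates)
    # Shift the anchor (first) date by a random offset within one bucket.
    noisy = [ordered[0] + timedelta(days=rng.randrange(bucket_days))]
    for prev, curr in zip(ordered, ordered[1:]):
        gap = (curr - prev).days
        bucket_start = (gap // bucket_days) * bucket_days  # buckets 0-6, 7-13, ...
        noisy_gap = bucket_start + rng.randrange(bucket_days)
        noisy.append(noisy[-1] + timedelta(days=noisy_gap))
    return noisy

claims = [date(2013, 1, 3), date(2013, 1, 10), date(2013, 3, 2)]
print(randomize_claim_dates(claims, seed=42))
```

Because every noisy gap is non-negative, the fifth admission can never end up before the fourth, which is exactly the ordering property the approach is meant to preserve.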

 
