Approaches to storing, managing, analyzing and mining Big Data are new, introducing security and privacy challenges within these processes. Big Data transmits and processes an individual's PII as part of a mass of data--millions to trillions of entries--flowing swiftly through new junctions, each with its own vulnerabilities.
Deidentification masks PII, separating information that identifies someone from the rest of his or her data. The hope is that this process protects people's privacy, keeping information that would kindle biases and other misuse under wraps.
Reidentification science, which pieces PII back together and reattaches it to the individual, thwarts the deidentification approaches meant to protect Big Data. That makes it unrealistic to believe that deidentification can really maintain the security and privacy of personal information in Big Data scenarios.
Vulnerabilities, Exposure and Deidentification
Enterprises manage Big Data using large, complex systems that must execute hand-offs from system to system. "Typically an ETL procedure (extract, transform, load) loads Big Data from a traditional RDBMS data warehouse onto a Hadoop cluster. Since most of that data is unstructured, the system runs a job in order to structure the data. Then the system hands it off to a relational database to serve it up, to a BI analyst, or to another data warehouse running Hadoop for storage, reference, and retrieval," explains Brian Christian, CTO, Zettaset. Any Big Data hand-offs or moves cross vulnerable junctions.
Creators of Big Data solutions never intended many of them to do what they do today. Take MapReduce, for example. "Google invented MapReduce to store public links so people can search them," says Christian. There were no worries about security because these were public links. Now enterprises use MapReduce and NoSQL systems on medical and financial records, which should remain private. Because security is not inherent, enterprises and vendors have to retrofit these systems with security. "That's a big problem," says Christian. "Vendors did not design firewalls and IDS for distributed computing architectures." These architectures tend to scale up to extremes beyond what traditional firewalls and IDS can natively address.
According to the Stanford Law Review article, vulnerabilities that expose PII subject people to scrutiny, raising concerns about acts of profiling, discrimination and exclusion based on an individual's demographics. These abuses can lead to loss of control for the individual. While brands use PII to market to customers to their benefit, those same vendors as well as law enforcement, government agencies and other third parties could also interpret and apply that personal data to the individual's detriment.
To prevent that, organizations charged with protecting private data have traditionally used deidentification methods including anonymization, pseudonymization, encryption, key-coding and data sharding to distance PII from real identities, according to the Stanford Law Review article. While anonymization protects privacy by removing names, addresses and Social Security numbers, pseudonymization replaces this information with nicknames, pseudonyms and artificial identifiers. Key-coding encodes the PII and establishes a key for decoding it. Data sharding breaks off part of the data in a horizontal partition, providing enough data to work with but not enough to reidentify an individual.
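As a rough illustration of how these methods differ, the Python sketch below applies anonymization, pseudonymization, key-coding and horizontal sharding to a toy record. Everything in it is an assumption made for illustration: the field names, the sample record, the use of a keyed hash (HMAC-SHA256) to generate pseudonyms, and the round-robin sharding rule do not come from the article or from any particular product.

```python
import hashlib
import hmac
import secrets

# Hypothetical record layout; the field names are illustrative only.
DIRECT_IDENTIFIERS = ("name", "address", "ssn")

SAMPLE = {
    "name": "Jane Doe",
    "address": "1 Main St",
    "ssn": "000-00-0000",
    "diagnosis": "asthma",
}


def anonymize(record):
    """Anonymization: drop the direct identifiers entirely."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def pseudonymize(record, secret):
    """Pseudonymization: replace identifiers with an artificial identifier.

    Assumption: a keyed hash is used so the same person always maps to
    the same pseudonym without exposing the original values.
    """
    material = "|".join(str(record.get(k, "")) for k in DIRECT_IDENTIFIERS)
    digest = hmac.new(secret, material.encode("utf-8"), hashlib.sha256)
    out = anonymize(record)
    out["pseudonym"] = digest.hexdigest()[:16]
    return out


def key_code(record, key_table):
    """Key-coding: swap identifiers for a random code.

    The code-to-identity mapping goes into key_table, which would be
    stored separately and used only for authorized decoding.
    """
    code = secrets.token_hex(8)
    key_table[code] = {k: record.get(k) for k in DIRECT_IDENTIFIERS}
    out = anonymize(record)
    out["code"] = code
    return out


def horizontal_shard(records, n_shards):
    """Data sharding: split rows (not columns) across n_shards partitions."""
    shards = [[] for _ in range(n_shards)]
    for i, rec in enumerate(records):
        shards[i % n_shards].append(rec)
    return shards


if __name__ == "__main__":
    table = {}
    print(anonymize(SAMPLE))
    print(pseudonymize(SAMPLE, b"demo-secret"))
    print(key_code(SAMPLE, table))
```

Note that in every case the remaining fields (here, the diagnosis) travel with the record untouched. That leftover data is what reidentification techniques work from, which is why none of these steps should be read as a guarantee of privacy.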