Subscribe / Unsubscribe Enewsletters | Login | Register

Pencil Banner

BLOG: Data migration is hard to do

Paul Venezia | April 17, 2013
Data munging and data migration come with big challenges -- and big rewards

We might strip these with a regex such as /[^0-9]+/and reassemble them after splitting them up with a regex like /([0-9]{3})([0-9]{3})([0-9]{4})/, with the resulting matches holding 212, 555, and 1212. We can now reformat the phone number however we like. We should also bail out if we encounter a number that cannot be a phone number because it has too many or too few digits.

The free-form free-for-all
It gets dicier as we move into more free-form fields. Addresses are particularly picky since they can be formatted in a wide variety of ways. We also need to deal with the vagaries of street and city names. We need to make sure that we correctly handle "Washington,DC," "Washington, DC," and "Washington DC" properly, along with oddities such as "Winston-Salem, NC," "King of Prussia, PA," "Scranton, Penn.," "N. Providence RI," "Houston, tx," and "O'Fallon, IL."

Those sorts of variations can trip up parsers if they're not accounted for because we can't strip special characters. Additionally, we can't count on a set number of spaces defining the city versus the state or the state abbreviation being present or capitalized. Thus, we need to construct a conditional expression to push that string through to make a best-effort determination of the actual city and state, perhaps even checked against a database consisting of every city and state in the United States. Depending on the result, we may still need to bail out on that record if no definitive determination can be made or at least throw a flag that the record in question needs to be manually checked.

We've only begun to scratch the surface. We have a significant amount of work invested in figuring out just the city, state, and phone number of each record. We need to rinse and repeat for every other field in the spreadsheet, depending on the content.

This mess is a direct result of unconstrained free-form data entry, and it plagues every company everywhere. It doesn't have to be Excel, either. It could be Access, a homegrown database, or any other application. Unless there are checks against the validity and formatting of data upon input, the data will likely be a mess. Of course, that's the point of building a proper database front end to handle data input: We can clean and groom the data on the way in, which greatly enhances the accuracy and usability of that data on an ongoing basis. That's one of the major benefits of using databases in the first place.

However, we cannot discount the effort involved in postprocessing these types of data sets. All kinds of tools have been developed to ease this process, but they will not apply to every case. While they may work for a portion of the input data, the parts that are missed may make their use more problematic than not.


Previous Page  1  2  3  Next Page 

Sign up for CIO Asia eNewsletters.