It was my colleague’s birthday last Monday. Her closest friends knew. The rest of us in the office knew, once she bought us all free coffees. And Google.
She wasn’t surprised that Google knew, but found it unnerving that Google chose to let her know that it knew by displaying a personalised version of its logo with birthday-like icons and a Happy Birthday Sandra label. Did she provide that information to Google at some point? Probably. Did they scrape it from another online cache? No idea. How would she know?
As advisors to technology companies, we are occasionally asked about the legal risks associated with harvesting data via scraping and similar means. If this is a major part of your game plan, here are the starting points you should be thinking about.
Consider the information you are extracting: are you taking information that is personally sensitive (such as personal contact details) or commercially sensitive (such as brand names)? The legal rules around taking personal information, and around using another’s brand for one’s own commercial purposes, are both well developed and highly protective. Are you taking images or compilations of information that are likely to be seen as proprietary, either because they are highly original or because they would have been labour-intensive to collate (e.g. images, or product catalogues)?
Although the law is always a step behind technological developments, the sentiment of the legal cases to date in a number of jurisdictions is that the website owner should be able to prevent scrapers from harvesting information without authorisation.
Typically, harvested data includes product descriptions and pictures reproduced from other sites. As soon as you reproduce another person’s text or images, you raise legal issues of potential copyright infringement. These risks are lessened if (a) the images and text are not reproduced in whole, (b) the text is not reproduced verbatim but, as your school teacher would say, restated in your own words (taking care, however, not to mislead or misstate any aspect of the goods or services), or (c) the images are unoriginal, do not reproduce trade marks, are sourced from a different place to the product description, or otherwise are less likely to be the subject of copyright held by the same owner as the other extracted information.