Most organizations have well-established procedures for vetting and sharing computer code. But what about data analysis?
Important findings are often held in "a mixed bag of presentations, emails, and Google Docs," two members of Airbnb's engineering and data science team blogged at Medium in February. When someone in the organization wants to locate and use that existing work, they often have to track down updated code and waste time checking and reproducing earlier results. And then they'll typically distribute their own findings "through a presentation, email, or Google Doc, perpetuating the cycle."
After considering various ideas on how to solve this problem, Airbnb created an internal Knowledge Repo, combining Git version control and Markdown templates for reporting results. Airbnb recently open-sourced its Knowledge Repository Beta, seeking contributors to help move the project forward.
Git allows the same sort of peer review and version control that developers typically use to collaborate on code, while Markdown offers a mixture of text and code in a single, easily reproducible file. RStudio's tutorial on R Markdown has more information on what Markdown in general can do. Similar text-plus-code formats are available for other languages, such as Python, as well.
The Airbnb framework setup requires Python and supports "knowledge posts" in several formats.
"Posts are written in Jupyter notebooks, Rmarkdown files, or in plain Markdown, but all files (including query files and other scripts) are committed. Every file starts with a small amount of structured meta-data, including author(s), tags, and a TLDR," according to the Medium post, Scaling Knowledge at Airbnb. "A Python script validates the content and transforms the post into plain text with Markdown syntax. We use GitHub's pull request system for the review process. Finally, there is a Flask web-app that renders the Repo's contents as an internal blog, organized by time, topic, or contents."
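To make the validation step concrete, here is a minimal sketch of what checking a post's metadata might look like. It is not Airbnb's script: the `---` delimiters, the `key: value` layout, the field names `authors`, `tags`, and `tldr`, and the helper functions are all assumptions made for illustration, based only on the fields the Medium post mentions.

```python
# Hypothetical sketch, not the Knowledge Repo's actual validator.
# Assumes a post starts with a header delimited by '---' lines and
# made of simple 'key: value' pairs.

REQUIRED_FIELDS = ("authors", "tags", "tldr")  # assumed field names


def parse_header(text):
    """Return the header as a dict, or raise ValueError if it is malformed."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("post must start with a '---' header line")
    header = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return header  # closing delimiter reached
        if ":" in line:
            key, _, value = line.partition(":")
            header[key.strip().lower()] = value.strip()
    raise ValueError("header is never closed with '---'")


def missing_fields(text):
    """List the required fields that are absent or empty in the header."""
    header = parse_header(text)
    return [name for name in REQUIRED_FIELDS if not header.get(name)]


# Made-up example post used only to exercise the functions above.
example_post = """---
title: Example analysis
authors: jane_doe
tags: experiments, pricing
tldr: A one-line summary of the finding.
---
# Body of the post
"""

print(missing_fields(example_post))
```

A check like this could run before a pull request is opened, so that reviewers only see posts whose metadata is complete.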
The project "provides various data stores (and utilities to manage them) for 'knowledge posts,' with a particular focus on notebooks (R Markdown and Jupyter / iPython Notebook) to better promote reproducible research," according to the GitHub repository. "The Knowledge Repository is a work in progress. There are lots of code cleanups and feature extensions TBD. Your assistance and involvement is more than encouraged."