LinkedIn open sources its database change capture system

Mark Gibbs | Feb. 28, 2013
OK, lots of interesting stuff for you this week. First up, LinkedIn has open sourced a system called Databus, a real-time database change capture system that provides a "timeline-consistent stream of change capture events ... grouped in transactions, in source commit order."

The code has been in use inside LinkedIn since 2011, where it serves to "propagate profile, connection, company updates, and many other databases at LinkedIn."

Databus is designed for speed, providing end-to-end latencies of milliseconds and "throughput of thousands of change events per second per server while supporting infinite lookback capabilities and rich subscription functionality."

The infinite lookback feature is said to allow client applications to create a copy of any or all changes to the source database without placing any extra load on the source. It also apparently allows clients to stop and restart acquiring updates so that client-side processing demands or performance limitations can be handled.
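To make the stop-and-restart idea concrete, here is a minimal sketch in Python of a client that remembers how far it got and picks up from there on its next run. It is not Databus code: the `CheckpointedClient` class, the JSON checkpoint file, and the use of an integer sequence number (`scn`) per change are all assumptions made for illustration.

```python
import json
import os
import tempfile


class CheckpointedClient:
    """Hypothetical client that persists the last change it processed."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.last_scn = self._load()

    def _load(self):
        # No checkpoint file yet means this client has never run.
        if not os.path.exists(self.checkpoint_path):
            return 0
        with open(self.checkpoint_path) as f:
            return json.load(f)["last_scn"]

    def process(self, events, max_events):
        """Handle at most max_events new events, then save the checkpoint."""
        handled = []
        for scn, change in events:
            if scn <= self.last_scn:
                continue  # already processed in an earlier run
            if len(handled) == max_events:
                break  # the client has had enough for now
            handled.append(scn)
            self.last_scn = scn
        with open(self.checkpoint_path, "w") as f:
            json.dump({"last_scn": self.last_scn}, f)
        return handled


# A stand-in for the change stream: (sequence number, change) pairs.
stream = [(scn, {"row": scn}) for scn in range(1, 7)]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

first_run = CheckpointedClient(path).process(stream, max_events=4)
second_run = CheckpointedClient(path).process(stream, max_events=4)
```

The second client instance starts from the saved checkpoint, so the two runs together cover every change exactly once without the source database being asked for anything twice.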

The Databus architecture feeds updates from the host database connector to Databus Relays. These relays buffer the serialized change data events in memory and, on demand, send change events to Databus Clients and Databus Bootstrap Producers.
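The relay's role can be pictured as a bounded in-memory buffer that hands out events on request. The Python sketch below is an illustration only, not the Databus implementation: the `Relay` and `ChangeEvent` names, the fixed capacity, and the simplification of one event per sequence number (`scn`) are all assumptions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    scn: int       # hypothetical commit sequence number from the source
    table: str
    payload: dict


class Relay:
    """Hypothetical bounded buffer of change events in commit order."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest event when full.
        self._events = deque(maxlen=capacity)

    def append(self, event):
        if self._events and event.scn <= self._events[-1].scn:
            raise ValueError("events must arrive in source commit order")
        self._events.append(event)

    def oldest_scn(self):
        return self._events[0].scn if self._events else None

    def events_since(self, scn, limit=100):
        """Return up to `limit` events with a sequence number above `scn`."""
        if self._events and scn < self._events[0].scn - 1:
            # The caller needs events that have already been evicted.
            raise LookupError("requested events are older than the buffer")
        return [e for e in self._events if e.scn > scn][:limit]


relay = Relay(capacity=3)
for n in range(1, 6):
    relay.append(ChangeEvent(scn=n, table="profile", payload={"id": n}))

recent = relay.events_since(3)
```

With a capacity of three, only the events numbered 3 to 5 remain in memory. A client asking for everything after event 1 gets a `LookupError`, which is the situation the archived change data is there to handle.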

The Databus Bootstrap Producers are, in effect, archives of change data events which are stored in separate MySQL databases.

When a new Databus Client connects, it first queries a Databus Bootstrap Server for "look back data" (change data that is older than the change data currently held by the Databus Relays). Once the client has "caught up" with events so that it is current on changes, it switches to a Databus Relay for realtime change data events.
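The catch-up-then-switch behavior can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the Databus client API: the `consume` function, the plain lists standing in for the bootstrap archive and the relay buffer, and the integer sequence numbers (`scn`) are all hypothetical.

```python
def consume(start_scn, bootstrap_log, relay_buffer, apply):
    """Replay archived events, then switch to the relay once caught up.

    bootstrap_log and relay_buffer are lists of (scn, change) tuples,
    each sorted by scn. Returns the last sequence number applied.
    """
    checkpoint = start_scn
    relay_oldest = relay_buffer[0][0] if relay_buffer else None

    # Phase 1: the relay no longer holds what we need, so read the archive
    # up to the point where the relay's buffer begins.
    if relay_oldest is None or checkpoint < relay_oldest - 1:
        for scn, change in bootstrap_log:
            if scn > checkpoint and (relay_oldest is None or scn < relay_oldest):
                apply("bootstrap", scn, change)
                checkpoint = scn

    # Phase 2: caught up, so continue from the relay's in-memory buffer.
    for scn, change in relay_buffer:
        if scn > checkpoint:
            apply("relay", scn, change)
            checkpoint = scn
    return checkpoint


bootstrap_log = [(scn, f"change-{scn}") for scn in range(1, 7)]  # scn 1..6
relay_buffer = [(scn, f"change-{scn}") for scn in range(5, 9)]   # scn 5..8

seen = []
final_scn = consume(
    start_scn=2,
    bootstrap_log=bootstrap_log,
    relay_buffer=relay_buffer,
    apply=lambda source, scn, change: seen.append((source, scn)),
)
```

Here the client last saw change 2, so it reads changes 3 and 4 from the archive and changes 5 to 8 from the relay. A client that was only slightly behind (say at change 6) would skip the archive and go straight to the relay.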

Databus is independent of the source database, but a connector is required to interface with the host database. This release provides an Oracle connector; a MySQL connector is due to be released "soon."

It would be interesting to see a high-performance database like NuoDB, which I discussed a few weeks ago and which was designed as an Oracle "drop in" replacement, paired with Databus to provide a distributed database solution with low management overhead, realtime backup, and realtime update delivery. One interesting possibility of this architecture is that clients could perform analytics on any or all of the data over any historical period without impacting the host database performance at all.

So, with this combination of databases echoing databases, you'll be needing some serious storage for your non-realtime backend, right? All that analytic and historical stuff chews up disk space like there's no tomorrow.

You'll probably be looking for something in the 100TB range and, while you could go out and buy from the big guys such as NetApp or Drobo, you'll certainly be in for some sticker shock.
