Jul 2016

Big Data At Ancestry.com: Why Data Stewardship And Open Source Matter So Much

Forbes

As more and more businesses become aware of the value locked inside the data they collect, many are also discovering a pressing problem. Traditionally a company has employed a data team to store, organize and distribute data, but the sheer growth in the volume of data that large businesses deal with today means that model is quickly becoming outdated.

The most popular solutions involve implementing a company-wide data strategy and ensuring staff at all levels are engaged with data-driven operations (a lack of engagement is often cited as a major obstacle to achieving a truly data-driven culture). One aspect of this is becoming known as data stewardship – giving all staff who work with data responsibility for its management.

A good example of a business built on a lot of data is Ancestry.com. The genealogy website has become hugely popular thanks to the data it has built up on family connections dating back almost 1,000 years.

Ancestry.com (ACOM) undertook a thorough restructuring of its data operations while deploying the open source Kafka platform. The primary aim was to move from a once-per-day batch processing data operation to real-time, on-the-fly processing. However, a by-product was an increased understanding of how data was used throughout the business.

Neha Narkhede, CTO of Confluent, and one of the original developers of Kafka while it was an internal project at LinkedIn, tells me, “Traditionally there is one team or set of people who really care about data, and that is the data warehouse team. However if we look at how companies work, there are thousands of developers who are really creating data by the second, writing applications that produce data which is critical to the business.

“And usually they are the people who just created it and threw it over the fence.” When data isn’t properly looked after it becomes meaningless and valueless. Worse, if it is out of date, divorced from its context or incorrectly categorized, it can be damaging if decisions are based on it.

Confluent’s solution, using Kafka, is to code a “metadata repository” into the system, allowing whoever is working with the data to define and redefine its format, in real time.

“This is a pretty big practical game changer,” says Narkhede, “as it allows applications to automatically publish metadata, and it allows applications which are interested in consuming that data to understand it, and to evolve it.”
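To make the idea concrete, here is a minimal sketch of a metadata repository, assuming a very simple model in which a schema is just a mapping of field names to type names. It is not Confluent's or Kafka's actual API: the class `MetadataRepository`, its methods, the `record_views` subject and the compatibility rule (existing fields must keep their types, new fields may be added) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRepository:
    """Toy in-memory stand-in for a shared metadata repository (hypothetical)."""

    # subject name -> list of schema versions; a schema maps field name -> type name
    _versions: dict = field(default_factory=dict)

    def publish(self, subject, schema):
        """Register a new schema version for a subject and return its version number."""
        history = self._versions.setdefault(subject, [])
        if history and not self._is_compatible(history[-1], schema):
            raise ValueError(f"incompatible schema change for {subject!r}")
        history.append(dict(schema))
        return len(history)

    def latest(self, subject):
        """Return a copy of the newest schema so consumers can interpret the data."""
        return dict(self._versions[subject][-1])

    @staticmethod
    def _is_compatible(old, new):
        # Assumed rule: every existing field keeps its name and type; adding fields is allowed.
        return all(new.get(name) == type_name for name, type_name in old.items())


repo = MetadataRepository()

# A producing application publishes the format of the data it writes.
v1 = repo.publish("record_views", {"record_id": "string", "viewed_at": "long"})

# Later the same team evolves the format by adding a field.
v2 = repo.publish(
    "record_views",
    {"record_id": "string", "viewed_at": "long", "country": "string"},
)

# A consuming application looks up the current format before reading.
print(v1, v2, repo.latest("record_views"))
```

In this sketch, a change that retypes or drops an existing field is rejected at publish time, which is one way the people producing data can be held to the format that downstream consumers rely on.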

Missing and mismatched metadata can cause serious problems for a business such as Ancestry, with a database containing over 13 billion records spread across more than 10 petabytes of storage. Chris Sanders, Director Data Warehouse and Visualization at Ancestry.com, says, “We ran into problems where data just didn’t exist or it was inaccurate. For data warehousing, business intelligence, reporting and legal obligations, or to pay royalties, that’s a nightmare.

“Developers can now come in near real time and see that their production data is not just getting dropped off into a message queue or something where they have no idea – they can actually become data stewards now.”

Ancestry’s approach is certainly one which I can see becoming more and more popular as businesses find themselves dealing with an ever increasing amount of data, touching on the workload of a greater number of employees.

As well as the move from batch to real-time data processing, Ancestry’s adoption of Kafka also reflects a broad trend in the industry towards increasing use of open source technology. “The problem is very simple,” says Narkhede. “Data is critical to companies and any kind of system which locks people into a certain software that essentially holds the most essential aspect of a company – which is their data – is unacceptable.

“So open source is changing that, because even if they change vendors the customer knows their data is there because it flows through an open source system.”

Companies, for the foreseeable future, are going to be busy coming up with ways to make sure they are extracting value from their data. They know full well that if they are not, then they have competitors who certainly will be.

Open source greatly reduces the workload by removing the need for a great deal of investment in expensive, bespoke infrastructure. And data stewardship – when it is rolled out throughout a business – reduces the risks posed by bad, out-of-date or inaccurate information.

Both are tactics which I can see becoming popular for companies of all sizes as data and analytics become increasingly important in maintaining a competitive edge.

Article written by Bernard Marr, Forbes.