How Ancestry.com is using big data to map people, places and time

Online genealogy service Ancestry.com is trying to become the Amazon or Netflix of family trees. Much like those companies use customer data to recommend products or movies customers might like, Ancestry.com wants to feed its users relevant historical records and other information on ancestors without making them search through its database. And it’s taking in everything from newspaper clippings to your DNA to make this happen.

If you’ve used Ancestry.com recently, you’re probably thankful for its efforts. According to Head of Engineering Scott Sorenson, Ancestry.com has more than 10 billion records that are part of a 4-petabyte (or 4-million gigabyte) data store. If you’re searching for “John Smith,” he explained, it probably has about 60 million records for “Smith” and about 4 million for “John Smith,” but you’re only interested in the relative handful that are relevant to your John Smith.

Making models smarter

That’s why Ancestry.com is using machine learning to make sorting through those records a lot less like finding a needle in a haystack and a lot more like having that needle — and any others made from the same batch of steel — delivered right to your door. Here’s how the process works, in a nutshell:

  1. Crawl digital records (e.g., newspapers, birth records, death records, census data, ship manifests) online and extract relevant data
  2. (Or 1(a)) Scan, upload and index physical records (via a partner in China)
  3. Stitch together new records with user data to add more context
  4. And this is key, constantly analyze user behavior in order to make its algorithms smarter

As users make judgments about the records they’re presented, Sorenson said, Ancestry.com’s algorithms get better at performing their particular tasks. So, a system for extracting data from newspaper pages might be able to better recognize the various sections of the page (so as to ignore the ads, for example) and then be able to adjust for mistakes in the section it is analyzing. And as with Google’s search algorithms, the more that users interact with records, the better Ancestry.com’s sorting algorithms are able to determine those records’ relevance to any given user.
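To make that feedback loop concrete, here is a minimal sketch of how accept/reject judgments could nudge a relevance model. It is not Ancestry.com’s actual system, which the company did not describe in detail: the three match features, the record IDs and the learning rate are all hypothetical, and the model is a plain logistic regression updated one judgment at a time.

```python
import math

# Hypothetical match features for a (user, record) pair; not Ancestry.com's real features.
FEATURES = ("name_match", "birth_year_close", "same_place")

def score(weights, features):
    """Logistic score in (0, 1): the model's estimate that a record is relevant."""
    z = sum(weights[name] * features[name] for name in FEATURES)
    return 1.0 / (1.0 + math.exp(-z))

def update(weights, features, accepted, learning_rate=0.5):
    """One gradient step from a single accept/reject judgment by a user."""
    error = (1.0 if accepted else 0.0) - score(weights, features)
    return {name: weights[name] + learning_rate * error * features[name]
            for name in FEATURES}

def rank(weights, candidates):
    """Order candidate records from most to least likely relevant."""
    return sorted(candidates, key=lambda c: score(weights, c["features"]), reverse=True)

# Invented feedback: a record matching on all three features was accepted,
# while a name-only match was rejected.
feedback = [
    ({"name_match": 1.0, "birth_year_close": 1.0, "same_place": 1.0}, True),
    ({"name_match": 1.0, "birth_year_close": 0.0, "same_place": 0.0}, False),
]

weights = {name: 0.0 for name in FEATURES}
for _ in range(20):
    for features, accepted in feedback:
        weights = update(weights, features, accepted)

# Two invented "John Smith" candidates for the same user.
candidates = [
    {"id": "census-A", "features": {"name_match": 1.0, "birth_year_close": 0.0, "same_place": 0.0}},
    {"id": "census-B", "features": {"name_match": 1.0, "birth_year_close": 1.0, "same_place": 1.0}},
]
ranked = rank(weights, candidates)
```

After a handful of judgments, the record that also matches on birth year and place ranks above the one that matches on name alone, which is the "relative handful" effect Sorenson described.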

Spit in a tube, pay $99, learn your past

Oh, but Ancestry.com has decided that merely storing and analyzing historical records is just the beginning with regard to providing accurate genealogy information. It also will sequence your DNA, focusing on 700,000 markers important to determining one’s race, lineage and other factors. That service, which simply requires users to swab their cheek or spit in a tube and send it to the lab, costs only $99 (a full genome sequence would cost at least 10 times that, by the way), but could revolutionize the accuracy of Ancestry.com’s models.

Right now, Sorenson said, the DNA service can tell users their race and what country they’re from, and also connect them with other relatives who share a DNA profile. (If your privacy red flag has gone up reading this, Sorenson did note the following: all communications with relatives are optional and initially anonymous; all DNA information is disassociated from personal information; and users get their sequence results via an encrypted key “that we treat with a higher level of security than we’d store your credit card information.”)

Connecting with distant relatives can be valuable, though. A third cousin, for example, might have ancestral information that you don’t, which will help make your family tree that much more accurate. But Sorenson said it really gets interesting when Ancestry.com can combine DNA data with record data in family trees. Someone’s DNA might indicate he’s from France, Sorenson explained, but cross-checking that against that person’s family data will let the service discover he’s actually from the Normandy region.

Going forward, Sorenson said Ancestry.com expects its DNA service to take off like a rocket. The company is investing between $10 million and $15 million in that service over the next couple of years, and has bioinformatic scientists on staff trying to scale algorithms designed to handle hundreds of samples to work with hundreds of thousands or even millions of samples. In that regard, though, Ancestry.com isn’t alone: the steady drop in the price of genome sequencing has everyone in the sector anticipating skyrocketing data volumes.

What’s next: Telling stories and making genealogy real-time

OK, so it has billions of records and our DNA. What more can Ancestry.com possibly want or need to provide us information on our ancestors? Nothing, actually.

It just needs to make better use of what it does have and the new technologies available for working with that information. Genealogy has traditionally been “dusty,” Sorenson explained, but Ancestry.com is trying to tell the stories behind those dusty records. If you’ve seen the NBC program “Who Do You Think You Are?”, on which Ancestry.com traces celebrities’ ancestral roots, you have an idea of what Sorenson is talking about.

For example, by improving its image-processing capabilities, Ancestry.com could extract more information than just name, date and location from old records that it already knows how to process. It could tell someone that his grandfather was the only person on the block to own a radio, or whether he owned his home. Combined with socioeconomic and other external data, Sorenson said, Ancestry.com could “create a really vivid picture” of what it was like to live during a specific time.

By using location data from cell phones, Sorenson said, Ancestry.com could deliver a mobile experience that’s far more than a translation of the web on a smaller screen by making genealogy a geospatial pursuit. For example, Sorenson explained, if a user takes a picture of a gravestone, Ancestry.com would like to provide him with relevant historical data related to that place, and maybe even some nearby points of interest.
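As a rough illustration of what “geospatial” could mean here, the sketch below finds records near the spot where a photo was taken, using the standard haversine great-circle distance. The record IDs, coordinates and 10-kilometer radius are invented for the example, and nothing here reflects how Ancestry.com would actually build such a feature.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearby(records, lat, lon, radius_km):
    """Return records within radius_km of (lat, lon), nearest first."""
    with_distance = [(haversine_km(lat, lon, r["lat"], r["lon"]), r) for r in records]
    with_distance.sort(key=lambda pair: pair[0])
    return [r for distance, r in with_distance if distance <= radius_km]

# Invented geotagged records; the coordinates are placeholders, not real places.
records = [
    {"id": "distant-ship-manifest", "lat": 41.00, "lon": -75.00},
    {"id": "census-page", "lat": 40.00, "lon": -74.90},
    {"id": "parish-register", "lat": 40.01, "lon": -75.00},
]

# Hypothetical location attached to a user's gravestone photo.
photo_lat, photo_lon = 40.00, -75.00
hits = nearby(records, photo_lat, photo_lon, radius_km=10.0)
```

A linear scan like this is fine for a few records; at the scale of billions of records a real service would need a spatial index, which is beyond this sketch.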

Some might think Ancestry.com’s practices and plans toe the privacy line, but if someone has to toe that line, this might be the company to do it. In a fast-paced world it’s easy to get tied up in the moment and in our own little worlds, especially with big data being used elsewhere on the web to keep our attention firmly on one site or another. Using personal data to let users dig decades into their family histories ends up looking very refreshing.

Feature image courtesy of Shutterstock user tovovan.
