Data Ingest, Linking, and Data Integration via Automatic Code Generation

Tony Middleton (HPCC Systems from LexisNexis Risk Solutions)
Data Science, Mission City B1

One of the most complex tasks in a data processing environment is record linkage: the data integration process of accurately matching or clustering records or documents from multiple data sources that refer to the same entity, such as a person or business.

The massive amount of data being collected at many organizations has led to what is now called the “Big Data” problem, which limits the ability of organizations to process and use their data effectively and makes record linkage even more challenging.

New high-performance data-intensive computing architectures that support scalable parallel processing, such as Hadoop MapReduce and HPCC, allow government, commercial, and research organizations to process massive amounts of data and solve complex data processing problems, including record linkage.

A fundamental challenge of data-intensive computing is developing new algorithms that can scale to search and process big data. SALT (Scalable Automated Linking Technology) is a new tool that automatically generates code in the ECL language for the open source HPCC scalable data-intensive computing platform. Working from a simple specification, it addresses the most common data integration tasks, including data profiling, data cleansing, data ingest, and record linkage.

The input to the SALT tool is a small, user-defined specification stored as a text file, which includes declarative statements describing the user's input data and process parameters. The output is ECL code, which is then compiled into optimized C++ for execution on the HPCC platform.

The SALT tool can be used to generate complete, ready-to-execute applications for data profiling, data hygiene (also called data cleansing), data source consistency monitoring (checking the consistency of data value distributions among multiple input sources), data file delta changes, data ingest, and record linking and clustering.
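As a rough illustration of what a data profiling step computes, the sketch below reports the fill rate, the number of distinct values, and the most common value for each field in a few in-memory records. It is plain Python written for this abstract, not SALT output or ECL; the `profile` function and the sample records are hypothetical.

```python
from collections import Counter


def profile(records, fields):
    """Return per-field fill rate, distinct-value count, and most common value."""
    total = len(records)
    report = {}
    for field in fields:
        values = [r.get(field) for r in records]
        # Treat None and blank strings as unpopulated.
        populated = [v for v in values if v is not None and str(v).strip() != ""]
        counts = Counter(populated)
        report[field] = {
            "fill_rate": len(populated) / total if total else 0.0,
            "distinct": len(counts),
            "top_value": counts.most_common(1)[0][0] if counts else None,
        }
    return report


# Hypothetical sample records, for illustration only.
records = [
    {"name": "ANN LEE", "state": "FL"},
    {"name": "ANN LEE", "state": ""},
    {"name": "BOB RAY", "state": "FL"},
    {"name": None, "state": "GA"},
]

report = profile(records, ["name", "state"])
for field, stats in report.items():
    print(field, stats)
```

Statistics like these are typically the first thing examined when deciding which fields are populated and distinctive enough to be useful for linking.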

SALT record linking and clustering capabilities include internal linking and external linking. Internal linking is the batch process of linking records from multiple sources that refer to the same entity to a unique entity identifier. External linking, also called entity resolution, takes three forms: a batch process that links information from an external file to a previously linked base or authority file in order to assign entity identifiers to the external data; an online process in which information entered about an entity is resolved to a specific entity identifier; and an online process that searches an authority file for the records that best match the entered information about an entity.
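The idea behind internal linking can be sketched in a few lines: compare records pairwise, then group every chain of matching records under one entity identifier. The Python below is a minimal sketch under stated assumptions, not the algorithm SALT generates. The match rule in `is_match` (same name plus agreement on birth date or ZIP code), the field names, and the sample records are all hypothetical, and the all-pairs comparison would not scale to the data volumes discussed here.

```python
from itertools import combinations


def normalize(value):
    """Uppercase and collapse whitespace so trivial spelling differences compare equal."""
    return " ".join(str(value or "").upper().split())


def is_match(a, b):
    """Hypothetical rule: same name, plus agreement on birth date or ZIP code."""
    if normalize(a["name"]) != normalize(b["name"]):
        return False
    same_dob = a["dob"] and a["dob"] == b["dob"]
    same_zip = a["zip"] and a["zip"] == b["zip"]
    return bool(same_dob or same_zip)


def link(records):
    """Return one entity id per record, using union-find over pairwise matches."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if is_match(records[i], records[j]):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)  # keep the lowest index as root

    return [find(i) for i in range(len(records))]


# Hypothetical records from three sources, for illustration only.
records = [
    {"name": "Ann Lee", "dob": "1980-01-02", "zip": "33101"},
    {"name": "ANN  LEE", "dob": "1980-01-02", "zip": ""},
    {"name": "Ann Lee", "dob": "", "zip": "33101"},
    {"name": "Bob Ray", "dob": "1975-06-30", "zip": "33101"},
]

entity_ids = link(records)
print(entity_ids)  # [0, 0, 0, 3]
```

The second and third records do not match each other directly, yet both match the first, so all three end up in the same cluster. That transitive grouping is what distinguishes clustering from simple pairwise matching.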

SALT Use Case – LexisNexis Risk Solutions Insurance Services used SALT to develop a new insurance header file and insurance ID to combine all the available LexisNexis person data with insurance data. The process combines 1.5 billion insurance records and 9 billion person records, and the linking process produces 290 million core clusters. SALT reduced the source from more than 20,000 lines of code to a 48-line SALT specification and cut linking time from 9 days to 55 hours. A precision of 99.9907 was achieved.
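To put the use-case figures on a common scale, the short calculation below converts the linking times to hours and expresses both improvements as ratios. It uses only the numbers quoted above; because the original code size is given as "20,000+", the code-size ratio is a lower bound.

```python
old_hours = 9 * 24   # 9 days expressed in hours
new_hours = 55
speedup = old_hours / new_hours

old_loc = 20_000     # lower bound: the abstract says "20,000+"
new_loc = 48
loc_ratio = old_loc / new_loc

print(f"Linking time: {old_hours} h -> {new_hours} h, about {speedup:.1f}x faster")
print(f"Source size: at least {loc_ratio:.0f}x fewer lines")
```

In other words, 216 hours fell to 55 hours, roughly a 3.9x speedup, while the hand-written source shrank by a factor of more than 400.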

Summary and Conclusions – Using SALT in combination with the HPCC high-performance data-intensive computing platform can help organizations solve the complex data integration and processing issues resulting from the Big Data problem, helping organizations improve data quality, increase productivity, and enhance data analysis capabilities, timeliness, and effectiveness.

Tony Middleton

HPCC Systems from LexisNexis Risk Solutions

Tony Middleton, Ph.D.
Sr. Architect, Data Scientist
LexisNexis Risk Solutions/HPCC Systems
Tony.Middleton@LexisNexis.com

Dr. Middleton has worked with the HPCC Systems technology platform and the ECL programming language for more than 11 years with all types of structured and unstructured data. He specializes in data research and in developing new and innovative approaches to processing and using data. He has previously worked with other Big Data companies, including Standard & Poor’s Compustat.
