Taming Data Logistics - the Hardest Part of Data Science

Ken Farmer (IBM)
Data
Location: Sutton South

While most of the focus within data science is on the rapid analysis of vast volumes of data, the hardest part of most solutions is the data acquisition, movement, transformation, and loading – the “data logistics”.

Since the early days of data mining and data warehousing (the late 1980s and early 1990s), it has been understood that 90% of the effort on these projects is spent on data acquisition, cleansing, transformation, and consolidation. The challenges include:

  • undocumented source systems
  • source systems that change business rules without notice
  • source systems that cannot handle frequent extracts of data without encountering concurrency problems
  • source system constraints on languages, network connections, and products
  • the management of thousands of daily processes
  • the maintenance of data logistics code that handles dozens of feeds
  • the rapid loading of data into the consolidated server – without impacting concurrency or creating temporary data inconsistencies

The data warehousing domain refers to data logistics as “ETL”, for Extract, Transform, and Load. Some best practices and methods have developed to address these challenges, but little effort has been put into reusable patterns; most of it has gone into products, the majority of them commercial. Yet in spite of the lack of formal patterns, a sense of what works and what doesn’t has emerged, and it can be read “between the lines” by someone who knows what to look for.
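
As a rough illustration of the Extract, Transform, and Load steps, here is a minimal Python sketch. It is not drawn from the presentation: the `sales` table, the `customer` and `amount` columns, and all function names are hypothetical, and SQLite stands in for the consolidated server. The load step runs in a single transaction, which is one simple way to avoid exposing a partially loaded batch to readers.

```python
import csv
import io
import sqlite3


def extract(source_text):
    """Extract: parse raw CSV text from a (hypothetical) source feed."""
    return list(csv.DictReader(io.StringIO(source_text)))


def transform(rows):
    """Transform: standardize customer names and drop rows that fail validation."""
    clean = []
    for row in rows:
        try:
            customer = row["customer"].strip().upper()
            amount = float(row["amount"])
        except (KeyError, AttributeError, TypeError, ValueError):
            # Assumption: rejects are silently skipped here; a real
            # pipeline would more likely log or quarantine them.
            continue
        clean.append((customer, amount))
    return clean


def load(conn, records):
    """Load: insert the whole batch in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO sales (customer, amount) VALUES (?, ?)", records
        )


def run_etl(conn, source_text):
    """Run the three steps in order and return the number of rows loaded."""
    records = transform(extract(source_text))
    load(conn, records)
    return len(records)
```

A caller would create the `sales` table once, then pass each feed's text to `run_etl`. Keeping the three steps as separate functions is only one possible design, chosen here so that each can be tested on its own.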

This presentation will describe what the challenges look like when trying to deliver data insights that, out of necessity, span many sets of data. It will explain these in both business and technical terms, and then proceed to address some of the common solutions, along with their strengths and weaknesses.

Ken Farmer has twenty years of experience in delivering innovations through data logistics: the unglamorous part of data science that involves acquiring, standardizing, validating, transforming, and integrating vast amounts of data, and making that data available and accessible.

Ken is a senior data architect at IBM, where he leads the company's security & compliance data warehouse. Prior to this role, Ken consulted on search engines and data warehouses in the insurance, telecom, entertainment, and retail industries.
