
The Hidden Data Science Pipeline

Mark Mims (Infochimps)
Data Science Beekman Parlor - Sutton North

The practice of everyday data science is mostly one of plumbing, no matter the size of the data. If you’re constantly moving and munging datasets around, sampling from taps of a big hose, or slinging analysis tools at various shards of larger datasets… your results are highly dependent on the underlying plumbing.

This is dangerous. Why? Because plumbing often falls through the cracks. It’s not captured and managed in the same way as the beloved source code that implements the key algorithms for a business. There’s often a hodge-podge of scripts and deployment recipes… most thrown together in a one-off manner and quietly added to the de facto processing pipeline.

If your professional data scientist (a recently poached grad student) got hit by a bus, you’d have a hard time reproducing any of your infrastructure. If you need to change your pipeline to include new data feeds or outputs, it’s hard to see the scope of such a change until you’re knee-deep in it. These problems can be fixed.

This talk is a call to arms for data science teams. We’ll cover:

  • the data science pipeline. You already have one… you just don’t necessarily have visibility or control over it. Let’s walk through techniques to define and visualize as much of the data science process as possible
  • tests. Turn sanity checks and wireframe renders into actual tests of your pipeline… and run them
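
As a rough illustration of the "tests" point above, here is a minimal Python sketch of a sanity check turned into something you can actually run. Everything in it is hypothetical and not taken from the talk: the `clean_rows` stage, the `check_stage` function, the column names, and the 50% drop threshold are assumptions chosen only to show the shape of the idea.

```python
import csv
import io


def clean_rows(rows):
    """Hypothetical pipeline stage: keep rows whose 'value' parses as a number."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({"id": row["id"], "value": float(row["value"])})
        except (KeyError, TypeError, ValueError):
            # Missing or non-numeric values are dropped rather than repaired.
            continue
    return cleaned


def check_stage(raw_rows, cleaned_rows, max_drop_fraction=0.5):
    """Sanity checks written as assertions (the threshold is an assumed value)."""
    assert cleaned_rows, "stage produced no output"
    assert len(cleaned_rows) <= len(raw_rows), "stage invented rows"
    dropped = 1 - len(cleaned_rows) / len(raw_rows)
    assert dropped <= max_drop_fraction, f"stage dropped {dropped:.0%} of rows"
    assert all(isinstance(r["value"], float) for r in cleaned_rows)


# A tiny in-memory sample standing in for one tap of the big hose.
SAMPLE = "id,value\n1,3.5\n2,\n3,oops\n4,7\n"

raw = list(csv.DictReader(io.StringIO(SAMPLE)))
cleaned = clean_rows(raw)
check_stage(raw, cleaned)  # raises AssertionError if a check fails
```

The same checks could be wrapped in whatever test runner the team already uses and run on every change to the pipeline, which is one way to make the plumbing as visible as the algorithms.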

Let’s leverage tools where they exist and identify gaps we find along the way. Also, let’s do all of this in a way that preserves the agility of experimentation without compromising the reproducibility of results.

Mark Mims

Infochimps

Mark’s a physicist by training and programmer by trade.

He’s architected data-driven solutions, on both bare metal and clouds, across a variety of industries including Energy, Education, and Commercial Modeling and Simulation.

Mark received a doctorate in Mathematical Physics from UT Austin for research simulating quantum algorithms. He is interested in what it takes to train data scientists and is working to add “data science” tracks to various degree programs at Stanford and Utah State University.

Mark’s passion is Data Plumbing, where Data Science meets the real world of DevOps and Infrastructure Engineering. He is currently employed by Canonical building DevOps tools for Ubuntu Server and making sure that the Ubuntu Server operating system meets the needs of Data Plumbers everywhere.
