The practice of everyday data science is mostly one of plumbing, no matter the size of the data. If you’re constantly moving and munging datasets, sampling from taps of a big hose, or slinging analysis tools around to various shards of larger datasets, your results are highly dependent on the underlying plumbing.
This is dangerous. Why? Because plumbing often falls through the cracks. It’s not captured and managed in the same way as the beloved source code that implements the key algorithms for a business. There’s often a hybrid hodge-podge of scripts and deployment recipes, most thrown together in a one-off manner and quietly added to the de facto processing pipeline.
If your professional data scientist (a recently poached grad student) got hit by a bus, you’d have a hard time reproducing any of your infrastructure. If you need to change your pipeline to include new data feeds or outputs, it’s hard to see the scope of such a change until you’re knee-deep in it. These problems can be fixed.
This talk is a call to arms for data science teams. We’ll cover:
Let’s leverage tools where they exist and identify gaps we find along the way. Also, let’s do all of this in a way that preserves the agility of experimentation without compromising the reproducibility of results.
Mark is a physicist by training and a programmer by trade. He has architected data-driven solutions, on both bare metal and clouds, across a variety of industries including Energy, Education, and Commercial Modeling and Simulation.
Mark received a doctorate in Mathematical Physics from UT Austin for research simulating quantum algorithms. He is interested in what it takes to train data scientists and is working to add “data science” tracks to various degree programs at Stanford and Utah State University.
Mark’s passion is Data Plumbing, where Data Science meets the real world of DevOps and Infrastructure Engineering. He is currently employed by Canonical building DevOps tools for Ubuntu Server and making sure that the Ubuntu Server operating system meets the needs of Data Plumbers everywhere.
View a complete list of Strata + Hadoop World 2013 contacts