Data Science

This track looks at algorithms, math, statistics, and machine learning; at getting the most from data science teams; and at the tools and best practices that data scientists use to ply their craft.
Who should attend: Data scientists, researchers, and mathematicians.

Track Hosts

Anna Smith is a resident data scientist at bitly in New York while being in absentia from the University of Oregon physics doctorate program. Recently, she has published in both Forbes and Publications of the Astronomical Society of Australia. Her interests include manipulating data and catching up on the latest celebrity gossip.

Max Shron is a New York-based data strategist. He provides expertise and mentorship ranging from specification design and platform architecture to strategy execution, to organizations across a wide gamut of sizes and industry verticals. This work encompasses a complete data pipeline including definition, collection, analysis, visualization, and insight. Max previously was lead data scientist at New York-based OkCupid, and participated as the big-data side of its successful OkTrends blog. His work has appeared worldwide, in outlets including the New York Times, Chicago Tribune, Huffington Post and WNYC. Max holds a degree in Mathematics from the University of Chicago.

Regent Parlor
Tutorial Please note: to attend, your registration must include Tutorials on Monday.
Matt Harrison (FusionIO)
Average rating: ***..
(3.71, 7 ratings)
This tutorial will jumpstart your Python experience. Learn the basics: enough Python to be dangerous. Then use two of the most popular packages for analysis: Matplotlib for plotting and Pandas for data wrangling. This will be a hands-on tutorial, so bring a laptop with Python 2.7 installed, and the gumption to hit the ground running and see what everyone is raving about.
Nassau Suite
Tutorial Please note: to attend, your registration must include Tutorials on Monday.
Antonio Piccolboni (Per data LLC), Joseph Rickert (Revolution Analytics)
Average rating: ***..
(3.40, 5 ratings)
This tutorial is aimed at R users who want to use Hadoop to work on big data and Hadoop users who want to do sophisticated analytics. We will introduce R, Hadoop, and the RHadoop project. We will then cover three R packages for Hadoop and the mapreduce model. We will present numerous examples of increasing complexity, including the combination of rmr and RevoScaleR to solve modeling problems.
Murray Hill Suite
Tutorial Please note: to attend, your registration must include Tutorials on Monday.
Giovanni Seni (Intuit)
Average rating: ***..
(3.44, 9 ratings)
This tutorial, based on a published book by the speaker, offers a hands-on intro to ensemble models, which combine multiple models into a single predictive system that’s often more accurate than the best of its components. Participants will use data sets and snippets of R code to experiment with the methods to gain a practical understanding of this breakthrough technology.
Nassau Suite
Tutorial Please note: to attend, your registration must include Tutorials on Monday.
Leah Hanson (Google)
Average rating: ****.
(4.00, 1 rating)
Julia is a high-performance, open source language with great tools for numerical and statistical work. If you know R, MATLAB, or NumPy, you will feel at home in Julia. Unlike these systems, however, Julia takes advantage of modern compiler technology, combining an intuitive programming model with the speed of a low-level language. This workshop will take you from installed to productive in Julia.
Rhinelander South
Tutorial Please note: to attend, your registration must include Tutorials on Monday.
Matthew Russell (Digital Reasoning)
Average rating: *****
(5.00, 8 ratings)
A code-intensive workshop that breaks down the nuts and bolts of using IPython Notebook to uncover insights from social web APIs such as Twitter, Facebook, LinkedIn, and Google+. Attendees with a basic programming background will walk away with a working knowledge of how to access and mine valuable information from the social web.
Beekman Parlor - Sutton North
John Foreman (MailChimp)
Average rating: ****.
(4.50, 10 ratings)
MailChimp's first big data effort, the Email Genome Project, was internal and focused on abuse prevention. But once this centralized storage and analytics capability demonstrated its practical value, the company turned toward crafting user-facing big data products. This talk will detail the results of MailChimp's effort to democratize big data analysis in email marketing for their users.
Beekman Parlor - Sutton North
Jonathan Natkins (WibiData), Juliet Hougland (Self)
Average rating: ****.
(4.20, 10 ratings)
Consumer expectations have dramatically increased and retailers must present relevant content to maintain a competitive advantage. This presentation will demo an e-commerce application with real-time, personalized recommendations and discuss combining open-source system architecture, based on HBase and Kiji, with good predictive model design to build a scalable, real-time recommendation system.
Beekman Parlor - Sutton North
Moderated by:
Steve Lohr (The New York Times | Brown Institute for Media Innovation at Columbia University)
Panelists:
Chris Wiggins (hackNY/Columbia), Yann LeCun (NYU), Deborah Estrin (Cornell NYC Tech)
Average rating: ***..
(3.50, 4 ratings)
What can Data Science do for NYC? What can NYC do for Data Science? Deborah Estrin, first faculty member at Cornell NYC Tech, Chris Wiggins, cofounder of hackNY and member of the Institute for Data Sciences and Engineering at Columbia, and Yann LeCun, Director of the Center for Data Science at NYU, will answer these questions and more about the present and future of Data Science in NYC.
Beekman Parlor - Sutton North
Mark Mims (Infochimps)
Average rating: **...
(2.67, 6 ratings)
This is a talk about the practice of data science. It's about taking all the implicit bits of the data science pipeline and exposing them to the light of day. We'll walk through developing and managing such a data science "pipeline" and cherry-pick a few practices from the software development world to improve the quality and stability of results.
Beekman Parlor - Sutton North
Baron Schwartz (VividCortex Inc)
Average rating: ***..
(3.47, 19 ratings)
What if data doesn't need to be big? Many use cases can be served well by a Small Data mindset, trading off accuracy in return for decreased cost. Examples include Bloom Filters, moving averages, and downsampling. This talk presents ideas and options you might not have considered for reducing big problems to comparatively small and cheap ones.
Beekman Parlor - Sutton North
Srisatish Ambati (0xdata Inc), Cliff Click (0xdata)
Average rating: ***..
(3.17, 6 ratings)
Get both big data and better algorithms with H2O, an open source math and prediction engine. Once data science gets past scale and sampling, asymmetric and unbalanced data and missing elements affect the yields of popular algorithms. We present the life cycle of big data modeling. H2O brings scale to the versatile R language and, through it, to the math community.
Beekman Parlor - Sutton North
Ulrich Rueckert (Datameer)
Average rating: ****.
(4.50, 6 ratings)
Even if one has big data, sometimes there is a lack of key data. This is a problem for predictive analytics: if there is only a limited amount of training material (e.g. user ratings, categorized documents), then it is hard to generate accurate models. The talk introduces new semi-supervised learning methods to overcome this problem by utilizing the vast amount of unlabeled data.
Beekman Parlor - Sutton North
Wes McKinney (DataPad Inc.)
Average rating: ****.
(4.00, 3 ratings)
This talk will look at end-to-end data workflows (i.e. the sequence of preparation, analysis, visualization, and collaboration) and discuss technologies and tools (both programming and UI-driven) that can help individuals and organizations do more with their data.
Beekman Parlor - Sutton North
Average rating: **...
(2.80, 10 ratings)
Voice of the customer (VOC) data is a rapidly growing, unstructured, untapped data source – for your web site and across social media sites. Topic discovery through clustering of user verbatims, integrated with decision support data, can unleash valuable, actionable insights from millions of customers.
Beekman Parlor - Sutton North
Zack Exley (Wikimedia Foundation), Sahar Massachi (Independent)
Average rating: ****.
(4.00, 13 ratings)
There's something about A/B testing that invites statistical malpractice, and that makes communication between academics and practitioners very difficult. Wikipedia's revenue depends on doing testing right. We'd like to present simple methods that we believe accurately predict future performance from A/B test results, while minimizing sample size, along with proofs from four years of test data.
Beekman Parlor - Sutton North
Robert Johnson (Interana)
Average rating: ***..
(3.30, 10 ratings)
Many of the world's largest datasets are time series. With today's technology, the number of things in the world doesn't seem that big, but the record of how those things change over time is. Unfortunately, many data tools don't natively treat time as a first-class concept. I'll be talking about a variety of ways to organize your data and architect your data systems to get the most out of your time-based data.
Beekman Parlor - Sutton North
Vaclav Petricek (eHarmony)
Average rating: ****.
(4.45, 11 ratings)
Humans have a mixed record in choosing romantic partners. Are looks or brains more important for a happy marriage? This session will show you how big data and large scale machine learning can help us model such complex behavior and tell us which traits in a partner actually matter. Who knows - maybe Hadoop will help you find Love ;-)

Sponsors

Sponsorship Opportunities

For exhibition and sponsorship opportunities, contact Susan Stewart at sstewart@oreilly.com

Media Partner Opportunities

For information on trade opportunities with O'Reilly conferences, email mediapartners@oreilly.com

Press & Media

For media-related inquiries, contact Maureen Jennings at maureen@oreilly.com

Contact Us

View a complete list of Strata + Hadoop World 2013 contacts