Schedule: Data Science sessions

A look at what it means to work with data for a living. Data Science is a flourishing discipline, combining math, engineering, writing and skepticism in equal measure, as John Rauser puts it.

Ballroom CD
Please note: to attend, your registration must include Tutorials.
Sarah Sproehnle (Cloudera, Inc.)
Average rating: ****.
(4.83, 6 ratings)
This tutorial provides a solid foundation for those seeking to understand large scale data processing with MapReduce and Hadoop, plus its associated ecosystem. This session is intended for those who are new to Hadoop and are seeking to understand where Hadoop is appropriate and how it fits with existing systems. No programming experience is required. Read more.
Ballroom E
Please note: to attend, your registration must include Tutorials.
Ken Krugler (Scale Unlimited)
Average rating: **...
(2.75, 4 ratings)
Want to extract and process Big Data from the web? This tutorial will show you how to use key open source technologies such as Hadoop, Cascading, Bixo, Tika, Mahout and Solr to create scalable, reliable web mining solutions. Read more.
Ballroom G
Please note: to attend, your registration must include Tutorials.
Joseph Rickert (Revolution Analytics)
Average rating: ****.
(4.50, 4 ratings)
This tutorial will enable anyone with some programming experience to begin analyzing data with the R programming language. Read more.
Ballroom H
Please note: to attend, your registration must include Tutorials.
Dean Wampler (Typesafe), Jason Rutherglen (Datastax)
Average rating: ***..
(3.00, 1 rating)
This hands-on tutorial teaches you how to set up and use Hive, a high-level data warehouse tool for Hadoop. Hive provides a SQL-like query language, HiveQL, that is easy to learn for people with prior SQL experience, making Hive attractive for data warehousing teams. Hive leverages the power of Hadoop for working with massive data sets without requiring expertise in MapReduce programming. Read more.
Ballroom CD
Please note: to attend, your registration must include Tutorials.
Jeremy Howard (Kaggle), Mike Bowles (Biomatica)
Average rating: ****.
(4.44, 9 ratings)
Wouldn't it be great if there were just two algorithms that could handle most of your predictive modeling needs? It turns out that this is actually the case. Noted machine learning instructor Dr Mike Bowles and champion data miner Jeremy Howard will teach you everything you need to know to apply them successfully. Read more.
Ballroom E
Please note: to attend, your registration must include Tutorials.
Simon Rogers (Guardian), Michael Brunton-Spall (Guardian News and Media)
Average rating: ****.
(4.00, 1 rating)
Learn firsthand from award-winning Guardian journalists how they mix data, journalism and visualization to break and tell compelling stories, all at newsroom speed. Read more.
Ballroom G
Please note: to attend, your registration must include Tutorials.
Nate McCall (Apigee)
This presentation goes beyond the hype, buzzwords, and rehashed slides and actually presents the attendees with a hands-on, step-by-step tutorial on how to write a Java application on top of Apache Cassandra. It focuses on concepts such as idempotence, tunable consistency, and shared-nothing clusters to help attendees get started with Apache Cassandra quickly while avoiding common pitfalls. Read more.
Ballroom H
Please note: to attend, your registration must include Tutorials.
Jock Mackinlay (Tableau Software), Ross Perez (Tableau Software)
Average rating: ****.
(4.00, 2 ratings)
In this hands-on class, learn how to turn data into effective, interactive visualizations. You do not require a Tableau license to participate, but must bring a Windows laptop or virtual machine. Read more.
GA J
Please note: to attend, your registration must include Tutorials.
Sarah Sproehnle (Cloudera, Inc.)
Average rating: ****.
(4.25, 4 ratings)
Learn how to use a Hadoop cluster for data analysis using Java MapReduce, Apache Hive and Apache Pig, and get an overview of using the HBase Hadoop database. Some programming experience is strongly recommended for this session. Read more.
Mission City B1
Q Ethan McCallum (@qethanm)
Average rating: *....
(1.00, 1 rating)
The biggest problem in data science is ... the data itself. Read more.
Mission City B1
Peter Skomoroch (Data Wrangling)
Average rating: ***..
(3.00, 2 ratings)
New analysts or engineers are often lost when textbook approaches fail on real-world data. Drawing inspiration from problem-solving techniques in mathematics and physics, we will walk through examples that illustrate how to come up with creative solutions and solve real-world problems with data. Read more.
Mission City B1
Tony Middleton (HPCC Systems from LexisNexis Risk Solutions)
Average rating: ****.
(4.00, 1 rating)
How to simplify the data integration process and save a significant amount of development time by automatically generating code for processes such as data profiling, data cleansing, and record linkage. A case study will show a complex Big Data linking application in which insurance data was converted to HPCC using the SALT tool, reducing 20,000+ lines of source code to a 48-line SALT specification. Read more.
Mission City B1
Philip (Flip) Kromer (Infochimps, a CSC Big Data Business)
Average rating: *****
(5.00, 1 rating)
Instead of working too hard to define the parameters in an attempt to completely remove the ambiguity, look at what people do, interact with and talk about. We can watch what people do and decide from there what a coffee shop is and where the boundaries of your neighborhood are. It might not be the “truth”, but it can be darn close. Read more.
Mission City B1
This presentation will be streamed live.
Xavier Amatriain (Netflix)
Average rating: ****.
(4.50, 2 ratings)
Netflix is known for pushing the envelope of recommendation technologies. The Netflix Prize put a spotlight on recommender system research and a focus on predicting ratings. But, predicting a rating is only part of the recommendation problem. In this talk I will describe how other sources of implicit and contextualized information can be used to create a personalized experience. Read more.
Ballroom CD
Average rating: ***..
(3.00, 1 rating)
How do you architect big data systems that leverage virtualization and platform as a service? We will walk through a layered approach to building a unified analytics platform using virtualization, provisioning tools and platform as a service. Read more.
Mission City B1
Joris Poort (Startup)
Data science applied in engineering-driven industries is revolutionizing how highly complex products are developed. Unprecedented access to computing power, combined with advanced data science tools, provides the opportunity not only to increase the speed of development but also to improve the final design. Using a practical aerospace example, Joris will illustrate the tools and techniques described. Read more.
Ballroom CD
Ana Martinez (CityGrid Media), Kin Lane (API Evangelist)
Learn how CityGrid built a world-class platform to aggregate the data powering its publicly available local places, content and ads APIs using Hadoop, Solr and MongoDB. Read more.
Mission City B1
This presentation will be streamed live.
Theo Schlossnagle (OmniTI/Circonus)
Average rating: ***..
(3.67, 3 ratings)
In today's environments, we're often forced to collect data before we know if it will be useful. This tendency leads to seas of data, flowing in real time with very little structure or understanding of what the data means. Given that, how can you tell when data is "normal"? Let's find out. Read more.
Mission City B1
Jeremy Howard (Kaggle)
Average rating: ****.
(4.50, 4 ratings)
In "The Evolution of Data Products", O'Reilly Media's Mike Loukides notes: "the question of how we take the next step — where data recedes into the background — is surprisingly tough." Jeremy Howard will show why this is tough, and what to do about it. He will show how predictive modelling, simulation, and optimization can be combined to deliver results instead of just delivering data. Read more.
Mission City B1
Alyona Medelyan (Pingar), Anna Divoli (Pingar)
Average rating: ***..
(3.00, 1 rating)
In this session we discuss approaches to mining unstructured data that gradually find their way into the real world. Text mining and analytics algorithms strive to identify documents’ categories, main topics, mentioned names and other entities; they summarize and detect sentiment. We describe case studies that take advantage of such algorithms in the legal, forensics and healthcare sectors. Read more.
Mission City B1
Alasdair Allan (The Thing System, Inc.)
Average rating: *****
(5.00, 1 rating)
Big data isn't just about multi-terabyte data sets hidden inside eventually-concurrent distributed databases in the cloud. It's also about the hidden data you carry with you all the time. This talk will discuss that data: the data on your cell phone and other mobile devices, along with the possibilities for making use of it. Read more.
Mission City B1
Daniel Tunkelang (LinkedIn), Claire Hunsaker (Samasource)
Average rating: ****.
(4.00, 1 rating)
In this talk, we will analyze various dimensions of microwork that characterize applications, tasks, and crowds. Drawing on our experience at companies that have pioneered the use of microwork (Samasource) and data science (LinkedIn), we will offer practical advice to help you design crowdsourcing workflows to meet your data product needs. Read more.
Mission City B1
Jan Reichelt (Mendeley Ltd.), William Gunn (Mendeley Research Networks)
Mendeley is a New York and London-based startup that has crowdsourced the world's largest database of academic literature. Over 1M researchers strong, Mendeley is taking academia to the cloud. Read more.
Ballroom CD
Paul Brown (Paradigm4 Inc.)
The science and commercial worlds share requirements for a high performance informatics platform to support collection, curation, collaboration, exploration, and analysis of massive datasets. SciDB is an open source analytical database that provides seamlessly integrated massively scalable analytics. We present performance and scalability for non-embarrassingly parallel operations. Read more.
Ballroom E
Peter Kuhn (Scripps Physics Oncology)
Average rating: ****.
(4.50, 6 ratings)
Metastasis is the lethal form of cancer. It arises when cancer cells travel through the patient's blood and colonize other organs. Finding and characterizing these cells enables the prediction and monitoring of response to cancer treatments. Read more.

Sponsors

  • EMC
  • Microsoft
  • HPCC Systems™ from LexisNexis® Risk Solutions
  • MarkLogic
  • Shared Learning Collaborative
  • Cloudera
  • Digital Reasoning Systems
  • Pentaho
  • Rackspace Hosting
  • Teradata Aster
  • VMware
  • IBM
  • NetApp
  • Oracle
  • 1010data
  • 10gen
  • Acxiom
  • Amazon Web Services
  • Calpont
  • Cisco
  • Couchbase
  • Cray
  • Datameer
  • DataSift
  • DataStax
  • Esri
  • Facebook
  • Feedzai
  • Hadapt
  • Hortonworks
  • Impetus
  • Jaspersoft
  • Karmasphere
  • Lucid Imagination
  • MapR Technologies
  • Pervasive
  • Platform Computing
  • Revolution Analytics
  • Scaleout Software
  • Skytree, Inc.
  • Splunk
  • Tableau Software
  • Talend

For information on exhibition and sponsorship opportunities at the conference, contact Susan Stewart at sstewart@oreilly.com.

For information on trade opportunities with O'Reilly conferences, contact Kathy Yu at mediapartners@oreilly.com.

For media-related inquiries, contact Maureen Jennings at maureen@oreilly.com
