Schedule: Hadoop and Beyond sessions

There’s a revolution underway in how we harness and make sense of the world’s information. In this track, we’ll dive into the tools that make the big data revolution possible—tools like Hadoop, Cassandra, Storm, Spark/Shark, and Drill—and how they extend the data science toolkit.

Track Hosts

Beau Cronin co-founded two startups based on probabilistic inference, the second of which was acquired by Salesforce in 2012. He now works there as a product manager. He received his PhD in computational neuroscience from MIT in 2008.

Paco Nathan is an O'Reilly author ("Enterprise Data Workflows with Cascading"), an evangelist for Apache Mesos, and a "player/coach" who has led innovative data teams building large-scale apps for more than 10 years. He is an expert in machine learning, cluster computing, and enterprise use cases for big data. Interests: Mesos, PMML, Cascalog, Scalding, Python for analytics, NLP.

GA Ballroom K
Tutorial. Please note: to attend, your registration must include Tutorials on Tuesday.
Sameer Agarwal (UC Berkeley), Tathagata Das (Databricks), Ali Ghodsi (UC Berkeley), Ion Stoica (UC Berkeley), Ameet Talwalkar (Databricks), Reynold Xin (Databricks), Matei Zaharia (Databricks), Joseph Gonzalez (UC Berkeley)
Average rating: ****.
(4.29, 7 ratings)
3-Hours: An introduction to the newest components of the open-source Berkeley Data Analytics Stack (BDAS) in development at UC Berkeley (and an overview of existing ones). BlinkDB is a SQL engine that provides fast approximate distributed query results. MLbase includes a library to make machine learning at scale easy. Tachyon is a file system that provides memory-speed sharing across frameworks.
SOLD OUT
Ballroom F
Tutorial
John Akred (Silicon Valley Data Science), Richard Williamson (Silicon Valley Data Science), Stephen O'Sullivan (Silicon Valley Data Science)
Average rating: ***..
(3.27, 22 ratings)
3-Hours: What are the essential components of a data platform? This tutorial will explain how the various parts of the Hadoop and big data ecosystem fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.
Ballroom H
Tutorial
Rich Raposa (Hortonworks)
Average rating: ****.
(4.30, 10 ratings)
This workshop provides a detailed discussion of the new features of Apache Hadoop 2.0. We will discuss how YARN turns Hadoop from a single-use system for batch data processing into a multi-use platform for storing and processing data in many ways other than batch. We will also discuss the details of new HDFS improvements such as High Availability, Federation, and Snapshots.
GA Ballroom K
Tutorial
Andy Konwinski (Databricks), Sameer Agarwal (UC Berkeley), Tathagata Das (Databricks), Ameet Talwalkar (Databricks), Shivaram Venkataraman (UC Berkeley), Patrick Wendell (Databricks), Reynold Xin (Databricks), Matei Zaharia (Databricks), Joseph Gonzalez (UC Berkeley), Haoyuan Li (Tachyon Nexus, Inc.)
Average rating: ***..
(3.10, 10 ratings)
3-Hours: Get hands-on training with the newest components of the open-source Berkeley Data Analytics Stack (BDAS). Lessons will cover BlinkDB, MLbase, Spark, Spark Streaming, and Shark. We will provide each audience member with an EC2 cluster and walk through hands-on exercises using these technologies to analyze real-world datasets.
SOLD OUT
Room 204
Tutorial
Florian Leibert (Mesosphere), Paco Nathan (O'Reilly Media), Benjamin Hindman (Apache Mesos)
Average rating: ****.
(4.40, 5 ratings)
3-Hours: Mesos is a cluster manager that provides efficient resource isolation for distributed frameworks, much like Google's "Borg" for warehouse-scale computing. We'll provide hands-on experience in how to build scalable, fault-tolerant data workflows atop Mesos. We'll use Chronos to orchestrate Hadoop jobs and other data prep, then use Marathon to launch a Rails + Redis app to serve results.
Ballroom H
Tutorial
Ronan Stokes (Cloudera)
Average rating: *....
(1.30, 20 ratings)
3-Hours: Apache HBase is a distributed, column-oriented, key-value store for Apache Hadoop (via integration with HDFS). In this tutorial, you will learn the basic elements of building a real-time application that uses Apache HBase as a persistent data store.
GA Ballroom J
Edd Dumbill (Silicon Valley Data Science)
Average rating: ***..
(3.76, 17 ratings)
A maze of twisty databases, all of which look the same, and each of which claims to be best for the job. Welcome to the world of choosing big data vendors. In this session we'll map out the data tool landscape and lay out a framework to help you choose a solution, or elect to build one yourself.
GA Ballroom J
Joseph Adler (Interana, Inc.), Xin Fu (LinkedIn Corporation), Bee-Chung Chen (LinkedIn, Inc.)
Average rating: ****.
(4.00, 17 ratings)
This talk describes how LinkedIn's engineering, data science, and reporting teams work together to develop, test, and rank new insights, recommendations, and updates shown on our home page stream.
GA Ballroom J
Soam Acharya (Altiscale), Charles Wimmer (Altiscale), David Chaiken (Altiscale)
Average rating: **...
(2.33, 6 ratings)
The growing popularity of Hadoop has led to an increasing number of clusters worldwide. Priming these clusters with data from existing client repositories is difficult due to a number of issues, including data size, network constraints, security, and lack of domain knowledge. In this talk, we present a number of techniques and best practices for uploading large amounts of data to remote Hadoop clusters.
GA Ballroom J
Adam Fuchs (Sqrrl)
Average rating: ***..
(3.57, 7 ratings)
Apache Accumulo has evolved from a niche government project to a key component of the Hadoop ecosystem with adopters across a variety of industries. One important differentiator for Accumulo is the concept of "cell-level security." Learn how to properly implement cell-level security concepts from the former technical director of the Accumulo project at NSA.
GA Ballroom J
Rachel Poulsen (Silicon Valley Data Science), John Akred (Silicon Valley Data Science)
Average rating: ***..
(3.43, 7 ratings)
Design of Experiments (DOE) is a scientific approach to understanding causality using data collection and applied statistical techniques. Through a series of relevant case studies, this session will review the “design” and the “experiment” side of DOE, including systematic data collection and basic statistical applications, and discuss relevant applications beyond A/B testing websites.
GA Ballroom J
Patrick McFadin (DataStax)
Average rating: ****.
(4.67, 3 ratings)
Apache Cassandra has proven to be one of the best solutions for storing and retrieving time series data. This talk will give an overview of the many ways you can be successful.
GA Ballroom J
Shrikanth Shankar (Qubole Inc.)
Average rating: ****.
(4.25, 4 ratings)
Shrikanth Shankar, Qubole’s VP of Engineering, shares his best practices for building high-performance, scalable queries and deploying User Defined Functions (UDFs) to Big Data applications in Apache Hive. For data analysts and data scientists in the trenches, this is a key session to attend.
GA Ballroom J
Vinod Kumar Vavilapalli (Hortonworks)
Average rating: ****.
(4.25, 4 ratings)
The Hadoop 2.0 revolution is in full force! Organizations, companies, and users are gearing up for the move from 1.0 to 2.0. In this talk, we will discuss what Hadoop 2.0 is about, what YARN is, what features HDFS2 unlocks, and what it means to move to 2.0. We'll discuss this major migration from 1.0 to 2.0 from various perspectives: admins, frameworks, end users, and data processing platforms.
GA Ballroom J
Felienne Hermans (Delft University of Technology)
Average rating: ****.
(4.40, 5 ratings)
Spreadsheets are used extensively in industry: they are the number one tool for financial analysis. But they are as easy to build as they are difficult to analyze, maintain, and check. Felienne’s research aims to develop methods that help spreadsheet users understand, update, and improve spreadsheets.
GA Ballroom J
Marcel Kornacker (Cloudera, Inc.)
Average rating: ****.
(4.57, 7 ratings)
Learn how and why it is now possible for Apache Hadoop to serve as a virtual Enterprise Data Warehouse (EDW) framework for native Big Data (stored in HDFS), making it no longer necessary to move that data into the EDW at great expense simply for analysis. In this session, attendees will get an architect-level view of the solution and explore an example configuration and benchmark numbers.
Ballroom CD
Avery Ching (Facebook)
Average rating: ****.
(4.25, 4 ratings)
Analyzing graphs can lead to useful insights that drive product and business decisions. This talk describes our efforts at Facebook to scale Apache Giraph to very large graphs (up to one trillion edges) and how we run Apache Giraph in production. We will also talk about how to build applications, some of the algorithms that we have implemented, and their use cases.
GA Ballroom J
Reynold Xin (Databricks), Sameer Agarwal (UC Berkeley)
Average rating: ***..
(3.50, 6 ratings)
BlinkDB is an approximate query engine that answers queries in seconds on extremely large datasets by leveraging data sampling. It exploits advances in machine learning and distributed query processing to let users trade off response time against accuracy. BlinkDB is being integrated into Shark and Presto. We will cover real-world use cases of BlinkDB at adopters such as Facebook.
GA Ballroom J
Paco Nathan (O'Reilly Media)
Average rating: ****.
(4.00, 4 ratings)
Google "Omega" research: 80% of cluster jobs are batch, 60% of cluster resources go to services. Batch is simple, services are hard, and mixing workloads is key to building efficient distributed apps. This talk examines case studies of Mesos workloads, ranging from Twitter (100% on prem) to Airbnb (100% cloud). How did they leverage "data center OS" building blocks for orders of magnitude gains at scale?
GA Ballroom J
Rahul Pathak (Amazon Web Services)
Average rating: ****.
(4.00, 3 ratings)
Learn how AWS thinks about big data and how we and our customers have approached managing large datasets using services such as Amazon S3, Amazon Elastic MapReduce, Amazon DynamoDB, and Amazon Redshift.
GA Ballroom J
Matvey Arye (Princeton University/Cloudflare), Albert Strasheim (CloudFlare)
Average rating: ****.
(4.00, 1 rating)
Big data is evolving. The state of the art has gone from running large batch queries over static, rarely updated data sets to handling high-velocity data with low processing latency. In this session we present a new data framework geared toward processing data with a very high update frequency. The framework utilizes the Go language's advanced concurrency primitives and extensibility.