Strata 2013 Beyond Hadoop Sessions

In this track, we’ll take a deep dive into Cassandra, Storm, Drill, and other emerging technologies that are quickly becoming vital tools in the data scientist’s toolbox.

Track Hosts

Bradford Stephens is the founder and CEO of Drawn to Scale, creators of the Spire database. Spire is a SQL database built on Hadoop and HBase, similar to Google F1. Drawn to Scale has customers powering large web apps, mobile infrastructures, telecoms, social networks, and more. A long-time user of Hadoop and HBase, Bradford has built large infrastructures at various startups and enterprises, and worked on Microsoft SQL Server. He holds degrees in Computer Science and Political Science, and spent several years as a Campaign Manager in politics at the Presidential and U.S. House levels.

Julia Ferraioli is a Developer Advocate working on Google Compute Engine. She helps developers harness the power of Google's infrastructure to tackle their computationally intensive processes and jobs. She comes from an industrial background in software engineering, and an academic background in machine learning and assistive technology.

Ballroom G
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Ion Stoica (UC Berkeley), Matei Zaharia (Databricks), Reynold Xin (Databricks), Shivaram Venkataraman (UC Berkeley), Andy Konwinski (UC Berkeley), Tathagata Das (Databricks)
Average rating: *****
(5.00, 3 ratings)
An introduction to Spark and Shark, two components of the open-source Berkeley Data Analytics Stack (BDAS) in development at UC Berkeley. Spark is a high-speed cluster computing system compatible with Hadoop that can outperform it by up to 100x. Shark is a port of Apache Hive onto Spark that is fully compatible with, and up to 100x faster than, Hive. Read more.
Ballroom AB
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Ryan Tabora (Think Big Analytics), Jason Rutherglen (Datastax)
Average rating: **...
(2.31, 13 ratings)
In this hands-on tutorial, you will learn the importance of distributed search through our industry experience and knowledge of real use cases. We’ll introduce different architectures that incorporate distributed search techniques, and share the pain points we experienced and the lessons we learned. For the hands-on part of the tutorial, you will learn how to install and use Apache Solr for real-time search on big data. Read more.
Ballroom H
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Ryan Boyd (Google), Michael Manoochehri (Google, Inc.), Julia Ferraioli (Google)
Average rating: **...
(2.57, 7 ratings)
When data volume and velocity become massive, processing and analysis solutions require specialized technologies for different parts of the data pipeline. Google’s Cloud Platform is designed to help you focus on building applications, not infrastructure. We’ll demonstrate how to build end-to-end Big Data applications, from data collection to analysis to reporting and visualization. Read more.
Room 204
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Matei Zaharia (Databricks), Reynold Xin (Databricks), Andy Konwinski (UC Berkeley), Tathagata Das (Databricks), Patrick Wendell (Databricks)
Average rating: ****.
(4.00, 1 rating)
Building on our previous tutorial introducing BDAS, the open-source Berkeley Data Analytics Stack, in this tutorial we will provide each audience member with a Spark/Shark cluster on EC2 and walk through hands-on coding examples. Lessons will cover the Spark and Shark command line interfaces, writing a standalone program, and data clustering using a distributed machine learning algorithm on Spark. Read more.
Great America Ballroom J
Sharmila Shahani-Mulligan (ClearStory Data), Matei Zaharia (Databricks), Stephanie McReynolds (ClearStory Data)
Average rating: ****.
(4.00, 2 ratings)
AMPLab’s open source data analysis projects, Spark and Shark, deliver iterative queries up to 100x faster than Hadoop MapReduce. Hear how companies are using Spark-based data platforms for fast, interactive analysis on big data. Read more.
Great America Ballroom J
Tomer Shiran (Apache Foundation/MapR)
This session is an overview of Apache Drill, another big data system inspired by a Google white paper. Read more.
Great America Ballroom J
Bahman Bahmani (Stanford University)
Average rating: ****.
(4.25, 4 ratings)
In many modern web and big data applications the data arrives in a streaming fashion and needs to be processed on the fly. Due to the size of data, the computations need to be done incrementally, and hence sketches of data are used that take a small amount of memory but allow for fast updates and queries. We will present the techniques to design these sketches and provide clarifying examples. Read more.
Ballroom AB
Brian Granger (Cal Poly San Luis Obispo)
Average rating: ****.
(4.20, 5 ratings)
In this talk, I will introduce the IPython Notebook, an open-source, web-based interactive computing environment for Python and other languages. By enabling the data scientist to build documents that combine code, text, formulas, visualizations, images and video, the Notebook creates a foundation for data science that is interactive, repeatable, documented and sharable. Read more.
Great America Ballroom J
Mauricio Vacas (Accenture Technology Labs), Fausto Inestroza (Accenture Technology Labs), Sonali Parthasarathy (Accenture Technology Labs)
Average rating: ***..
(3.00, 2 ratings)
With the growth in volume and velocity of data, businesses need a scalable solution alongside batch processing to process events on the fly and provide real time insights. In this session, we will describe how we used Storm to analyze network data to detect causes of network performance degradation. Read more.
Great America Ballroom J
Tim O'Brien (O'Reilly Media)
Average rating: *****
(5.00, 7 ratings)
While the industry has been busy abandoning the relational database and calling it a fundamentally limited technology, several trends are conspiring to revive the good old RDBMS. While it might not resemble the MySQL or Oracle database you are running today, this talk will explore how hardware trends, software trends, and industry research are pointing to SQL, structure, and ACID at scale. Read more.
Great America Ballroom J
Jim Kelly (Quantcast)
This talk introduces an open-source distributed file system that will double the capacity of your Hadoop cluster and speed up your MapReduce jobs. The talk will describe the Reed-Solomon implementation and its implications for cluster performance, and how it leverages the speed of modern networks to achieve better storage efficiency and make Hadoop jobs run faster. Read more.
Great America Ballroom J
Stephan Ellner (Google), Jeff Shute (Google)
Average rating: *****
(5.00, 5 ratings)
Many of the services that are critical to Google’s ad business have historically been backed by MySQL. We have recently migrated several of these services to F1, a new RDBMS developed at Google. F1 implements rich relational database features, including a strictly enforced schema, a powerful parallel SQL query engine, general transactions, change tracking and notification, and indexing. Read more.
Great America Ballroom J
C. Aaron Cois (Carnegie Mellon University, Software Engineering Institute), Tim Palko (Carnegie Mellon University, Software Engineering Institute)
Average rating: ****.
(4.67, 3 ratings)
In this talk, we describe using Redis, an open source, in-memory key-value store, to capture large volumes of data from numerous remote sources while also allowing real-time monitoring and analytics. With this approach, we were able to capture a high volume of continuous data from numerous remote environmental sensors while consistently querying our database for real time monitoring and analytics. Read more.
Great America Ballroom J
John A. De Goes (Precog)
This talk discusses the market needs that are giving birth to the "scientific database", what these systems have to offer that is currently lacking in either the data management or statistical worlds, and how scientific databases will co-exist and co-evolve with Hadoop and other leading big data platforms. Read more.
Great America Ballroom J
Justin Erickson (Cloudera)
Average rating: ****.
(4.33, 3 ratings)
The Cloudera Impala project is for the first time making scalable parallel database technology, which is the underpinning of Google's Dremel as well as that of commercial analytic DBMSs, available to the Hadoop community. Read more.
Great America Ballroom J
Eric Tschetter (Metamarkets), Danny Yuan (Netflix Platform Engineering Team)
This talk will discuss how Druid allows users to have interactive queries on real-time data at scale; we feature a case study with Netflix leveraging Druid to obtain at-the-moment insight as it ingests over two terabytes per hour. Read more.

Sponsorship Opportunities

For information on exhibition and sponsorship opportunities at the conference, contact Susan Stewart at sstewart@oreilly.com

Media Partner Opportunities

For information on trade opportunities with O'Reilly conferences, contact Kathy Yu at mediapartners@oreilly.com

Press and Media

For media-related inquiries, contact Maureen Jennings at maureen@oreilly.com
