Hadoop & Beyond

This track goes beyond the basic Hadoop platform, to look at other database models such as Cassandra and Mongo; cloud data services like BigQuery and Redshift; emerging innovations like Spark, Shark and BDAS; graph analytics; and realtime data processing.
Who should attend: Architects and academics looking to push the envelope; scientists and developers facing large-scale, domain-specific challenges that tax existing models and methods.

Track Hosts

Justin Borgman is Co-Founder and CEO of Hadapt. Prior to Hadapt, Justin led product development for COVECTRA, an anti-counterfeit technology firm. Before that, Justin founded an online social media company and spent the first six years of his career as a software developer at MIT Lincoln Laboratory and Raytheon.

Ted Dunning has been involved with a number of startups, the latest being MapR Technologies, where he is Chief Application Architect working on advanced Hadoop-related technologies. He is also a PMC member for the Apache ZooKeeper and Mahout projects. Opinionated about software and data mining and passionate about open source, he is an active participant in the Hadoop and related communities and loves helping projects get going with new technologies.

Reynold Xin is an Apache Spark committer and the lead developer for Shark and GraphX, two computation frameworks built on top of Spark. He is also a co-founder of Databricks. Before Databricks, he was pursuing a PhD focusing on large-scale data systems in the UC Berkeley AMPLab.

Grand Ballroom West
Tutorial. Please note: to attend, your registration must include Tutorials on Monday.
Israel Ekpo (Walt Disney Parks and Resorts Online)
Average rating: *....
(1.19, 47 ratings)
This is a 3-hour tutorial on how to use Apache Flume to aggregate massive quantities of structured or unstructured data from sources such as log data, click streams, social media data, graph data and network traffic into centralized data stores such as HDFS, ElasticSearch, Neo4j and MongoDB so that they can be processed, digested and visualized in realtime using D3.js and HTML5 WebSockets.
Rhinelander South
Tutorial. Please note: to attend, your registration must include Tutorials on Monday.
Tathagata Das (University of California Berkeley), Haoyuan Li (UC Berkeley), Ion Stoica (UC Berkeley), Reynold Xin (Databricks), Sameer Agarwal (UC Berkeley)
Average rating: ****.
(4.80, 10 ratings)
An introduction to the open-source Berkeley Data Analytics Stack (BDAS). Spark is a high-speed cluster computing engine that supports rich analytics (e.g. machine learning) and lower-latency processing (e.g. streaming). Tachyon provides in-memory storage, letting Spark and Hadoop jobs share data efficiently. Shark and GraphX provide high-speed Hive SQL queries and graph processing on top of Spark.
Regent Parlor
Tutorial. Please note: to attend, your registration must include Tutorials on Monday.
Patricia Gorla (The Last Pickle)
Average rating: ***..
(3.00, 12 ratings)
Before you analyze your big data, you need a way to store and access it. Here we examine the benefits of using a highly-available, eventually consistent storage system, and what impact this has on real-time analytics. This session will prepare you to set up a multi-node working Cassandra and Hadoop cluster.
Nassau Suite
Tutorial. Please note: to attend, your registration must include Tutorials on Monday.
Leah Hanson (Google)
Average rating: ****.
(4.00, 1 rating)
Julia is a high-performance, open source language with great tools for numerical and statistical work. If you know R, MATLAB, or NumPy, you will feel at home in Julia. Unlike these systems, however, Julia takes advantage of modern compiler technology, combining an intuitive programming model with the speed of a low-level language. This workshop will take you from installed to productive in Julia.
Gramercy Suite
Ahmed Radwan (Google's Motorola Mobility)
Average rating: ****.
(4.20, 5 ratings)
Multi-tenancy is a reality for large-scale data systems, but it poses concerns about exposure of sensitive data. Using anonymization techniques, sensitive data can be protected in ways that maintain user privacy while preserving the ability to use the data effectively for operational needs. In this talk, we explore the challenges and lessons learned in building solutions for data anonymization.
Gramercy Suite
Average rating: ***..
(3.00, 11 ratings)
There is increasing demand to discover and explore data iteratively, interactively, and for real-time insights, which we lump together under the term Real-Time Analytical Processing (RTAP). This talk presents our efforts and experience in building a real-time analytical processing framework for several large websites, leveraging Spark and Shark research from UC Berkeley.
Gramercy Suite
Julien Le Dem (Twitter), Nong Li (Cloudera)
Average rating: ****.
(4.50, 10 ratings)
Parquet is a columnar file format for Hadoop that brings performance and storage benefits. It supports deeply nested data structures and is easy to extend and integrate with existing type systems.
Gramercy Suite
Adam Fuchs (Sqrrl)
Average rating: **...
(2.75, 12 ratings)
The National Security Agency works with some of the world’s largest, most complex, and most sensitive datasets. In order to analyze this data, NSA has developed some powerful tools, such as Apache Accumulo. Come learn about NSA’s key lessons learned about building a Big Data platform from the former Technical Director of the Accumulo project at the NSA.
Gramercy Suite
Ari Gesher (Palantir Technologies), Danielle Kramer (Palantir Technologies)
Average rating: ***..
(3.67, 3 ratings)
AtlasDB is a bolt-on layer for key-value stores (distributed or otherwise) that implements MVCC and guarantees ACID properties for eventually consistent data stores. In this talk, we'll look at the protocol used to implement the transactions, discuss the performance tradeoffs of using transactions, and examine the transactions API it offers.
Murray Hill Suite
Colin Marc (Stripe)
Average rating: ***..
(3.33, 3 ratings)
Most startups don't start to think about having a real analytics platform until it's too late, and Stripe is certainly no exception. In this session, I'll describe how we approached building such a platform, and walk through the steps (and missteps) we took in making our production data available in Hadoop, in realtime, for processing and querying.
Murray Hill Suite
Carlos Guestrin (GraphLab Inc.), Joseph Gonzalez (UC Berkeley)
Average rating: ****.
(4.86, 7 ratings)
GraphLab is like Hadoop for graphs. Users express graph processing algorithms using a simple API and the GraphLab runtime efficiently executes that computation on multicore and distributed architectures. By leveraging advances in graph representation, asynchronous communication, and scheduling, GraphLab is able to achieve orders-of-magnitude performance gains over existing systems like Hadoop.
Murray Hill Suite
Dave Stokes (MySQL Community Team)
Average rating: **...
(2.40, 5 ratings)
MySQL 5.6 includes a NoSQL interface, using an integrated memcached daemon that can automatically store data and retrieve it from InnoDB tables, turning the MySQL server into a fast “key-value store” for single-row insert, update, or delete operations. This session explores using this interface and other 'simple' options for those with MySQL database instances seeking to explore big data access.

Contact Us

View a complete list of Strata + Hadoop World 2013 contacts