Search and Real-time Analytics on Big Data

Sewook Wee (Accenture), Ryan Tabora (Think Big Analytics), Jason Rutherglen (Datastax)
Hadoop: Case Studies Hadoop: Tools & Technology, Murray Hill (NY Hilton)
Tutorial. Please note: to attend, your registration must include Tutorials.

In this hands-on tutorial, you will learn why distributed search matters, drawing on our industry experience and a specific example. In particular, we’ll introduce an architecture that incorporates distributed search techniques and share the pain points we experienced and the lessons we learned. Building on that, we’ll survey the landscape of distributed search tools and their future directions. For the hands-on part of the tutorial, you will learn how to install and use Apache Solr for real-time Big Data analytics, search, and reporting. You’ll also learn some tricks of the trade and how to handle known issues.

  • Distributed Search – applications in industry
    • Search applications for structured data
    • Search applications for unstructured data
    • Geo-indexed search
    • Why distributed search? What happens as index size grows with data?
  • Use Case: Log Data
    • Requirements
      • Petabytes of semi-structured log data
      • Billions of parsed Solr documents
    • Apache Solr
      • Schema
      • Backups
      • Disaster Recovery
      • HBase splitting/region allocation
      • Duplicates
      • Joins
  • Technology Landscape
    • Solr/Lucene out of the box
    • Integrated Solr Solutions
      • Datastax Enterprise (DSE) 2.0
      • Lily
      • Lucid Imagination
      • Kitenga
      • Katta
    • Non-Solr Solutions
      • Riak
      • MongoDB
      • Amazon CloudSearch
      • Google BigQuery
  • Using the Big Data Search Tutorial Tools

We’ll email instructions to you before the tutorial so you can come prepared with the necessary tools installed and ready to go. This prior preparation will let us use the whole tutorial time to learn some of the fundamentals of the Lucene query language and other important topics. At the beginning of the tutorial we’ll show you how to use these tools.

  • Amazon AWS instance (preconfigured or set up by the user)
  • Set up Solr cores
    • Install
    • Load
  • Set up HBase
    • Install
    • Load
  • Writing Lucene Queries

We’ll spend most of the tutorial working through a series of hands-on exercises with actual Lucene queries, so you can learn by doing. We’ll go over all the main features of Lucene’s query language and how Lucene works with data in Hadoop.
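To give a feel for the kind of queries such exercises involve, here is a minimal sketch of common Lucene query forms sent to Solr's `/select` handler. The field names (`level`, `message`, `host`, `timestamp`), the core name `logs`, and the localhost URL are assumptions made for illustration; they are not taken from the tutorial material.

```python
from urllib.parse import urlencode

# Hypothetical field names for a log index; none come from the tutorial.
QUERIES = {
    "term": "level:ERROR",                            # exact term in one field
    "phrase": 'message:"connection refused"',         # words in sequence
    "boolean": "level:ERROR AND NOT host:web01",      # combine clauses
    "wildcard": "host:web*",                          # prefix match
    "range": "timestamp:[2012-10-01T00:00:00Z TO 2012-10-23T00:00:00Z]",
}


def select_url(query, base="http://localhost:8983/solr/logs/select", rows=10):
    """Build a Solr select URL; `base` is an assumed local core, not a real host."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base + "?" + urlencode(params)


for name, query in QUERIES.items():
    print(name, "->", select_url(query))
```

The sketch only builds URLs, so it runs without a Solr server; pointing `base` at a real core is left to the reader.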

  • Advanced Techniques

This section will cover advanced topics such as relevance ranking, facets, group by, sort by, and other important features for Big Data search projects. Lucene enables many types of customizations of the underlying technology.
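As a rough illustration of faceting, grouping, and sorting, the sketch below assembles the standard Solr request parameters for each feature. The field names (`host`, `service`, `timestamp`) and the query are hypothetical, and the helper `analytics_params` is introduced here only for illustration.

```python
from urllib.parse import urlencode


def analytics_params(query, facet_field, group_field, sort_field):
    """Return Solr request parameters combining sort, facets, and result grouping."""
    return [
        ("q", query),
        ("rows", 10),
        ("wt", "json"),
        ("sort", f"{sort_field} desc"),   # sort by: newest first
        ("facet", "true"),                # facets: counts per distinct value
        ("facet.field", facet_field),
        ("facet.mincount", 1),            # hide values with zero matches
        ("group", "true"),                # group by: collapse results per value
        ("group.field", group_field),
        ("group.limit", 3),               # documents returned per group
    ]


# Hypothetical fields for a log index.
query_string = urlencode(
    analytics_params("level:ERROR", "host", "service", "timestamp")
)
print(query_string)
```

Appending `query_string` to a core's `/select` URL would request facet counts per host alongside results grouped by service, assuming those fields exist in the schema.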

  • Lucene / Solr in the Hadoop Ecosystem

We’ll conclude with a discussion of Lucene’s place in the Hadoop ecosystem, including how it compares to other available tools. We’ll discuss installation and configuration issues that affect performance and ease of use in a real production cluster. In particular, we’ll discuss how to create an efficient Lucene secondary index on data stored in HBase, Cassandra, and other NoSQL databases.
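The idea of a secondary index can be sketched without any of the real systems: the primary store is keyed by row key, and a separate inverted index maps field terms back to row keys. The toy below uses plain Python dictionaries as stand-ins for both the key-value store and the search index. It is not the HBase, Cassandra, or Lucene API, and the class and method names are hypothetical.

```python
from collections import defaultdict


class SecondaryIndex:
    """Toy model: a dict plays the row store, a term->keys map plays the index."""

    def __init__(self):
        self.rows = {}                     # row key -> (record, indexed fields)
        self.postings = defaultdict(set)   # (field, token) -> set of row keys

    @staticmethod
    def _tokens(record, field):
        # Naive analyzer: lowercase and split on whitespace.
        return str(record.get(field, "")).lower().split()

    def delete(self, row_key):
        old = self.rows.pop(row_key, None)
        if old is None:
            return
        record, fields = old
        for field in fields:
            for token in self._tokens(record, field):
                self.postings[(field, token)].discard(row_key)

    def put(self, row_key, record, indexed_fields):
        # Drop stale postings first so an overwrite cannot leave dangling entries.
        self.delete(row_key)
        self.rows[row_key] = (record, tuple(indexed_fields))
        for field in indexed_fields:
            for token in self._tokens(record, field):
                self.postings[(field, token)].add(row_key)

    def search(self, field, token):
        # Look up row keys in the index, then fetch records from the row store.
        keys = sorted(self.postings.get((field, token.lower()), ()))
        return [(key, self.rows[key][0]) for key in keys]


index = SecondaryIndex()
index.put("row1", {"message": "Disk full", "host": "web01"}, ["message"])
index.put("row2", {"message": "disk ok", "host": "web02"}, ["message"])
print(index.search("message", "disk"))
```

The two-step lookup in `search` mirrors the general pattern: the index answers "which rows match?", and the primary store returns the full records. Keeping the two in sync on overwrite and delete is where much of the real-world difficulty lies.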

Sewook Wee

Accenture

Sewook Wee is an R&D manager at Accenture Technology Labs. His research is grounded in distributed systems, with a current emphasis on Big Data platform technologies. Recently, he led a Hadoop deployment comparison study that compared a bare-metal Hadoop cluster with Hadoop services (Amazon EMR) at the total-cost-of-ownership level, using three real-world workloads. Previously, he led various R&D projects, including a hybrid NoSQL approach that layers graph data management capabilities on column-oriented datastores; a MapReduce-based data transformation framework; a next-generation software architecture that maximizes the benefits of the cloud; MonteCloudo, an elastic Monte Carlo simulation architecture using the cloud; and a web server farm architecture in the AWS EC2 environment. Along with leading R&D projects, he publishes academic papers and business white papers, files patents, presents at both academic and industry conferences, and builds relationships with business partners and clients. He received MS and PhD degrees from Stanford University, and his alma mater is Seoul National University in South Korea.

Ryan Tabora

Think Big Analytics

Ryan is a data engineer at Think Big Analytics. He leads technical consulting projects for big data implementations at Fortune 500 clients. He has in-depth experience working with Solr/Lucene and the Hadoop stack.

Jason Rutherglen

Datastax

Jason is a Sr. Architect at Think Big Analytics. He has many years of experience writing Java application software, most recently for Hadoop-based applications.

Comments

Ryan Tabora
11/26/2012 2:48pm EST

Hi all.

If you want the slides, you can find them hosted on Strata’s website:

cdn.oreillystatic.com/en/as...

cdn.oreillystatic.com/en/as...

The bitly link I gave below will no longer work. If you are looking for the exercises please feel free to contact me at ryan.tabora@thinkbiganalytics.com and I can send you the material directly.

I wanted to say thank you to everyone who reviewed the presentation, we are making some changes to the 2013 material based on the feedback.

Thanks!

Ryan

Ryan Tabora
10/24/2012 8:37am EDT

Hi all,

You can download the slides at bit.ly/stratasearchslides

Thanks, Ryan

Ryan Tabora
10/23/2012 7:18am EDT

Hi all,

You can download the content at bit.ly/stratasearch.

If you want to follow along with a PC, please have PuTTY installed. Mac/Linux users only need a terminal with SSH.

Thank you, Ryan

Ophir Cohen
10/22/2012 9:14pm EDT

I also didn’t get the instructions, can you send to me as well: ophirc@liveperson.com Thanks!

Ibrahim Ulukaya
10/22/2012 7:31pm EDT

I haven’t received the instructions email, maybe because I registered late. Can you email me them? ulukaya@gmail.com Thanks.

alex HONG
10/22/2012 4:51pm EDT

Can you share the slides online?
