HBase is one of the NoSQL data stores that have emerged in recent years, and it has been gaining popularity quickly. It is an open source implementation of Google's Bigtable and is part of the Hadoop ecosystem. HBase is known to scale to hundreds of nodes, providing fast random access to terabytes and petabytes of data. This tutorial will get you started with HBase so you can build a scalable application of your own.
We’ll accomplish this by covering the following aspects:
At the end of the tutorial, you’ll have an understanding of how to build applications that use HBase as the backend store.
Requirement: Bring a laptop (Mac or Linux, or one with access to an EC2 instance) and, if possible, download the HBase 0.94.1 tarball from the Apache website (http://hbase.apache.org) beforehand so we can get to work right away. The tutorial includes hands-on exercises.
Amandeep is a Solutions Architect at Cloudera, where he’s involved in the entire lifecycle of Hadoop adoption for customers, from use case discovery to taking systems to production. Amandeep is also a co-author of HBase in Action, a book geared towards building applications using HBase. Prior to Cloudera, Amandeep was at Amazon Web Services, where he was part of the Elastic MapReduce team and built the first version of EMR’s HBase offering.
Software Engineer at Cloudera, currently focused on the Apache HBase project.
Comments
Would the Cloudera CDH3 version be OK?
Name : hadoop-hbase Version : 0.90.6+84.73 Repo : cloudera-cdh3
Are there any best practices for serving low-latency random reads from HBase using a cluster that is simultaneously running a lot of MapReduce jobs? More specifically, how do you keep the MapReduce jobs from creating intermittent, large spikes in read latency? Is replication typically the best option for dealing with this?
Matthew, not really. Any Linux instance should do fine as long as you are able to connect to it from your laptop. I’d recommend not using EC2 because you’ll need reliable internet connectivity while you are doing the exercises.
Jack, we’ll work with standalone. You don’t need Hadoop installed. In fact, it’s cleaner to keep Hadoop out of the picture for this tutorial.
Any configuration suggestions for EC2 instances?
Hi Amandeep, are we going to run some examples on top of a pseudo cluster? If we do, does the cluster version matter? I have installed Hadoop 1.0.4, but HBase 0.94.1 has a hadoop-core-1.0.3.jar in its lib directory. Does this matter? Thanks.
0.94.0 would work just fine and so would 0.92.x. We’ll be doing some basic work with the API and any of those versions would suffice.
I’m just preparing my Mac laptop for the tutorial on Tuesday. I have been using the ‘brew’ package manager to install Hadoop and HBase. The latest version of HBase currently supported by brew is 0.94.0. Is there anything critical in the upgrade to 0.94.1 that is needed for this tutorial? If so, I could take a stab at updating the brew formula. I think it just involves pointing it to the correct tarball.