RHadoop, R meets Hadoop

Antonio Piccolboni (Per data LLC)

Rhadoop is an open source project spearheaded by Revolution Analytics to grant data scientists access to Hadoop’s scalability from their favorite language, R. RHadoop is comprised of three packages.

  • rhdfs provides file level manipulation for HDFS, the Hadoop file system
  • rhbase provides access to HBASE, the hadoop database
  • rmr allows to write mapreduce programs in R

rmr allows R developers to program in the mapreduce framework, and to all developers provides an alternative way to implement mapreduce programs that strikes a delicate compromise betwen power and usability. It allows to write general mapreduce programs, offering the full power and ecosystem of an existing, established programming language. It doesn’t force you to replace the R interpreter with a special run-time — it is just a library. You can write logistic regression in half a page and even understand it. It feels and behaves almost like the usual R iteration and aggregation primitives. It is comprised of a handful of functions with a modest number of arguments and sensible defaults that combine in many useful ways. But there is no way to prove that an API works: one can only show examples of what it allows to do and we will do that covering a few from machine learning and statistics. Finally, we will discuss how to get involved.

Photo of Antonio Piccolboni

Antonio Piccolboni

Per data LLC

Antonio Piccolboni is a data scientist with both industrial and academic experience. His recent work includes the design and implementation of a big data analysis package in R, social network analysis for a top 20 global web site and web analytics for a major web ratings company. He is currently an independent consultant with clients including Dataspora and Revolution Analytics. He blogs at blog.piccolboni.info about big data and analytics. His papers have received more than 800 citations and his Erdős number is 3.

Sponsors

  • EMC
  • Microsoft
  • HPCC Systems™ from LexisNexis® Risk Solutions
  • MarkLogic
  • Shared Learning Collaborative
  • Cloudera
  • Digital Reasoning Systems
  • Pentaho
  • Rackspace Hosting
  • Teradata Aster
  • VMware
  • IBM
  • NetApp
  • Oracle
  • 1010data
  • 10gen
  • Acxiom
  • Amazon Web Services
  • Calpont
  • Cisco
  • Couchbase
  • Cray
  • Datameer
  • DataSift
  • DataStax
  • Esri
  • Facebook
  • Feedzai
  • Hadapt
  • Hortonworks
  • Impetus
  • Jaspersoft
  • Karmasphere
  • Lucid Imagination
  • MapR Technologies
  • Pervasive
  • Platform Computing
  • Revolution Analytics
  • Scaleout Software
  • Skytree, Inc.
  • Splunk
  • Tableau Software
  • Talend

For information on exhibition and sponsorship opportunities at the conference, contact Susan Stewart at sstewart@oreilly.com.

For information on trade opportunities with O'Reilly conferences contact Kathy Yu at mediapartners
@oreilly.com

For media-related inquiries, contact Maureen Jennings at maureen@oreilly.com

View a complete list of Strata contacts