Everybody would like to process big data quickly and cheaply, but current tools like Hadoop don’t fit the bill. Hadoop’s batch model doesn’t provide answers in real time, MapReduce is inefficient for many algorithms, and running a cluster of machines can be expensive and time-consuming.
When processing the biggest data the batch model breaks down: there is simply too much data to store it all. To meet this challenge, a family of techniques known as streaming or on-line algorithms has been developed. These algorithms use a single pass over the data to estimate properties such as the most frequent items, or to learn classifiers and recommendation systems. It turns out that these techniques are a great fit for smaller companies that want big data to be cheap and fast. The algorithms are simple to implement, scale extraordinarily well, and return results in real time.
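As a rough illustration of the single-pass idea, the sketch below finds candidate frequent items with the Misra-Gries summary, one well-known heavy-hitter technique. It is not necessarily the algorithm covered in the talk, and the function name `misra_gries` and the parameter `k` are chosen here purely for illustration.

```python
def misra_gries(stream, k):
    """Single-pass frequent-item summary using at most k - 1 counters.

    Any item occurring more than len(stream) / k times is guaranteed to
    be among the returned keys. The counts are underestimates, and some
    returned keys may not be frequent at all, so a second pass is needed
    if exact counts matter.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the
            # ones that reach zero. The new item itself is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# "a" makes up half of this stream, so with k = 3 it must survive.
candidates = misra_gries("abacadaeaf", k=3)
```

Memory use depends only on `k`, not on the length of the stream, which is what makes this kind of summary practical when the data cannot be stored.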
In this talk I will discuss several key streaming algorithms:
- The Bloom filter and Count-Min sketch for counting item occurrence
- Heavy hitter and quantile algorithms for estimating frequency information
- Lock-free stochastic gradient descent for learning classifiers and recommendation systems
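To give a flavour of the first of these, here is a minimal Count-Min sketch in Python. It is a sketch under stated assumptions rather than a reference implementation: the class name, the default `width` and `depth`, and the use of salted BLAKE2b as the family of hash functions are all choices made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates never fall below the true count."""

    def __init__(self, width=1000, depth=5):
        # width and depth are illustrative defaults; larger values
        # trade memory for lower error.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash function per row, derived by salting BLAKE2b with
        # the row number (an assumption made for this example).
        data = item.encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum across
        # rows is the tightest available upper bound.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for word in ["spark", "hadoop", "spark", "storm", "spark"]:
    sketch.add(word)
```

After this loop, `sketch.estimate("spark")` is at least 3. The table has a fixed size however many distinct items arrive, and the price paid is that hash collisions can make an estimate too high.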
Noel has over fifteen years’ experience in software architecture and development, and over a decade in machine learning and data mining. Projects he has been involved with include one of the first commercial products to apply machine learning to the Internet (eventually acquired by Omniture), a BAFTA award-winning website, and a custom CMS used daily by thousands of students.
Noel is an active writer, presenter, and open source contributor. Noel has a PhD in machine learning from the University of Birmingham.