The Two Most Important Algorithms in Predictive Modeling Today

Jeremy Howard (Kaggle), Mike Bowles (Biomatica)
Data Science, Ballroom CD
Please note: to attend, your registration must include Tutorials.
Average rating: 4.44 (9 ratings)

When doing predictive modelling, there are two situations in which you might find yourself:

  • You need to fit a well-defined parameterised model to your data, so you require a learning algorithm which can find those parameters on a large data set without over-fitting
  • You just need a “black box” which can predict your dependent variable as accurately as possible, so you need a learning algorithm which can automatically identify the structure, interactions, and relationships in the data

For the first case, lasso and elastic-net regularized generalized linear models are a set of modern algorithms that meet these needs. They are fast, scale to very large data sets, and limit over-fitting through a regularization penalty whose strength is typically chosen by cross-validation. They are available in the “glmnet” package in R.
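
To make the idea of a regularization penalty concrete, here is a minimal pure-Python sketch of coordinate descent for an elastic-net penalized linear model. It is an illustration only, not the glmnet code, and it assumes the features and response have already been centred so no intercept is needed. The function names `soft_threshold` and `elastic_net_fit` and the parameters `lam`, `alpha`, and `n_sweeps` are hypothetical names chosen for this sketch; `alpha=1` gives the lasso penalty and `alpha=0` a ridge penalty.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; anything within [-gamma, gamma] becomes 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_fit(X, y, lam, alpha=1.0, n_sweeps=200):
    """Minimise (1/2n)*||y - X b||^2 + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2).

    X is a list of rows, y a list of responses. Assumes centred data
    (no intercept term), which is a simplification made for this sketch.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    for _ in range(n_sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            # Covariance of feature j with the residual that ignores feature j.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            scale = sum(c * c for c in col) / n + lam * (1.0 - alpha)
            new_b = soft_threshold(rho, lam * alpha) / scale
            delta = new_b - beta[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated coefficient.
                residual = [r - c * delta for c, r in zip(col, residual)]
                beta[j] = new_b
    return beta


if __name__ == "__main__":
    # Tiny made-up data set: y depends on the first feature only.
    X_demo = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
    y_demo = [2.0, -2.0, 2.0, -2.0]
    for lam in (0.0, 0.5, 3.0):
        print(lam, elastic_net_fit(X_demo, y_demo, lam))
```

Larger values of `lam` shrink the coefficients further, and with the lasso penalty some become exactly zero, which is how the method selects variables. In practice the penalty strength would be picked by cross-validation rather than by hand.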

For the second case, ensembles of decision trees (often known as “Random Forests”) have been the most successful general-purpose algorithm in recent years. For instance, most Kaggle competitions have at least one top entry that relies heavily on this approach. The algorithm is simple to understand, and is fast and easy to apply. It is available in the “randomForest” package in R.
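
As a rough illustration of what an ensemble of decision trees does, the sketch below grows each tree on a bootstrap sample of the rows, considers only a random subset of features at each split, and predicts by majority vote. It is a simplified pure-Python sketch for classification, not the randomForest package's implementation. All function names are hypothetical, and `n_try` plays a role similar to that package's `mtry` setting. Class labels are assumed to be strings or integers, not tuples.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Return the (feature, threshold) that most reduces weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, n_try, rng, depth=0, max_depth=5):
    """Grow one tree; internal nodes are tuples, leaves are class labels."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority
    # Only a random subset of features is considered at each split.
    feature_ids = rng.sample(range(len(rows[0])), n_try)
    split = best_split(rows, labels, feature_ids)
    if split is None:
        return majority
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f,
        t,
        grow_tree([r for r, _ in left], [y for _, y in left], n_try, rng, depth + 1, max_depth),
        grow_tree([r for r, _ in right], [y for _, y in right], n_try, rng, depth + 1, max_depth),
    )


def predict_tree(node, row):
    """Walk from the root down to a leaf and return its label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(rows, labels, n_trees=25, n_try=1, seed=0):
    """Fit n_trees trees, each on a bootstrap sample of the rows."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], n_try, rng))
    return forest


def predict_forest(forest, row):
    """Majority vote across all trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]
```

A call such as `predict_forest(fit_forest(rows, labels), new_row)` returns the class most trees voted for. Averaging many trees that each see slightly different data is what makes the ensemble more stable than any single tree.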

Mike and Jeremy will explain in simple terms, without complex math, how these algorithms work, and will show through numerous examples how to apply them in R. They will also advise on how to choose between these algorithms, how to prepare the data, and how to use the trained models in practice.

Jeremy Howard

Kaggle

Jeremy Howard is President and Chief Scientist at Kaggle. Previously, he founded FastMail (sold to Opera Software) and Optimal Decisions (sold to ChoicePoint, now called LexisNexis Risk Solutions). Prior to that he worked in management consulting, at McKinsey & Company and A.T. Kearney. Jeremy’s passion is applying algorithms to data. At FastMail he used algorithms to automate nearly every part of the business; as a result the company needed a total of only three full-time staff, and got over a million signups. Optimal Decisions was a business built entirely to commercialise a new algorithm he designed for the optimal pricing of insurance. Jeremy competes regularly in data mining competitions, which he uses to test himself and stay on the leading edge of machine learning and predictive modelling technology. He is currently ranked #1 on Kaggle’s overall competitor rankings, out of over 16,000 data scientists.

Mike Bowles

Biomatica

Dr Mike Bowles’s career is one of the most extraordinary in Silicon Valley. Mike started out in research, as an assistant professor at MIT. He went on to found and run two companies, both of which went on to huge IPOs. The first was Com21, an early pioneer in developing cable modem networks, which Mike led to a successful NASDAQ IPO at a $300m valuation. He then created IBeam Broadcasting, a video distribution network, which he led to a $3b IPO after just 2.5 years. More recently he has been active as co-founder and instructor for the series of data mining courses run at Hacker Dojo. These courses nearly always sell out, and have received great feedback from participants.

Comments on this page are now closed.

Comments

JK LONDON
01/07/2013 6:38am PST

Hi – still very interested in seeing the correct video, so the first one from this series that goes into random forests algo, is there a link?

Mark Madsen
11/11/2012 10:27pm PST

The video link is for the wrong presentation. This was a tutorial, available on the conference DVD/for online viewing. Someone needs to remove the video link from this page. Also, someone should replace the PDF that doesn’t read properly with the DOC Jeremy uploaded on another site.

Carlos Gonzalez
08/21/2012 4:54pm PDT

The video link appears to be incorrect: it points to the session “From Predictive Modelling to Optimization”. Is there a correct link for the video?

Kathy Yu
03/28/2012 4:57pm PDT

@Pedro The video can be seen at youtu.be/vYrWTDxoeGg – we’ve also just updated this session page with the links to the video & the free book.

Pedro Bizarro
03/28/2012 4:19pm PDT

Hi there, will the video be made available? Even if not here, the Strata newsletter announced that one could get Jeremy’s new book and also see the video, but I cannot see the link for the video. Any help?

Nathan Wenzel
03/09/2012 11:14am PST

Thanks Jeremy. The doc file comes through cleanly. Just tried the ZIP file again. The PDF still renders with boxes around “?” where each character should be. Odd that I’m the only one reporting the problem.

Once again, great talk! Thanks Mike and Jeremy.

Jeremy Howard
03/08/2012 1:11pm PST

That’s odd. The PDF is working for me. I’ve popped the .doc file here: public.jhoward.fastmail.fm/... . Let me know if that works ok for you.

Nathan Wenzel
03/07/2012 9:33pm PST

Anyone else having trouble with the PDF? The text does not display correctly. I just get a bunch of ????.

Joel Hennig
03/02/2012 9:34am PST

Where have the slides/code been posted? Please let me know. This was a great session!!

Mike Bowles
02/29/2012 2:23pm PST

The slides and code for this presentation will be available shortly.

Marcos Sainz
02/28/2012 8:18pm PST

Hi, I was wondering when and where you’ll be posting the slides, code, and data from this talk. Thank you!
