Data Bootcamp

Joseph Adler (LinkedIn, Inc.), Hilary Mason (bitly), Drew Conway (IA Ventures), Jake Hofman (Yahoo!)
Practitioner
Location: Mission City M
Average rating: 3.17 of 5 stars (29 ratings)

This tutorial offers a basic introduction to practicing data science. We’ll walk through several typical projects that range from conceptualization to acquiring data, to analyzing and visualizing it, to drawing conclusions.

We assume familiarity with the command line and the ability to use libraries and code.

Topics covered include:

  • Data acquisition and cleaning
  • Building practical data storage, analysis, and production systems
  • Visualizing data for exploration and presentation
  • Learning from data
  • Building a data science team
  • Privacy and security issues

Tutorial outline

  • Introductions and admin
    • Who we are, why we love data
    • Motivations for why now is the time to learn “data science”
  • Working with image data
    • We will discuss examples of how to classify and cluster images based on color values and intensities, introduce the K-nearest neighbor approach to this classification, and show visualizations of image data histograms.
    • Primary concepts
      • Acquiring data from APIs
      • Feature space representations
      • Supervised learning: K-nearest neighbors classification
      • Unsupervised learning: K-means clustering
    • Examples
    • Visualization
      • Color intensity histograms
      • k-nearest neighbor plots
  • Working with text data
    • During this section we will cover acquiring and cleaning semi-structured e-mail and LinkedIn data (via CSV export). We will explore both data sets at the command line, merge the data, and test some basic machine learning concepts on the data. Throughout, various visualization techniques will be used to explore the data, particularly the social graph that arises from e-mail.
    • Primary concepts
      • Acquisition and cleaning of text data
      • Merges and joins
      • Classification via Naive Bayes, and why k-nearest neighbors does not work for text
    • Data (email and LinkedIn contacts)
    • Visualizations
      • Time-series of emails (periodicity)
      • Frequent correspondents
      • Heat map for LinkedIn contacts in U.S.
      • E-mail graph
      • Geo->IP->Map plots
  • Big Data
    • Data Storage
      • Flat files
      • Relational databases (SQL)
      • Other databases (NoSQL)
    • Processing
      • Serial
      • Parallel (Hadoop, MPI, etc)
    • Where to keep your data
      • Machines you own/control
      • The cloud
    • Privacy and Security Considerations
      • Why you should care
      • Quick overview of applicable laws
  • Data mashups
  • Concluding panel discussion
    • Building a data science team
    • General Q&A
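
As a rough illustration of the image section of the outline (color-intensity histograms as features, then k-nearest neighbors classification), here is a minimal sketch in plain Python. It is not the instructors' code: the function names `intensity_histogram` and `knn_classify`, the four-bin histogram, and the tiny synthetic "images" (flat lists of 0-255 grayscale values) are all assumptions made for illustration.

```python
from collections import Counter
import math

def intensity_histogram(pixels, bins=4):
    """Return a normalized histogram of 0-255 intensity values."""
    counts = [0] * bins
    width = 256 / bins
    for p in pixels:
        # Clamp to the last bin so a value of 255 never overflows.
        counts[min(int(p / width), bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]

def knn_classify(query, examples, k=3):
    """Label `query` by majority vote among its k nearest examples.

    `examples` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: three dark and three bright "images".
dark = [[10, 20, 30, 40], [5, 60, 15, 50], [0, 0, 63, 70]]
bright = [[250, 240, 230, 200], [255, 210, 199, 220], [192, 245, 250, 180]]
training = [(intensity_histogram(img), "dark") for img in dark]
training += [(intensity_histogram(img), "bright") for img in bright]

label = knn_classify(intensity_histogram([12, 33, 48, 90]), training, k=3)
print(label)
```

With real photographs the pixel lists would come from an imaging library and there would be one histogram per color channel, but the feature-space idea is the same: each image becomes a short vector, and nearby vectors tend to share a label.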

Software Requirements

For those Data Bootcamp participants who wish to follow along with the instructors, there are several software tools that you will need to have pre-installed. If you do not wish to practice during the session, it is not necessary to install these tools before the bootcamp, but you will need them to replicate the methods on your own.

For those running a UNIX distribution or Mac OS X, most of the base tools (bash and Python in particular) are typically already installed; R may need to be installed separately from CRAN. Beyond that, you will only need to make sure that you have the supporting packages listed below. Windows users will need to install the tools separately from binaries, which you can download at the following sites:

- Command-line tools (sed, awk, grep): we recommend you install Cygwin to emulate a UNIX-like environment.
- Python: http://www.python.org/download/windows/
- R: http://cran.r-project.org/bin/windows/base/

UNIX bash

A large part of analyzing data is dealing with structured and unstructured text. As such, there are several command-line tools that allow for “quick and dirty” handling of this data. For this tutorial we will rely on the following set, which come with any UNIX-like distribution:

- sed
- awk
- grep

Python

Python is a powerful high-level scripting language that is well suited for manipulating and analyzing data of all kinds. There are a number of Python libraries for analyzing data, but for this tutorial we will focus on the following:

- email: For parsing email data
- Natural Language Toolkit (NLTK):  Powerful set of tools for performing natural language processing on text
- NumPy, SciPy, matplotlib: A trio of scientific computing libraries in Python that provide data types and functions for numeric and statistical analysis, as well as visualization
- Python Imaging Library (PIL): For loading image data for statistical analysis
- NetworkX: For the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
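
To give a flavor of the first library in the list, the following sketch parses one message with the standard-library `email` package and extracts the sender and recipients, which is the raw material for an e-mail graph. The message text and addresses are invented for illustration; the tutorial's own data and code are not reproduced here.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr

# A made-up message; real data would come from a mailbox export.
RAW = """\
From: Alice Example <alice@example.com>
To: Bob Example <bob@example.com>, carol@example.com
Subject: Lunch on Friday?
Date: Tue, 01 Feb 2011 11:55:00 -0800

Are you both free at noon?
"""

msg = message_from_string(RAW)

# parseaddr splits "Name <addr>" into a (name, address) pair.
sender = parseaddr(msg["From"])[1]

# getaddresses handles comma-separated recipient lists.
recipients = [addr for _, addr in getaddresses(msg.get_all("To", []))]

# Each (sender, recipient) pair is one directed edge of the e-mail graph.
edges = [(sender, r) for r in recipients]
print(msg["Subject"])
print(edges)
```

A list of edges like this could then be handed to a graph library such as NetworkX, or exported for a visualization tool.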

There are a few ways to install Python packages, but we recommend either of the following. If you have Python setuptools installed, you can download and install each of the above libraries with the following command:

$ easy_install {package_name}

For example, to install NetworkX simply type:

$ easy_install networkx

You can also install packages from source by downloading the source files at the sites referenced above. Simply unarchive the source code, navigate to the folder where the source code is located, and use the following command:

$ python setup.py install
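
After installing, it can help to confirm that each library is importable before the session. The sketch below is one way to do that. It assumes a Python 3 interpreter, and the import names listed (`nltk`, `numpy`, `scipy`, `matplotlib`, `PIL`, `networkx`) are the usual module names for the packages above rather than something specified by the tutorial.

```python
import importlib.util

# Usual import names for the libraries listed above (an assumption of this sketch).
REQUIRED = ["email", "nltk", "numpy", "scipy", "matplotlib", "PIL", "networkx"]

def missing_modules(names):
    """Return the subset of `names` that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required libraries are importable.")
```

Any name reported as missing can be installed with one of the two methods described above.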

R

The R statistical programming language has become the de facto lingua franca for statistical analysis. There are thousands of R packages available on CRAN to perform any number of analyses. For the purposes of this tutorial we will use the extremely powerful ggplot2 package by Hadley Wickham for data visualization.

To install packages in R we use the install.packages command:

install.packages("ggplot2", dependencies=TRUE)

Note that ggplot2 requires several other packages, so on a new R installation this may take a few minutes.

Additional Software

During the tutorial there will be an opportunity to visualize network relationships. A very useful tool for visualizing networks is Gephi, which is a standalone application. If you wish to follow along with this portion of the tutorial, please download and install Gephi.

Joseph Adler

LinkedIn, Inc.

Joseph Adler has many years of experience in data mining and data analysis at companies including DoubleClick, American Express, and VeriSign. He graduated from MIT with a B.Sc. and M.Eng. in Computer Science and Electrical Engineering. He is the inventor of several patents for computer security and cryptography, and the author of “Baseball Hacks” and “R in a Nutshell”. Currently, he is a principal data scientist at LinkedIn.

Hilary Mason

bitly

Hilary Mason is the Chief Scientist at bit.ly, where she finds sense in vast data sets. Her work involves both pure research and development of product-focused features.

She’s also a co-founder of HackNY, a non-profit organization that connects talented student hackers from around the world with startups in NYC.

Hilary recently started the data science blog Dataists and is a member of hacker collective NYC Resistor.

She has discovered two new species, loves to bake cookies, and asks way too many questions.

Drew Conway

IA Ventures

Drew Conway is an expert in the application of computational methods to social and behavioral problems at large scale. Drew has been writing and speaking about the role of data, and the discipline of data science, in industry, government, and academia for several years. Drew has advised and consulted companies across many industries, ranging from fledgling start-ups to Fortune 100 companies, as well as academic institutions and federal agencies. Drew is a co-founder of DataKind (a non-profit connecting social organizations with data scientists), the author of Machine Learning for Hackers (O’Reilly Media, 2012), a co-chair of the DataGotham conference, and is currently serving as the Scientist-in-Residence at IA Ventures. Drew is also completing his doctoral work in the Department of Politics at New York University. Prior to graduate school, Drew worked in the U.S. Intelligence Community in Washington, DC. There, he was an all-sources analyst specializing in the mathematical modeling of social systems.

Jake Hofman

Yahoo!

Jake Hofman is a member of the Human Social Dynamics group at Yahoo! Research. His work involves data-driven modeling of social data, focusing on applications of machine learning and statistical inference to large-scale data. He holds a B.S. in Electrical Engineering from Boston University and a Ph.D. in Physics from Columbia University.

Comments

Peter Clark
02/08/2011 5:17pm PST

I really wanted to like this session, but the bootcamp was more of an overview. It tried to cover half-a-semester’s worth of machine learning topics, python, R, and a specific graphics library for R in a day. Way too much material – drinking from a firehose. The presenters clearly did a ton of work getting this all together, but it’s way too much unless you already know the material being covered.

John D Putman
02/07/2011 10:47am PST

Not previewing the formatting and effectiveness / visibility of even “open source tool” produced presentations is inexcusable when that comes out as bad as this [dirty and worn] “boot camp” did.

jerome cukier
02/04/2011 12:07am PST

I found the boot camp both fascinating and frustrating. Fascinating because it manages to give a glimpse of what’s within reach once one has a grip on fairly basic skills. Frustrating as it wasn’t hands-on and therefore I didn’t instantly turn into a data scientist as I had hoped. I would have thought we would have used our computers a bit more or at least that there would be more interaction. I am giving a very positive note nonetheless because the session succeeded in motivating me to learn what I don’t know in my ample free time.

Charles Engelke
02/02/2011 2:19pm PST

Not really a boot camp, because I don’t see how you could work at that level of detail for such a huge group. I found it to be a good survey.

Ernest Mueller
02/02/2011 8:59am PST

I wasn’t very pleased with the first half of this session (after the first half, I bailed for another session). I wish all the install prerequisites had been made more clear ahead of time. Other session presenters brought USB keys with all the materials to avoid this problem. There were no power strips so people’s computers were giving out, and the wireless was so bad I couldn’t download even the slides and code .zip bundle within a couple hours. And the LaTeX slides were often illegible.

Besides the technical snafus, I’m not sure who the talks were targeted to. If you weren’t already skilled at statistics and using R and SciPy, the super fast and cursory overview (liberally spiced with “you’ve all done this before, right?”) wouldn’t bring anyone up to speed – and if anyone had done it before, then a “hello world” kind of simple example in them wouldn’t give them anything extra either. I would give the feedback that the talks should be retooled (both in content and logistics) to either be clearly onboarding for beginners or pro tips for experts. The current content falls into the middle field of “neither.”

I appreciate all the hard work the presenters did to put together the session and am sorry to be critical, but I think there needs to be some meta-thought put into this session before re-delivering it in whole or in part.

Philip Kromer
02/01/2011 11:55am PST

@jerome see github.com/drewconway/strat...—also you can ssh to demo@bootcamp.infochimps.com to use a machine with most of what you need already installed

Drew Conway
02/01/2011 11:55am PST

@Jerome, you can get all the materials here: github.com/drewconway/strat... or if you just want the slides and code go to bit.ly/campyslides

Jerome Basdevant
02/01/2011 11:29am PST

Hi, I arrived a bit late. Where could I grab the code examples and slides?

Peter Clark
02/01/2011 10:17am PST

For mac os/x folk, look at stronginference.com/scipy-s...

Charles Engelke
01/21/2011 7:16am PST

For Windows users, easy_install reports errors installing matplotlib and warnings installing scipy. I went to their websites instead and used the binary installers for Windows, which seems to handle everything okay. (For Python 2.7, you need to go to the beta version of SciPy to download a compatible binary.)

Elisabeth Robson
01/11/2011 3:13pm PST

Thanks JB. I’ve installed (or attempted to install) all the libs; about half seem fine, half had errors. Hopefully we can sort it out at the boot camp! Might be good to have a little pre-help session for those who might not have the libraries installed correctly.

J.B. Wheatley
01/10/2011 5:09pm PST

Elisabeth, we’ve added Software Requirements to the session description above. Let us know if you have any questions!

Elisabeth Robson
01/06/2011 12:10pm PST

Should we have anything in particular installed on our machines?

Drew Conway
01/04/2011 9:55am PST

@Rob

Each of the sections involves the use of the command-line and writing and running code, but is not the entirety of what will be covered. Active participation is completely up to you, and you are welcome to be involved in as much or as little as you like.

Also, all code examples will be made available on the day of the tutorial for you to take back with you and practice.

Rob Wiley
01/01/2011 4:07pm PST

Can you provide a little more guidance as to how much “familiarity with the command line and the ability to use libraries and code” will be needed for the day to be beneficial? I have experience in both and can easily follow examples, but I have been out of active coding for several years. How much of the day will be active participation vs. examples with reference material for later?
