This tutorial offers a basic introduction to practicing data science. We’ll walk through several typical projects that range from conceptualization to acquiring data, to analyzing and visualizing it, to drawing conclusions.
We assume familiarity with the command line and the ability to use libraries and code.
If you are running a UNIX distribution or Mac OS X, the base tools (bash, Python, and R) are typically already installed, so you only need to make sure that you have the supporting packages listed below. Windows users will need to install the tools separately from binaries, which can be downloaded from the following sites:
- Command-line tools: we recommend installing Cygwin to emulate a UNIX-like environment.
- Python: http://www.python.org/download/windows/
- R: http://cran.r-project.org/bin/windows/base/
A large part of analyzing data is dealing with structured and unstructured text. Several command-line tools allow for “quick and dirty” handling of this kind of data. For this tutorial we will rely on the following set, which comes with any UNIX-like distribution:
- sed
- awk
- grep
Python is a powerful high-level scripting language that is well suited for manipulating and analyzing data of all kinds. There are a number of Python libraries for analyzing data, but for this tutorial we will focus on the following:
- email: For parsing email data
- Natural Language Toolkit (NLTK): Powerful set of tools for performing natural language processing on text
- NumPy, SciPy, matplotlib: A trio of scientific computing libraries in Python that provide data types and functions for numeric and statistical analysis, as well as visualization
- Python Imaging Library (PIL): For the statistical analysis of image data
- NetworkX: For the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
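As a small illustration of the first library in the list, the sketch below parses a plain-text message with the standard-library email module. The message, the addresses, and the summarize helper are made up for illustration and are not part of the tutorial materials; the sketch also assumes Python 3.

```python
from email import message_from_string
from email.utils import parseaddr

# A made-up message used only for illustration; it is not tutorial data.
RAW_MESSAGE = """\
From: Alice Example <alice@example.com>
To: Bob Example <bob@example.com>
Subject: Meeting notes

The notes from today are attached.
"""


def summarize(raw):
    """Return the sender address, subject, and body of a plain-text email."""
    msg = message_from_string(raw)
    # parseaddr splits "Name <address>" into a (name, address) pair.
    _, sender = parseaddr(msg["From"])
    return {
        "sender": sender,
        "subject": msg["Subject"],
        # For a non-multipart message, get_payload() returns the body text.
        "body": msg.get_payload().strip(),
    }


if __name__ == "__main__":
    print(summarize(RAW_MESSAGE))
```

Real mailbox data is often multipart, in which case get_payload() returns a list of parts rather than a string, so this helper only covers the simplest case.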
There are a few ways to install Python packages, but we recommend either of the following. If you have Python setuptools installed, you can download and install any of the above libraries with the following command:
$ easy_install {package_name}
For example, to install NetworkX simply type:
$ easy_install networkx
You can also install packages from source by downloading the source files from each library's project site. Extract the archive, navigate to the folder that contains the source code, and run the following command:
$ python setup.py install
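After installing, it can help to confirm that each library is actually importable. The following is a minimal sketch, assuming Python 3.4 or later (for importlib.util.find_spec) and assuming the import names shown in REQUIRED; the missing_packages helper is hypothetical and not part of the tutorial code.

```python
import importlib.util

# Import names for the libraries in the list above. These names are
# assumptions: NLTK is assumed to import as "nltk", PIL as "PIL", and so on.
REQUIRED = ["email", "nltk", "numpy", "scipy", "matplotlib", "PIL", "networkx"]


def missing_packages(names):
    """Return the names from `names` that cannot be located for import."""
    # find_spec returns None when a top-level module cannot be found.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

The check only confirms that a package can be located; it does not import it, so a broken binary build could still fail at import time.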
The R statistical programming language has become the de facto lingua franca for statistical analysis. There are thousands of R packages available on CRAN to perform any number of analyses. For the purposes of this tutorial we will use the extremely powerful ggplot2 package by Hadley Wickham for data visualization.
To install packages in R we use the ``install.packages`` command:
install.packages("ggplot2", dependencies=TRUE)
Note that ggplot2 depends on several other packages, so on a new R installation this may take a few minutes.
During the tutorial there will be an opportunity to visualize network relationships. A very useful tool for visualizing networks is Gephi, which is a standalone application. If you wish to follow along with this portion of the tutorial, please download and install Gephi.
Joseph Adler has many years of experience in data mining and data analysis at companies including DoubleClick, American Express, and VeriSign. He graduated from MIT with a B.Sc. and an M.Eng. in Computer Science and Electrical Engineering. He is the inventor of several patents for computer security and cryptography, and the author of “Baseball Hacks” and “R in a Nutshell”. Currently, he is a senior data scientist at LinkedIn.
Hilary Mason is the Chief Scientist at bit.ly, where she finds sense in vast data sets. Her work involves both pure research and development of product-focused features.
She’s also a co-founder of HackNY, a non-profit organization that connects talented student hackers from around the world with startups in NYC.
Hilary recently started the data science blog Dataists and is a member of hacker collective NYC Resistor.
She has discovered two new species, loves to bake cookies, and asks way too many questions.
Drew Conway is a PhD student in political science at New York University. Drew studies terrorism and armed conflict, using tools from mathematics and computer science to gain a deeper understanding of these phenomena.
Jake Hofman is a member of the Human Social Dynamics group at Yahoo! Research. His work involves data-driven modeling of social data, focusing on applications of machine learning and statistical inference to large-scale data. He holds a B.S. in Electrical Engineering from Boston University and a Ph.D. in Physics from Columbia University.
Comments
I really wanted to like this session, but the bootcamp was more of an overview. It tried to cover half-a-semester’s worth of machine learning topics, python, R, and a specific graphics library for R in a day. Way too much material – drinking from a firehose. The presenters clearly did a ton of work getting this all together, but it’s way too much unless you already know the material being covered.
Not previewing the formatting and effectiveness / visibility of even “open source tool” produced presentations is inexcusable when that comes out as bad as this [dirty and worn] “boot camp” did.
I found the boot camp both fascinating and frustrating. Fascinating because it manages to give a glimpse of what’s within reach once one has a grip on fairly basic skills. Frustrating as it wasn’t hands-on and therefore I didn’t instantly turn into a data scientist as I had hoped. I would have thought we would have used our computers a bit more or at least that there would be more interaction. I am giving a very positive note nonetheless because the session succeeded in motivating me to learn what I don’t know in my ample free time.
Not really a boot camp, because I don’t see how you could work at that level of detail for such a huge group. I found it to be a good survey.
I wasn’t very pleased with the first half of this session (after the first half, I bailed for another session). I wish all the install prerequisites had been made more clear ahead of time. Other session presenters brought USB keys with all the materials to avoid this problem. There were no power strips so people’s computers were giving out, and the wireless was so bad I couldn’t download even the slides and code .zip bundle within a couple hours. And the LaTeX slides were often illegible.
Besides the technical snafus, I’m not sure who the talks were targeted to. If you weren’t already skilled at statistics and using R and SciPy, the super fast and cursory overview (liberally spiced with “you’ve all done this before, right?”) wouldn’t bring anyone up to speed – and if anyone had done it before, then a “hello world” kind of simple example in them wouldn’t give them anything extra either. I would give the feedback that the talks should be retooled (both in content and logistics) to be either clearly onboarding for beginners or pro tips for experts. The current content falls into the middle field of “neither.”
I appreciate all the hard work the presenters did to put together the session and am sorry to be critical, but I think there needs to be some meta-thought put into this session before re-delivering it in whole or in part.
@jerome see github.com/drewconway/strat...—also you can ssh to demo@bootcamp.infochimps.com to use a machine with most of what you need already installed
@Jerome, you can get all the materials here: github.com/drewconway/strat... or if you just want the slides and code go to bit.ly/campyslides
Hi, I arrived a bit late. Where could I grab the code examples and slides?
For mac os/x folk, look at stronginference.com/scipy-s...
For Windows users, easy_install reports errors installing matplotlib and warnings installing scipy. I went to their websites instead and used the binary installers for Windows, which seems to handle everything okay. (For Python 2.7, you need to go to the beta version of SciPy to download a compatible binary.)
Thanks JB. I’ve installed (or attempted to install) all the libs, about half seem fine; half had errors. Hopefully we can sort out at the boot camp! Might be good to have a little pre-help session for those who might not have the libraries installed correctly.
Elisabeth, we’ve added Software Requirements to the session description above. Let us know if you have any questions!
Should we have anything in particular installed on our machines?
@Rob
Each of the sections involves the use of the command-line and writing and running code, but is not the entirety of what will be covered. Active participation is completely up to you, and you are welcome to be involved in as much or as little as you like.
Also, all code examples will be made available on the day of the tutorial for you to take back with you and practice.
Can you provide a little more guidance as to how much “familiarity with the command line and the ability to use libraries and code” will be needed for the day to be beneficial? I have experience in both and can easily follow examples, but I have been out of active coding for several years. How much of the day will be active participation vs. examples with reference material for later?