Corpus Bootstrapping with NLTK

Jacob Perkins (Weotta)
Deep Data
Location: A-B

When it comes to natural language processing, general APIs and generic models are often far less accurate than you want. Or maybe the APIs you need don’t even exist. Either way, you can use “corpus bootstrapping” to create custom models and APIs. Corpus bootstrapping is a method of rapidly producing a custom corpus for training highly accurate natural language processing models. For example, suppose you want to do sentiment analysis for Spanish text, but you can only find APIs and models for English. Or you want to do phrase extraction for phrases that are not exactly noun phrases. Maybe you want to classify text but there’s no corpus in existence with the categories you’re interested in. All of these problems can be solved by iterating your way to a custom corpus for training custom models.
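The abstract does not spell out the iteration itself, so the following is only one plausible reading of "iterating your way to a custom corpus": a self-training loop that starts from a few hand-labeled texts, trains a classifier, and folds confidently classified texts back into the corpus. The helper names (`bag_of_words`, `bootstrap`), the word-presence features, and the confidence threshold are all assumptions made for illustration, not the speaker's method.

```python
from nltk.classify import NaiveBayesClassifier


def bag_of_words(text):
    """Turn a text into a simple word-presence feature dict (illustrative features)."""
    return {word: True for word in text.lower().split()}


def train(corpus):
    """Train a Naive Bayes classifier on a list of (text, label) pairs."""
    return NaiveBayesClassifier.train(
        [(bag_of_words(text), label) for text, label in corpus]
    )


def bootstrap(seed, unlabeled, rounds=3, threshold=0.9):
    """Grow a labeled corpus from a small hand-labeled seed (hypothetical helper).

    seed: list of (text, label) pairs labeled by hand.
    unlabeled: list of raw texts with no labels.
    Returns the grown corpus and a classifier trained on it.
    """
    corpus = list(seed)
    remaining = list(unlabeled)
    for _ in range(rounds):
        classifier = train(corpus)
        still_unlabeled = []
        for text in remaining:
            dist = classifier.prob_classify(bag_of_words(text))
            label = dist.max()
            # Keep only predictions the current model is confident about.
            if dist.prob(label) >= threshold:
                corpus.append((text, label))
            else:
                still_unlabeled.append(text)
        if len(still_unlabeled) == len(remaining):
            break  # nothing new was added, so further rounds would not help
        remaining = still_unlabeled
    # Retrain once more so the returned model has seen every added text.
    return corpus, train(corpus)


seed = [("great food and friendly staff", "pos"),
        ("terrible service and cold food", "neg")]
unlabeled = ["great staff", "cold and terrible", "friendly place"]
corpus, classifier = bootstrap(seed, unlabeled, threshold=0.6)
print(len(corpus), "labeled texts after bootstrapping")
```

In practice the automatically added labels would be spot-checked by hand between rounds, since a self-trained model can reinforce its own mistakes.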

This talk will cover:

  • creating a classified corpus from scratch
  • generating a sentiment analysis corpus in Spanish by starting with an English corpus
  • using simplified part-of-speech tags to quickly produce a custom corpus for phrase extraction
  • training custom models with NLTK-Trainer

Code examples will be in Python using NLTK.
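As a rough sketch of the phrase-extraction item above, a chunk grammar written over a handful of simplified tags can pull out candidate phrases to review and add to a custom corpus. The tag names (`DET`, `ADJ`, `N`, `V`, `P`), the hand-tagged sentence, the `PHRASE` grammar, and the `extract_phrases` helper are all hypothetical; the code assumes NLTK 3, where tree nodes expose `label()`.

```python
from nltk.chunk import RegexpParser

# Hypothetical chunk grammar over simplified tags:
# any number of adjectives followed by one or more nouns.
GRAMMAR = r"""
  PHRASE: {<ADJ>*<N>+}
"""
parser = RegexpParser(GRAMMAR)


def extract_phrases(tagged_sentence):
    """Return the text of every PHRASE chunk in a list of (word, tag) pairs."""
    tree = parser.parse(tagged_sentence)
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "PHRASE")
    ]


# A sentence tagged by hand with simplified tags, so no tagger model is needed.
sentence = [("the", "DET"), ("spicy", "ADJ"), ("fish", "N"), ("tacos", "N"),
            ("are", "V"), ("near", "P"), ("the", "DET"), ("beach", "N")]
print(extract_phrases(sentence))
```

Working with a few coarse tags keeps the grammar short, which makes it quicker to adjust the pattern when the extracted phrases are not quite the ones wanted.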

Jacob Perkins

Weotta

Jacob is the cofounder & CTO of Weotta and the author of Python Text Processing with NLTK 2.0 Cookbook. He blogs at Streamhacker and has created both the NLTK Demos & APIs and NLTK-Trainer.
