When it comes to natural language processing, general-purpose APIs and generic models are often far less accurate than you need. Sometimes the APIs you need don't exist at all. Either way, you can use "corpus bootstrapping" to create custom models and APIs. Corpus bootstrapping is a method of rapidly producing a custom corpus for training highly accurate natural language processing models.

For example, suppose you want to do sentiment analysis on Spanish text, but you can only find APIs and models for English. Or you want to extract phrases that are not exactly noun phrases. Or you want to classify text, but no existing corpus has the categories you're interested in. All of these problems can be solved by iterating your way to a custom corpus for training custom models.
This talk will cover:
Code examples will be in Python using NLTK.
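To make the idea of "iterating your way to a custom corpus" concrete, here is a minimal sketch of one bootstrapping round with NLTK's `NaiveBayesClassifier`. It is not taken from the talk. The helper names (`bag_of_words`, `bootstrap_round`), the tiny Spanish seed set, and the 0.7 confidence threshold are all assumptions chosen for illustration.

```python
from nltk.classify import NaiveBayesClassifier


def bag_of_words(text):
    """Hypothetical feature extractor: each lowercase token becomes a boolean feature."""
    return {word: True for word in text.lower().split()}


def bootstrap_round(labeled, unlabeled, threshold=0.7):
    """Run one bootstrapping iteration (illustrative helper, not an NLTK function).

    Trains on the labeled texts, auto-labels the unlabeled texts the classifier
    is confident about, and returns the rest for manual review.
    """
    train_set = [(bag_of_words(text), label) for text, label in labeled]
    classifier = NaiveBayesClassifier.train(train_set)

    accepted, needs_review = [], []
    for text in unlabeled:
        dist = classifier.prob_classify(bag_of_words(text))
        label = dist.max()
        if dist.prob(label) >= threshold:
            accepted.append((text, label))
        else:
            needs_review.append(text)
    return classifier, labeled + accepted, needs_review


# Made-up seed corpus: a handful of hand-labeled Spanish examples.
seed = [
    ("excelente servicio", "pos"),
    ("muy bueno", "pos"),
    ("terrible servicio", "neg"),
    ("muy malo", "neg"),
]
unlabeled = ["excelente", "servicio", "malo terrible"]

classifier, corpus, needs_review = bootstrap_round(seed, unlabeled)
```

In practice you would correct the texts in `needs_review` by hand, add them to the corpus, and repeat the round until the classifier is accurate enough on held-out data. The threshold controls the trade-off between corpus growth and label noise.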
Jacob is the cofounder & CTO of Weotta and the author of Python Text Processing with NLTK 2.0 Cookbook. He blogs at Streamhacker and has created both the NLTK Demos & APIs and NLTK-Trainer.