Apache Pig makes Apache Hadoop easier to use thanks to its high-level data flow language, Pig Latin. When writing a Pig Latin query, you make choices in each statement. This starts with the very first statement in your query, the load statement, which reflects your choice of storage format. These decisions can have a significant impact on the performance of your query. For example, the choice of join algorithm could result in orders-of-magnitude differences in performance. In this talk, we will discuss common data analysis tasks, the choices one makes when writing a query, and the impact of each on query run time. The core principles behind the optimization recommendations shared during this presentation are applicable to all MapReduce applications.
Knowledge of the following will be useful:
Thejas Nair is a software engineer working on the Apache Pig, HCatalog, and Hive projects at Hortonworks. He is a committer and PMC member of the Apache Pig project. Previously, he worked at Yahoo for 9 years, developing solutions for large-scale distributed data processing.
Jianyong Dai is an Apache Pig PMC member and committer who worked on Pig for almost 3 years at Yahoo and later at Hortonworks. Jianyong holds a PhD in computer science from the University of Central Florida, specializing in computer security, data mining, and distributed computing, and is interested in data science, large-scale processing, Hadoop, Pig, HCatalog, Hive, and more.