Survival Analysis for Cache Time-to-Live Optimization

Robert Lancaster (Orbitz Worldwide)
Deep Data, A-B

The Orbitz family of travel sites receives hundreds of thousands of searches each day from consumers looking for hotels. A single consumer-initiated request may spawn dozens of individual hotel rate look-ups, resulting in many millions of such look-ups per day. Most hotel inventory is managed by suppliers, often in older systems that cannot handle a large number of requests. To minimize the impact of high consumer traffic on these suppliers, Orbitz caches rate information locally. Such caching also helps Orbitz maintain look-to-book ratios with its suppliers and reduces the latency experienced by the consumer.

Hotel rate and availability information can change over time, causing cached information to go stale. Stale entries are undesirable because consumers then see discrepancies between cached and real-time rates. Therefore, each piece of information stored in the cache is given a time-to-live (TTL). Historically, TTL values have been set by business intuition rather than by a data-driven approach.

Here we investigate the applicability of predictive modelling to optimizing TTL values for rates in our hotel rate cache. Specifically, we examine survival analysis as a means of modelling hotel rate volatility. Survival analysis is a statistical technique that models the time until a particular event occurs. In the context of biological organisms, the event of interest is often the death of the organism (hence the name). In our context, the event of interest is a change in the rate offered by a hotel or in its availability status.
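As a rough illustration of the idea (not the model presented in the talk), the sketch below computes a Kaplan-Meier estimate of the survival function S(t), the probability that a cached rate is still unchanged t minutes after it was fetched, and then picks a TTL from it. The function names, the sample durations, and the 30% staleness tolerance are all assumptions made for this example.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of S(t).

    durations: time each rate was followed (hypothetical unit: minutes).
    observed:  True if a rate change was seen at that time, False if the
               record is right-censored (we stopped watching first).
    Returns a list of (t, S(t)) pairs, one per distinct event time.
    """
    events = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # events and censorings leaving the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        changed = events.get(t, 0)
        if changed:
            survival *= 1.0 - changed / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def ttl_for_staleness(curve, max_stale_prob):
    """Latest event time at which S(t) is still >= 1 - max_stale_prob.

    Returns 0 if even the first event time is already below the threshold.
    """
    threshold = 1.0 - max_stale_prob
    ttl = 0
    for t, survival in curve:
        if survival >= threshold:
            ttl = t
        else:
            break
    return ttl


# Made-up sample: minutes until a rate change (True) or censoring (False).
durations = [5, 10, 10, 20, 30, 30, 45, 60]
observed = [True, True, False, True, True, True, False, True]

curve = kaplan_meier(durations, observed)
ttl = ttl_for_staleness(curve, max_stale_prob=0.3)
```

With this made-up sample, the estimated survival stays above 0.7 through the 10-minute mark, so a 30% staleness tolerance yields a TTL of 10 minutes. A model fit to real data would likely condition on covariates such as the hotel or the stay date, which this estimator ignores.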

We highlight some of the technical challenges in collecting nearly a billion records each day. These include how we use MongoDB as a collection mechanism for real-time events emitted by our hotel applications before the events are transferred to Hadoop for long-term storage and processing. We’ll also cover how the data is prepared in Hadoop before it is used to build and evaluate our predictive models.
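One plausible preparation step, sketched here under our own assumptions rather than taken from the actual Hadoop pipeline, is to turn the time-ordered rate observations for a single cache key into survival records: one (duration, observed) pair per period during which the rate stayed the same. The final period is right-censored because no change was seen before observation ended. The function name and the (timestamp, rate) record layout are hypothetical.

```python
def spells_from_observations(observations):
    """Convert time-ordered (timestamp_minutes, rate) pairs for one key
    into (duration, observed) survival records.

    A change is dated to the first observation showing the new rate,
    which is an approximation: the true change happened somewhere
    between that observation and the previous one.
    """
    spells = []
    if not observations:
        return spells
    start_time, current_rate = observations[0]
    last_time = start_time
    for timestamp, rate in observations[1:]:
        if rate != current_rate:
            # Rate changed: close the current spell as an observed event.
            spells.append((timestamp - start_time, True))
            start_time, current_rate = timestamp, rate
        last_time = timestamp
    # The last spell has no observed change, so it is right-censored.
    spells.append((last_time - start_time, False))
    return spells


# Made-up observations for one hypothetical hotel/stay-date key.
observations = [(0, 100), (15, 100), (40, 110), (55, 110), (70, 95), (90, 95)]
spells = spells_from_observations(observations)
```

Here the rate holds for 40 minutes, then 30 minutes, and the last 20-minute period ends without an observed change. At scale the same grouping would run per key inside a distributed job; this single-process version only shows the record shape a survival model expects.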

Finally, we show how our results have challenged some of our long-held assumptions about rate volatility and confirmed others, and how we’re using these results to improve our look-to-book and cache hit ratios while reducing the rate discrepancies experienced by consumers.

Robert Lancaster

Orbitz Worldwide

Rob Lancaster has spent the last 13 years in software development, building solutions for the travel industry. He is currently a Solutions Architect for Orbitz with a focus on applying predictive analysis to improve the performance of Orbitz hotel systems. He is the organizer of Chicago’s Machine Learning meetup group and an organizer of Chicago’s Big Data user group.
