Most monitoring systems use a time series database to store historical data. RRD and traditional relational databases such as MySQL are among the most common storage backends in popular monitoring systems such as MRTG, Cacti, Ganglia, Munin, Nagios, and Opsview. With the advent of the “NoSQL” movement, scalable, distributed data stores that run on large clusters of commodity machines have become readily available. This presentation introduces OpenTSDB, an open-source, horizontally scalable, general-purpose time series database built on top of HBase. We show how its design can be used to monitor large clusters at an unprecedented level of granularity. With such a system, it becomes possible to track orders of magnitude more time series from thousands of hosts and applications, at a resolution of a few seconds, providing accurate real-time monitoring as well as long-term trending.
When dealing with increasingly complex distributed systems and applications, engineers face the growing challenge of understanding the state of the systems they run. All modern network equipment, operating systems, and applications export a wealth of metrics about their state and their interactions with other services. In a large cluster, collecting, indexing, and storing all the monitoring data becomes a daunting task due to the sheer volume of information and its high rate of change. Metrics are typically collected by running an agent on each host. Data points are then persisted in chronological order in a time series database. Being able to plot the data is of utmost importance, and staying on top of the trends is critical for capacity planning and performance monitoring. Being able to correlate different time series is tremendously helpful when trying to understand the behavior of a service or conduct postmortem analyses.
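To make the idea of persisting data points chronologically concrete, here is a minimal in-memory sketch in Python. It is purely illustrative and is not how OpenTSDB or HBase store data; the class name `TinySeriesStore` and its methods are hypothetical names chosen for this example.

```python
import bisect
from collections import defaultdict


class TinySeriesStore:
    """Toy store (hypothetical): one chronologically sorted list per series."""

    def __init__(self):
        self._timestamps = defaultdict(list)  # series name -> sorted timestamps
        self._values = defaultdict(list)      # series name -> values, same order

    def append(self, series, timestamp, value):
        """Insert a data point, keeping the series sorted by timestamp."""
        ts = self._timestamps[series]
        idx = bisect.bisect_right(ts, timestamp)
        ts.insert(idx, timestamp)
        self._values[series].insert(idx, value)

    def query(self, series, start, end):
        """Return (timestamp, value) pairs with start <= timestamp <= end."""
        ts = self._timestamps.get(series, [])
        vals = self._values.get(series, [])
        lo = bisect.bisect_left(ts, start)
        hi = bisect.bisect_right(ts, end)
        return list(zip(ts[lo:hi], vals[lo:hi]))
```

Because each series is kept sorted, a time-range query reduces to two binary searches, which is the property that makes plotting an arbitrary window cheap. A real system at cluster scale would distribute this across machines rather than hold it in one process.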
OpenTSDB is a master-less, horizontally scalable system that uses HBase to store time series data. HBase is an open-source, distributed, non-relational database modeled after Google’s Bigtable. It offers low latency, high throughput, consistent operations that are atomic at the row level, fault tolerance, and load balancing. Thanks to those key features, it becomes possible to easily store significant amounts of time series data. By choosing an appropriate schema and using efficient algorithms, millions of data points from arbitrary time series can be retrieved and graphed quickly. OpenTSDB offers a simple yet powerful query interface that allows custom graphs to be generated over arbitrary time periods and at an unprecedented granularity.
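This abstract does not describe how data points reach OpenTSDB. As a hedged illustration, the sketch below formats a data point in the line-oriented `put` command that OpenTSDB's documentation describes (`put <metric> <timestamp> <value> <tagk=tagv> ...`). The helper name `format_put` and the example metric and tag names are assumptions made for this example, not part of the talk.

```python
def format_put(metric, timestamp, value, tags):
    """Build one line in OpenTSDB's text 'put' format (hypothetical helper).

    metric:    metric name, e.g. "sys.cpu.user" (example name, assumed)
    timestamp: Unix epoch time in seconds
    value:     integer or floating-point reading
    tags:      dict of tag name -> tag value; at least one tag is required
    """
    if not tags:
        raise ValueError("a data point needs at least one tag")
    # Sort tags so the same data point always produces the same line.
    tag_part = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {int(timestamp)} {value} {tag_part}\n"


if __name__ == "__main__":
    line = format_put("sys.cpu.user", 1356998400, 42.5, {"host": "web01"})
    print(line, end="")
```

A collection agent would typically write lines like this to a socket connected to an OpenTSDB daemon; the connection handling is omitted here because the host and port depend on the deployment.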
OpenTSDB has been in use at StumbleUpon for almost a year and has played a key role in helping our operations and engineering teams understand the behavior and performance of our systems, troubleshoot production issues, provide significant supporting material for postmortems, and do capacity planning and trend analysis. We constantly collect many hundreds of metrics and hundreds to thousands of data points per second.
Prior to managing StumbleUpon’s infrastructure, Benoit was part of the site reliability team running Google’s planetary-scale ad serving systems (for both AdWords and AdSense).