Rapid Processing of Ad-Hoc Queries against Large Sets of Time Series

Description

Time series analysis is very common in syndromic surveillance. Large-scale biosurveillance systems typically perform thousands of time series queries per day: for example, monitoring nationwide over-the-counter (OTC) sales data may require separate time series analyses for tens of thousands of zip codes. More complex query types (e.g., queries over various combinations of patient age, gender, and other characteristics, or spatial scans performed over all potential disease clusters) may require millions of distinct queries. Commercial OLAP databases provide data cubes to handle such ad hoc queries, but these methods commonly suffer from long build times (often hours), huge memory requirements (necessitating the purchase of high-end database servers), and high maintenance costs. Additionally, data cubes typically take a second or more to respond to each complex query. This delay inconveniences users who want to perform multiple queries interactively; moreover, it makes data cubes far too slow for statistical analyses requiring millions of complex queries, which would demand days of processing time.
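To make the query workload concrete, the following sketch shows the kind of ad hoc time series query described above: filtering raw records by arbitrary attribute combinations and aggregating daily counts. The record schema and field names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from datetime import date

# Toy OTC-style sales records: (date, zipcode, age_group, gender, count).
# This schema is hypothetical; real biosurveillance feeds differ.
records = [
    (date(2006, 1, 1), "15213", "adult", "F", 3),
    (date(2006, 1, 1), "15213", "child", "M", 1),
    (date(2006, 1, 2), "15213", "adult", "F", 2),
    (date(2006, 1, 2), "90210", "adult", "F", 5),
]

FIELDS = {"zipcode": 1, "age_group": 2, "gender": 3}

def time_series_query(records, **filters):
    """Aggregate counts per day over records matching every attribute filter.

    Each call scans all raw records, which is why running thousands or
    millions of such queries without caching becomes prohibitively slow.
    """
    series = defaultdict(int)
    for rec in records:
        if all(rec[FIELDS[k]] == v for k, v in filters.items()):
            series[rec[0]] += rec[4]
    return dict(sorted(series.items()))

print(time_series_query(records, zipcode="15213", gender="F"))
# {date(2006, 1, 1): 3, date(2006, 1, 2): 2}
```

Every distinct filter combination triggers a full pass over the data, which motivates pre-aggregated structures such as data cubes.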

Objective

We present T-Cube, a new tool for very fast retrieval and analysis of time series data. Using a novel method of data caching, T-Cube answers time series queries approximately 1,000 times faster than state-of-the-art data cube technologies. This speedup has two main benefits: it enables fast anomaly detection through simultaneous statistical analysis of many thousands of time series, and it allows public health users to perform many complex, ad hoc time series queries on the fly without inconvenient delays.
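The abstract does not describe T-Cube's internal structure, but the general idea of caching pre-aggregated time series can be sketched as follows: one daily-count array is precomputed per distinct attribute combination, so an ad hoc query sums a few small cached arrays instead of rescanning raw rows. All names and the cell-based layout here are illustrative assumptions, not T-Cube's actual design.

```python
from collections import defaultdict

# Toy daily records: (day_index, zipcode, age_group, count) — hypothetical schema.
records = [
    (0, "15213", "adult", 3),
    (0, "15213", "child", 1),
    (1, "15213", "adult", 2),
    (1, "90210", "adult", 5),
]
N_DAYS = 2

def build_cell_cache(records, n_days):
    """Pre-aggregate one daily-count array per distinct attribute combination.

    Built once up front; afterwards queries never touch the raw records.
    """
    cache = defaultdict(lambda: [0] * n_days)
    for day, zipcode, age_group, count in records:
        cache[(zipcode, age_group)][day] += count
    return dict(cache)

def cached_query(cache, n_days, zipcode=None, age_group=None):
    """Answer an ad hoc query by summing matching cached arrays elementwise."""
    total = [0] * n_days
    for (z, a), series in cache.items():
        if (zipcode is None or z == zipcode) and (age_group is None or a == age_group):
            total = [t + s for t, s in zip(total, series)]
    return total

cache = build_cell_cache(records, N_DAYS)
print(cached_query(cache, N_DAYS, zipcode="15213"))  # [4, 2]
```

The work per query now scales with the number of cached cells rather than the number of raw records, which is the essential trade-off any pre-aggregation scheme (data cubes included) exploits.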
