Multivariate

Description

Advances in informatics in recent years have given agencies responsible for disease surveillance access to a growing variety of health-monitoring information sources. These sources differ in clinical relevance and reliability, and range from streaming statistical indicator evidence to outbreak reports. Advances in information gathering have outpaced the capability to combine this disparate evidence for routine decision support. To meet the need for analytical tools that can manage an increasingly complex data environment, a fusion module based on Bayesian networks (BN) was developed in 2011 for the Department of Defense (DoD) Electronic Surveillance System for the Early Notification of Community-Based Epidemics (ESSENCE). In 2012 this module was expanded with syndromic queries, data-sensitive algorithm selection, and hierarchical fusion network training [1]. Subsequent efforts have produced a full fusion-enabled version of ESSENCE for beta testing, further upgrades, and a software specification for live DoD integration. Beta test reviewers cited the reduced alert burden and the detailed evidence underlying each alert. However, only 39 reported historical events were available for training and calibrating the three networks designed for fusion of the influenza-like-illness, gastrointestinal, and fever syndrome categories. The current presentation describes advances to formalize the network training, to calibrate the component alerting algorithms and decision nodes together for each BN, and to implement a validation strategy aimed at both the ESSENCE public health user community and the machine learning community.

Objective

This presentation aims to reduce the gap between multivariate analytic surveillance tools and public health acceptance and utility. We developed procedures to verify, calibrate, and validate an evidence fusion capability based on a combination of clinical and syndromic indicators and limited knowledge of historical outbreak events.

Submitted by elamb on
Description

Parallel surveillance, the separate monitoring of each continuous series, has been widely used for multivariate surveillance; however, it has severe limitations. First, it faces the multiplicity problem that arises from multiple testing. Second, ignoring correlation between series (CBS) reduces outbreak detection performance when the data are truly correlated. Finally, because health data are normally dependent over time, correlation within series (CWS) must also be taken into account. Sufficient reduction methods reduce the dimensionality of a simple multivariate series to a univariate series that has been proved to be sufficient for monitoring a mean shift in multivariate surveillance (1 and 2). Considering the sufficiency property and the nature of health data, we propose a sufficient reduction method for detecting a mean shift in multivariate series that takes CWS and CBS into account.

Objective

To reduce a p-dimensional multivariate series to a univariate series by deriving sufficient statistics that take into account all the information in the original data, including correlation within series (CWS) and correlation between series (CBS).

Description

Noroviruses (NoVs) are the single most common cause of epidemic, non-bacterial gastroenteritis worldwide. They cause an estimated 68-80% of gastroenteritis outbreaks in industrialized countries and possibly more in developing countries.

Objective

The purpose of this study was to identify global epidemiologic trends in human norovirus (NoV) outbreaks by transmission route and setting, and describe relationships between these characteristics, attack rates and the occurrence of genogroup I (GI) or genogroup II (GII) strains in outbreaks.

Description

Much progress has been made in developing novel systems for influenza surveillance and in exploring the choice of algorithms for detecting the start of a peak season. The use of multiple streams of surveillance data has been shown to improve performance, but few studies have explored its use in situational awareness to quantify the level or trend of disease activity. In this study we propose a multivariate statistical approach that describes overall influenza activity and handles interrupted or drop-in surveillance systems.

Objective

This paper describes the use of multiple streams of influenza surveillance data for situational awareness of influenza activity.

Description

INDICATOR is a multi-stream, open-source platform for biosurveillance and outbreak detection, currently focused on Champaign County, Illinois. It has been in production since 2008 and currently receives data from emergency department, patient advisory nurse, outpatient convenient care clinic, school absenteeism, animal control, and weather sources. Historical data from some of these sources go back to 2006.

Objective

To examine the correlation between different types of surveillance signals and climate information obtained from a well-defined geographic area.

Description

Time series analysis is very popular in syndromic surveillance. Public health officials typically track on the order of hundreds of disease models, or univariate time series, each day, looking for signals of disease outbreaks. These time series can be aggregated counts of various syndromes, possibly broken down by gender and age group. More recently, spatial scan algorithms have been used to find anomalous regions by aggregating zip-code-level counts [1]. Usually, public health officials have a set of disease models (for example, fever or headache symptoms in male adults being indicative of a particular disease). Based on past experience, they track these disease models daily to find anomalies that might indicate disease outbreaks. A typical syndromic surveillance system today tracks on the order of 100-200 time series daily using univariate algorithms such as CUSUM, moving average, and EWMA.
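
As a rough illustration of the kind of univariate algorithm mentioned above, the following sketch applies a one-sided CUSUM to a single daily count series. The function name, the 7-day baseline, the reference value k, the threshold h, and the sample counts are all illustrative assumptions, not settings taken from any surveillance system described here.

```python
from statistics import mean, stdev


def cusum_alerts(counts, baseline_len=7, k=0.5, h=4.0):
    """Return the day indices at which a one-sided upper CUSUM exceeds h.

    Counts are standardized against a fixed baseline window (an assumption
    made for simplicity; production systems often use a sliding baseline).
    """
    baseline = counts[:baseline_len]
    mu = mean(baseline)
    sigma = stdev(baseline) or 1.0  # guard against a constant baseline
    s = 0.0
    alerts = []
    for day in range(baseline_len, len(counts)):
        z = (counts[day] - mu) / sigma
        s = max(0.0, s + z - k)  # accumulate only upward deviations
        if s > h:
            alerts.append(day)
            s = 0.0  # reset the statistic after an alert
    return alerts


# Hypothetical daily counts: a quiet week followed by a three-day surge.
daily_counts = [10, 12, 11, 9, 10, 11, 10, 11, 10, 18, 20, 22, 11, 10]
print(cusum_alerts(daily_counts))
```

Running one such detector per series is what makes the number of monitored series the practical bottleneck.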

Consider a representative dataset for a state with 100 zip codes that monitors 10 syndromes across 3 age groups and 2 genders in emergency rooms. This yields a total of 6,000 (100 x 10 x 3 x 2) distinct time series, one for each combination of zip code, syndrome, age group, and gender. This number already seems too high to monitor daily. Hence most syndromic systems monitor only state-level aggregates for all syndromes or for a few combinations of syndrome, gender, and age group.
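
The arithmetic above can be checked by enumerating the attribute combinations directly. The sketch below uses made-up attribute labels and, as an added assumption, also counts the queries obtained when each attribute may be either a single value or "all values" (one simple form of aggregation).

```python
from itertools import product

# Placeholder attribute values; only the sizes (100, 10, 3, 2) come from the text.
zipcodes = [f"zip{i:03d}" for i in range(100)]
syndromes = [f"syndrome{i}" for i in range(10)]
age_groups = ["child", "adult", "senior"]
genders = ["female", "male"]

# One fine-grained time series per (zip code, syndrome, age group, gender).
series_keys = list(product(zipcodes, syndromes, age_groups, genders))
n_fine = len(series_keys)  # 100 * 10 * 3 * 2 = 6000


def n_single_or_all_queries(*attribute_sizes):
    """Count queries where each attribute is one value or the 'all' wildcard."""
    total = 1
    for size in attribute_sizes:
        total *= size + 1  # +1 for the wildcard
    return total


n_with_wildcards = n_single_or_all_queries(100, 10, 3, 2)  # 101 * 11 * 4 * 3
print(n_fine, n_with_wildcards)
```

Allowing arbitrary subsets of values per attribute, rather than a single value or a wildcard, grows the query space far faster still.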

But most real-world disease models are more complex and affect multiple syndromes or multiple age groups. We need to analyze more complex streams that aggregate multiple attribute values in order to mine interesting patterns that would otherwise go unseen. For example, a massive search could reveal that the number of senior female patients with fever and nausea has recently increased in the northeastern part of the state.

Objective

This paper shows how T-Cubes, a data structure that makes it feasible to track millions of disease models simultaneously, can be used to perform multivariate time series analysis with primitive univariate algorithms. Used in a brute-force search, T-Cubes help identify stronger disease outbreak signals that current surveillance systems miss.

Description

We propose a novel technique for building generative models of real-valued multivariate time series data streams. Such models are of considerable utility as baseline simulators in anomaly detection systems. The proposed algorithm, based on Linear Dynamical Systems (LDS) [1], learns stable parameters efficiently while yielding more accurate results than previously known methods. The resulting model can be used to generate infinitely long sequences of realistic baselines using small samples of training data.
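
To make the idea of an LDS baseline simulator concrete, here is a minimal sketch that generates a multivariate sequence from an already-known, stable LDS. It does not reproduce the learning algorithm proposed in the abstract; the matrices A and C, the noise levels, and all function names are illustrative assumptions. Stability is checked through the spectral radius of the dynamics matrix, which must be below 1 for generated sequences to stay bounded.

```python
import cmath
import random


def spectral_radius_2x2(A):
    """Largest eigenvalue magnitude of a 2x2 matrix (closed form)."""
    tr = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    disc = cmath.sqrt(tr * tr - 4 * det)
    return max(abs((tr + disc) / 2), abs((tr - disc) / 2))


def simulate_lds(A, C, n_steps, state_noise=0.1, obs_noise=0.05, seed=0):
    """Generate observations from x[t+1] = A x[t] + w,  y[t] = C x[t] + v."""
    rng = random.Random(seed)
    x = [1.0, 0.0]  # arbitrary initial hidden state
    ys = []
    for _ in range(n_steps):
        y = [sum(c * xi for c, xi in zip(row, x)) + rng.gauss(0, obs_noise)
             for row in C]
        ys.append(y)
        x = [sum(a * xi for a, xi in zip(row, x)) + rng.gauss(0, state_noise)
             for row in A]
    return ys


# Hypothetical parameters: a damped rotation observed through three streams.
A = [[0.9, 0.2], [-0.2, 0.9]]
C = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

radius = spectral_radius_2x2(A)
assert radius < 1.0, "dynamics matrix must be stable"
baseline = simulate_lds(A, C, n_steps=500)
print(radius, len(baseline), len(baseline[0]))
```

Because the dynamics are stable, the same call can be made with any n_steps to produce baselines of arbitrary length.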

Description

Current state-of-the-art outbreak detection methods [1-3] combine spatial, temporal, and other covariate information from multiple data streams to detect emerging clusters of disease. However, these approaches use fixed methods and models for analysis, and cannot improve their performance over time. Here we consider two methods for overcoming this limitation, learning a prior over outbreak regions and learning outbreak models from user feedback, using the recently proposed multivariate Bayesian scan statistic (MBSS) framework [1]. Given a set of outbreak types {O_k}, a set of space-time regions S, and the multivariate dataset D, MBSS computes the posterior probability Pr(H_1(S, O_k) | D) of each outbreak type in each region, using Bayes’ Theorem to combine the prior probabilities Pr(H_1(S, O_k)) and the data likelihoods Pr(D | H_1(S, O_k)). Each outbreak type can have a different prior distribution over regions, as well as a different model for its effects on the multiple streams. The set of outbreak types, as well as the region priors and outbreak models for each type, can be learned incrementally from labeled data or user feedback.
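
The Bayes' Theorem step described above can be sketched in a few lines. This is only an illustration of the posterior calculation, not the MBSS implementation: the function name, the region and outbreak-type labels, and all prior and likelihood values are made up, and a "no outbreak" null hypothesis is included as an assumption so that the hypotheses are exhaustive.

```python
def outbreak_posteriors(prior_null, priors, lik_null, likelihoods):
    """Combine priors and likelihoods into posteriors via Bayes' Theorem.

    priors and likelihoods are dicts keyed by (region, outbreak_type).
    The null (no outbreak) hypothesis is an assumption added here so that
    the hypotheses are exhaustive and the posteriors sum to one.
    """
    evidence = prior_null * lik_null + sum(
        priors[h] * likelihoods[h] for h in priors
    )
    post_null = prior_null * lik_null / evidence
    post = {h: priors[h] * likelihoods[h] / evidence for h in priors}
    return post_null, post


# Hypothetical numbers for two regions and two outbreak types.
priors = {
    ("S1", "flu"): 0.04, ("S1", "gi"): 0.01,
    ("S2", "flu"): 0.04, ("S2", "gi"): 0.01,
}
likelihoods = {
    ("S1", "flu"): 0.02, ("S1", "gi"): 0.002,
    ("S2", "flu"): 0.001, ("S2", "gi"): 0.0005,
}
post_null, post = outbreak_posteriors(0.9, priors, 0.001, likelihoods)
most_likely = max(post, key=post.get)
print(most_likely, round(post[most_likely], 3))
```

Learning from feedback, in this picture, amounts to updating the entries of the prior and likelihood models rather than changing the combination rule.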

Objective

We argue that the incorporation of machine learning algorithms is a natural next step in the evolution and improvement of disease surveillance systems. We consider how learning can be incorporated into one recently proposed multivariate detection method, and demonstrate that learning can enable systems to substantially improve detection performance over time.

Description

Current syndromic surveillance systems run multiple simultaneous univariate procedures, each focused on detecting an outbreak in a single data stream. Multivariate procedures have the potential to better detect some types of outbreaks, but most of the existing methods are directionally invariant and are thus less relevant to the problem of syndromic surveillance. This article develops two directionally sensitive multivariate procedures and compares the performance of these procedures both with the original directionally invariant procedures and with the application of multiple univariate procedures using both simulated and real syndromic surveillance data. The performance comparison is conducted using metrics and terminology from the statistical process control (SPC) literature with the intention of helping to bridge the SPC and syndromic surveillance literatures. This article also introduces a new metric, the average overlapping run length, developed to compare the performance of various procedures on limited actual syndromic surveillance data. Among the procedures compared, in the simulations the directionally sensitive multivariate cumulative sum (MCUSUM) procedure was preferred, whereas in the real data the multiple univariate CUSUMs and the MCUSUM performed similarly. This article concludes with a brief discussion of the choice of performance metrics used herein versus the metrics more commonly used in the syndromic surveillance literature (sensitivity, specificity, and timeliness), as well as some recommendations for future research.
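
For readers unfamiliar with the MCUSUM, the sketch below shows one common formulation: Crosier's multivariate CUSUM recursion, made directionally sensitive by truncating each component of the cumulative sum at zero so that only increases accumulate. Whether this matches the article's exact procedure is an assumption, and the function names, the parameters k and h, the identity covariance, and the test data are all illustrative.

```python
import math


def quad_norm(v, sigma_inv):
    """Return sqrt(v' * sigma_inv * v) for a vector and an inverse covariance."""
    n = len(v)
    q = sum(v[i] * sigma_inv[i][j] * v[j] for i in range(n) for j in range(n))
    return math.sqrt(max(q, 0.0))


def directional_mcusum(observations, mu, sigma_inv, k=0.5, h=3.0):
    """Return time indices at which the truncated MCUSUM statistic exceeds h."""
    s = [0.0] * len(mu)
    alerts = []
    for t, x in enumerate(observations):
        d = [si + xi - mi for si, xi, mi in zip(s, x, mu)]
        c = quad_norm(d, sigma_inv)
        if c <= k:
            s = [0.0] * len(mu)
        else:
            shrink = 1.0 - k / c
            # Truncating at zero keeps only upward (outbreak-like) deviations.
            s = [max(0.0, di * shrink) for di in d]
        if quad_norm(s, sigma_inv) > h:
            alerts.append(t)
    return alerts


# Hypothetical standardized data: two streams, identity covariance.
mu = [0.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
upward_shift = [[1.5, 1.5]] * 5
downward_shift = [[-1.5, -1.5]] * 5

print(directional_mcusum(upward_shift, mu, identity))
print(directional_mcusum(downward_shift, mu, identity))
```

A directionally invariant procedure would signal on both shifts; with the truncation, only the upward shift accumulates.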
