Data Analytics

Presented December 13, 2018.

For public health surveillance, is machine learning worth the effort? What methods are relevant? Do you need special hardware? This talk was motivated by these and other questions asked by ISDS members. It will focus on providing practical—and slightly opinionated—advice about how to determine whether machine learning could be a useful tool for your problem.

Description

The space-time scan statistic is a powerful statistical tool for prospective disease surveillance. It searches over a set of spatio-temporal regions (each representing some spatial area S for the last k days), finding the most significant regions (S, k) by maximizing a likelihood ratio statistic, and computing p-values of these potential clusters by randomization.
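The randomization step can be sketched as follows. This is a minimal illustration, assuming the maximum region score has already been computed for the observed data and for each randomized replica of the data; the function name and the toy scores are hypothetical and not taken from the paper.

```python
def monte_carlo_p_value(observed_max, replica_maxima):
    """Randomization p-value for the highest-scoring region.

    observed_max: maximum region score found in the real data.
    replica_maxima: maximum region score found in each replica data set
    generated under the null hypothesis of no clusters.
    """
    # Count replicas whose best region scores at least as high as the real one.
    at_least_as_extreme = sum(1 for score in replica_maxima if score >= observed_max)
    # The "+ 1" terms count the observed data set as one of the replicas,
    # so the p-value can never be exactly zero.
    return (at_least_as_extreme + 1) / (len(replica_maxima) + 1)


# Toy example: one of four replicas beats the observed score.
p = monte_carlo_p_value(10.0, [1.0, 2.0, 12.0, 3.0])
```

A small p-value means that few replicas generated under the null hypothesis produced a region as extreme as the one observed.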

The standard, "population-based" method assumes that, for each spatial location s_i on each day t, we have a population p_i^t and a count (observed number of cases) c_i^t. Then, under the null hypothesis of no clusters, we expect each count c_i^t to be proportional to its population p_i^t. We then search for regions (S, k) with a disease rate (cases per unit population) significantly higher inside the region than outside. In the original space-time scan statistic, the populations are assumed to be given; in other work, populations are estimated assuming independence of space and time.

Here we propose an alternative, "expectation-based" method, in which we infer the expected number of cases b_i^t in each spatial location, based on the time series of previous counts. In this case, under the null hypothesis of no clusters, we expect each count c_i^t to be equal to b_i^t, rather than proportional to population. We then search for regions (S, k) with counts that are significantly higher than expected.
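As a rough illustration of the expectation-based idea, the sketch below scores one candidate region (S, k) by comparing its aggregate count C with its aggregate expected count B. It uses the expectation-based Poisson log-likelihood ratio, C log(C/B) + B - C when C > B and 0 otherwise, which is one commonly used score of this kind. The abstract does not say how the expected counts are inferred, so the moving-average baseline, the 28-day window, the function names, and the toy data are all assumptions made for illustration.

```python
import math


def moving_average_baseline(history, window=28):
    """Expected daily count for one location: mean of the last `window` counts.
    (Assumed baseline method; the abstract does not specify one.)"""
    recent = history[-window:]
    return sum(recent) / len(recent)


def ebp_score(count, baseline):
    """Expectation-based Poisson log-likelihood ratio for aggregate
    count C and aggregate baseline B; zero unless C exceeds B."""
    if baseline <= 0 or count <= baseline:
        return 0.0
    return count * math.log(count / baseline) + baseline - count


def score_region(counts, region, k, window=28):
    """Score region S (a list of location ids) over the last k days.
    `counts` maps each location id to its daily counts, oldest first."""
    total_count = 0.0
    total_baseline = 0.0
    for loc in region:
        series = counts[loc]
        history, recent = series[:-k], series[-k:]
        total_count += sum(recent)
        # Same expected count for each of the k days in the window.
        total_baseline += k * moving_average_baseline(history, window)
    return ebp_score(total_count, total_baseline)


# Toy data: locations "a" and "b" rise over the last two days, "c" does not.
counts = {
    "a": [10] * 28 + [10, 18, 20],
    "b": [5] * 28 + [5, 9, 11],
    "c": [8] * 28 + [8, 8, 8],
}
score_ab = score_region(counts, ["a", "b"], k=2)
score_abc = score_region(counts, ["a", "b", "c"], k=2)
score_c = score_region(counts, ["c"], k=2)
```

In this toy data the region containing only the two affected locations scores higher than the region that also includes the unaffected one, and the unaffected location alone scores zero. A full scan would repeat this over every region and window length and keep the maximum.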

Objective

This paper describes a new class of space-time scan statistics designed for rapid detection of emerging disease clusters. We evaluate these methods on the task of prospective disease surveillance, and show that our methods consistently outperform the standard space-time scan statistic approach.

Submitted by Sandra.Gonzale… on
Description

There is limited closed-form statistical theory to indicate how well the prospective space-time permutation scan statistic will perform in the detection of localized excess illness activity. Instead, detection methods can be applied to simulated data to gain insight about detection performance. Such results are dependent on the way outbreaks are simulated and the nature of the background data. As an alternative, we explore an empirical approach in which the membership of a large health plan is used to represent a community and detection performance is assessed in samples from the larger group.

Objective

Our goal was to assess the impact of sentinel sample size and signal criteria on the performance of daily prospective space-time permutation detection, by comparing results from random samples of varying size drawn from a large health plan with results from the full membership.

Submitted by Sandra.Gonzale… on
Description

Expectation-based scan statistics extend the traditional spatial scan statistic approach by using historical data to infer the expected counts for each spatial location, then detecting regions with higher than expected counts. Here we consider five recently proposed expectation-based statistics: the expectation-based Poisson (EBP), expectation-based Gaussian (EBG), population-based Poisson (PBP), population-based Gaussian (PBG), and robust Bernoulli-Poisson (RBP) methods. We also consider five different time series analysis methods used to predict the expected counts (including the Holt-Winters method and moving averages optionally adjusted for day of week and seasonality), giving a total of 25 combinations of statistic and time series method to compare. All of these methods are detailed in the full paper.
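To make the day-of-week adjustment concrete, the sketch below contrasts a plain moving average with one simple adjusted variant that averages only the counts observed on the same weekday in earlier weeks. This stratified scheme is an assumption chosen for illustration and may differ from the adjustment used in the full paper. Holt-Winters and the seasonal adjustment are not shown, and the function names are hypothetical.

```python
def moving_average_forecast(counts, window=7):
    """Plain moving average: mean of the last `window` daily counts."""
    recent = counts[-window:]
    return sum(recent) / len(recent)


def dow_adjusted_forecast(counts, weeks=4):
    """Forecast the next day's count from the same weekday in earlier weeks.

    counts: daily counts, oldest first; the day being forecast is the one
    immediately after the last entry.
    """
    # counts[-7] is the same weekday one week before the forecast day;
    # stepping back by 7 collects the same weekday in earlier weeks.
    same_weekday = counts[-7::-7][:weeks]
    if not same_weekday:
        raise ValueError("need at least 7 days of history")
    return sum(same_weekday) / len(same_weekday)


# Toy series with a strong weekly pattern: low on day 0, high on day 6.
history = [1, 2, 3, 4, 5, 6, 7] * 4
plain = moving_average_forecast(history)   # averages over the whole week
adjusted = dow_adjusted_forecast(history)  # averages earlier "day 0" counts
```

With a weekly pattern this strong, the plain average expects a mid-week level on what is actually a quiet weekday, while the adjusted forecast tracks the pattern.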

Objective

We present a systematic empirical comparison of five recently proposed expectation-based scan statistics, in order to determine which methods are most successful for which spatial disease surveillance tasks.

Submitted by Sandra.Gonzale… on
Description

ARIMA models use past values (autoregressive terms) and past forecasting errors (moving average terms) to generate future forecasts, making them potential candidates for modeling citywide time series of syndromic data [1]. While past research supports the use of ARIMA modeling as a detection algorithm in syndromic surveillance [2], there has been little evaluation of an ARIMA model's prospective outbreak detection capabilities. We built an ARIMA model to prospectively detect simulated outbreaks in ED syndromic data. This method is one of eight being formally evaluated as part of a grant from the Alfred P. Sloan Foundation.
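The interplay of autoregressive and moving average terms can be seen in a stripped-down sketch. The code below computes one-step-ahead forecasts for an ARMA(1,1) model with fixed coefficients: each forecast combines the previous observation with the previous forecast error. It deliberately omits differencing, seasonal terms, and parameter estimation, so it is not the model built in this work. The coefficient values and the toy series are assumptions for illustration only.

```python
def arma11_forecasts(series, c, phi, theta):
    """One-step-ahead ARMA(1,1) forecasts with fixed coefficients.

    forecast_t = c + phi * y_{t-1} + theta * e_{t-1},  e_t = y_t - forecast_t
    The first observation has no forecast, and the initial error is taken as 0.
    Returns (forecasts, errors) for t = 1 .. len(series) - 1.
    """
    forecasts, errors = [], []
    prev_error = 0.0
    for t in range(1, len(series)):
        forecast = c + phi * series[t - 1] + theta * prev_error
        error = series[t] - forecast
        forecasts.append(forecast)
        errors.append(error)
        prev_error = error
    return forecasts, errors


# Toy daily visit counts with a one-day spike of 150.
visits = [100, 100, 100, 150, 100]
forecasts, errors = arma11_forecasts(visits, c=50.0, phi=0.5, theta=0.25)
```

On the spike day the forecast error is large and positive, which is the kind of residual a detection rule could threshold. The following day's forecast is then pulled upward by both the high observation and the carried-over error.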

Objective

To evaluate seasonal autoregressive integrated moving average (ARIMA) models for prospective analysis of New York City (NYC) emergency department (ED) syndromic data.

Submitted by knowledge_repo… on
Description

We previously experimented with tracking influenza in ER chief complaint data using existing syndromic surveillance tools. We identified several deficiencies in these tools: poor natural language processing, inefficient user interfaces, frequent (thus costly) false alarms, and one-size-fits-all approaches to syndromes. Furthermore, we were surprised that some epidemiologists we spoke with had relatively little faith in existing surveillance tools, and so we set out to build one that would address their concerns: DADAR (Data Analysis, Detection, And Response).

Objective

To develop an adaptable platform for periodically loading semi-structured medical text, extracting syndromic information using advanced natural language processing, detecting outbreaks in the data (including the ability to tune sensitivity vs. specificity on a syndrome-by-syndrome basis so as to reduce the rate of false alarms), generating timely cartographic surveillance reports, and providing tools to quickly validate or rule out syndromic alerts.

Submitted by knowledge_repo… on
Description

As technology advances, the implementation of statistically and computationally intensive methods to detect unusual clusters of illness becomes increasingly feasible at the state and local level [2]. Bayesian methods allow for the incorporation of prior knowledge directly into the model, which could potentially improve estimation of expected counts and enhance outbreak detection. This method is one of eight being formally evaluated as part of a grant from the Alfred P. Sloan Foundation.

Objective

To adapt a previously described Bayesian model-based surveillance technique for cluster detection [1] to NYC Emergency Department (ED) visits.

Submitted by knowledge_repo… on
Description

The success of syndromic surveillance depends on the ability of the surveillance community to quickly and accurately recognize anomalous data. Current methods of anomaly detection focus on sets of syndromic categories and rely on a priori knowledge to map chief complaints to these general syndromic categories. As a result, the mapping scheme may miss key terms and phrases that have not previously been used. Furthermore, analysts have no good way of being alerted to these new terms so that they can decide whether to add them to the syndromic mapping scheme. We use a dynamic dictionary of terms to sidestep the limitations of a priori knowledge in this rapidly evolving field by alerting the analyst to rare and brand-new words used in the chief complaint field.

Objective

To automate the detection of very unusual emergency department chief complaints based on a comparison between a trained dictionary of terms and the unstructured chief complaint field.
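A minimal sketch of this kind of comparison is shown below, assuming the "trained dictionary" is simply a table of term frequencies built from historical chief complaints. The tokenization rule, the rarity threshold, the function names, and the example complaints are all hypothetical. The abstract does not describe how the dictionary is trained or updated.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def train_dictionary(complaints):
    """Count how often each term appears in historical chief complaints."""
    term_counts = Counter()
    for complaint in complaints:
        term_counts.update(tokenize(complaint))
    return term_counts


def flag_unusual_terms(complaint, dictionary, rare_threshold=1):
    """Return {term: "new" | "rare"} for terms the dictionary has never
    seen, or has seen at most `rare_threshold` times."""
    flagged = {}
    for term in set(tokenize(complaint)):
        seen = dictionary.get(term, 0)
        if seen == 0:
            flagged[term] = "new"
        elif seen <= rare_threshold:
            flagged[term] = "rare"
    return flagged


history = ["fever and cough", "cough and sore throat", "fever vomiting"]
dictionary = train_dictionary(history)
alerts = flag_unusual_terms("Fever after vaping, sore chest", dictionary)
```

To keep the dictionary dynamic, terms an analyst has reviewed could be added to the counts so they stop raising alerts. In practice common function words would probably be filtered out as well.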

Submitted by knowledge_repo… on
Description

Epidemic dynamics of dengue fever are driven by complex interactions between hosts, vectors and viruses that are influenced by environmental and climatic factors [1]. The development of new methods to identify such specific characteristics becomes crucial to better understand and control spatiotemporal transmission. We concentrated our efforts on applying sequential pattern mining [2] to an epidemiological and meteorological dataset to identify potential drivers of dengue fever outbreaks.

Objective

We used a data mining method based on sequential pattern extraction to identify local meteorological drivers of dengue fever epidemics in French Guiana.

Submitted by knowledge_repo… on