Data Analytics

Description

The Veterans Health Administration (VHA) uses the Electronic Surveillance System for the Early Notification of Community-based Epidemics (ESSENCE) to detect disease outbreaks and other health-related events earlier than other forms of surveillance. Although Veterans may use any VHA facility in the world, the strongest predictor of which health care facility a patient uses is geographic proximity to the patient's residence. A number of outbreaks have occurred in the Veteran population when geographically separate groups convened in a single location for professional or social events. One classic example was the initial Legionnaires' disease outbreak, identified among attendees of the American Legion convention in Philadelphia in 1976. Numerous events involving travel by large Veteran (and employee) populations are scheduled each year.

Objective

To develop an algorithm to identify disease outbreaks by detecting aberrantly large proportions of patient residential ZIP codes outside a health care facility catchment area.
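
One way to make this objective concrete is a one-sided binomial test on the share of visits from ZIP codes outside the catchment area. The sketch below is illustrative only and is not the VHA algorithm: the function names, the baseline proportion `p0`, and the alert threshold `alpha` are assumptions introduced for the example.

```python
from math import comb


def binomial_upper_tail(k, n, p0):
    """Return P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def flag_out_of_catchment(visit_zips, catchment_zips, p0, alpha=0.01):
    """Flag a day whose share of out-of-catchment visits is aberrantly large.

    visit_zips:     residential ZIP code of each visit on the day under test
    catchment_zips: set of ZIP codes inside the facility catchment area
    p0:             assumed historical proportion of out-of-catchment visits
    alpha:          assumed alert threshold on the one-sided p-value
    """
    n = len(visit_zips)
    k = sum(1 for z in visit_zips if z not in catchment_zips)
    p_value = binomial_upper_tail(k, n, p0) if n else 1.0
    return p_value < alpha, k, p_value


# Hypothetical day: 20 visits, 8 of them from a distant ZIP code.
flagged, outside, p_value = flag_out_of_catchment(
    ["19104"] * 12 + ["60601"] * 8, {"19104"}, p0=0.05
)
```

In practice the baseline proportion would be estimated per facility from historical visits, and it may vary by day of week or season.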

Submitted by elamb on
Description

An expanded ambulatory health record, the Comprehensive Ambulatory Patient Encounter Record (CAPER), will provide multiple types of data for use in DoD ESSENCE. One type of data not previously available is the Reason for Visit (ROV), a free-text field analogous to the Chief Complaint (CC). Intake personnel ask patients why they have come to the clinic and record their responses. Ideally, the text reflects the patient's actual statement. In practice, staff often "translate" the statement and add jargon. Text parsing maps key words or phrases to specific syndromes. Challenges exist given the vagaries of the English language and local idiomatic usage. Still, CC analysis by text parsing has been successful in civilian settings [1]. However, it was necessary to modify the parsing to reflect the characteristics of CAPER data and of the covered population. For example, consider the Shock/Coma syndrome. Loss of consciousness is relatively common in military settings due to prolonged standing, exertion in hot weather with dehydration, and similar causes, whereas the main surveillance concern is shock or coma due to infectious causes. To reduce false-positive mappings, the parser now excludes terms such as syncope, fainting, electric shock, road march, parade formation, immunization, blood draw, diabetes, and hypoglycemic.
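
The exclusion logic described above can be sketched as a keyword matcher with a veto list. This is a minimal illustration, not the DoD ESSENCE parser: the exclusion terms come from the text, but the inclusion terms and the function name are assumptions made for the example.

```python
import re

# Exclusion terms listed in the text above.
EXCLUDE_TERMS = [
    "syncope", "fainting", "electric shock", "road march", "parade formation",
    "immunization", "blood draw", "diabetes", "hypoglycemic",
]

# Inclusion terms are hypothetical; the real parser's term list is not given.
INCLUDE_TERMS = ["shock", "coma", "unresponsive", "loss of consciousness"]


def maps_to_shock_coma(reason_for_visit):
    """Return True if the free text maps to Shock/Coma and no exclusion applies."""
    text = reason_for_visit.lower()
    # Any exclusion term vetoes the mapping, even if an inclusion term is present.
    if any(term in text for term in EXCLUDE_TERMS):
        return False
    # Whole-word matching avoids hits inside longer, unrelated words.
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text) for term in INCLUDE_TERMS
    )
```

With these lists, "Pt unresponsive, fever x2 days" maps to the syndrome, while "electric shock to hand" and "loss of consciousness during parade formation" do not.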

Objective

Rather than rely on diagnostic codes as the core data source for alert detection, this project sought to develop a Chief Complaint (CC) text parser to use in the U.S. Department of Defense (DoD) version of the Electronic Surveillance System for Early Notification of Community-Based Epidemics (ESSENCE), thereby providing an alternate evidence source. A secondary objective was to compare the diagnostic and CC data sources for complementarity.

Submitted by elamb on
Description

Parallel surveillance, the separate monitoring of each continuous series, has been widely used for multivariate surveillance, but it has severe limitations. First, it faces the multiplicity problem that arises from multiple testing. Second, ignoring correlation between series (CBS) reduces outbreak detection performance if the data are truly correlated. Finally, since health data are normally dependent over time, correlation within series (CWS) must also be taken into account. Sufficient reduction methods reduce the dimensionality of a simple multivariate series to a univariate series that has been proved sufficient for monitoring a mean shift in multivariate surveillance (1, 2). Having considered the sufficiency property and the nature of health data, we propose a sufficient reduction method for detecting a mean shift in multivariate series where CWS and CBS are taken into account.
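
For intuition, consider the simplest setting, which ignores CWS: observations x_t that are independent over time and p-variate normal with known covariance matrix Sigma, whose mean shifts from mu_0 to mu_1. The worked equation below is a textbook illustration under those assumptions, not the method proposed here. The log-likelihood ratio for a single observation is

```latex
\log \frac{f_{\mu_1}(x_t)}{f_{\mu_0}(x_t)}
  = (\mu_1 - \mu_0)^{\top} \Sigma^{-1} x_t
    - \tfrac{1}{2} \left( \mu_1^{\top} \Sigma^{-1} \mu_1
                          - \mu_0^{\top} \Sigma^{-1} \mu_0 \right)
```

The data enter only through the scalar z_t = (mu_1 - mu_0)^T Sigma^{-1} x_t, so in this setting monitoring the univariate series z_t loses no information about the shift, and CBS is accounted for through Sigma^{-1}.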

Objective

To reduce the dimensionality of a p-dimensional multivariate series to a univariate series by deriving sufficient statistics that take into account all the information in the original data, including correlation within series (CWS) and correlation between series (CBS).

Submitted by elamb on
Description

Spatial cluster analysis is considered an important technique for the elucidation of disease causes and epidemiological surveillance. Kulldorff's spatial scan statistic, defined as a likelihood ratio, is the usual measure of the strength of geographic clusters. The circular scan, a particular case of the spatial scan statistic, is currently the most used tool for the detection and inference of spatial clusters of disease.
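
As a concrete reference point, the sketch below computes Kulldorff's Poisson log-likelihood ratio and runs a basic circular scan over area centroids. It is a minimal illustration under stated assumptions: the input layout `(x, y, cases, population)`, the function names, and the 50% population cap are choices made for the example, and significance testing by Monte Carlo replication is omitted.

```python
from math import log


def poisson_llr(c, e, total):
    """Kulldorff's Poisson log-likelihood ratio for a zone with c observed and
    e expected cases, out of `total` cases on the map (0 unless c > e)."""
    if c <= e:
        return 0.0
    inside = c * log(c / e)
    outside = (total - c) * log((total - c) / (total - e)) if total > c else 0.0
    return inside + outside


def circular_scan(areas, max_pop_frac=0.5):
    """areas: list of (x, y, cases, population). Returns (best_llr, zone_indices)."""
    total_cases = sum(a[2] for a in areas)
    total_pop = sum(a[3] for a in areas)
    best_llr, best_zone = 0.0, []
    for cx, cy, _, _ in areas:
        # Grow a circle around this centroid by adding areas nearest-first.
        order = sorted(
            range(len(areas)),
            key=lambda j: (areas[j][0] - cx) ** 2 + (areas[j][1] - cy) ** 2,
        )
        cases = pop = 0
        for n_in_zone, j in enumerate(order, start=1):
            cases += areas[j][2]
            pop += areas[j][3]
            if pop > max_pop_frac * total_pop:
                break  # zone exceeds the population cap
            expected = total_cases * pop / total_pop
            llr = poisson_llr(cases, expected, total_cases)
            if llr > best_llr:
                best_llr, best_zone = llr, sorted(order[:n_in_zone])
    return best_llr, best_zone


# Hypothetical map: four areas on a line, with excess cases in the first two.
example = [(0, 0, 10, 100), (1, 0, 12, 100), (5, 0, 2, 100), (6, 0, 1, 100)]
llr, zone = circular_scan(example)
```

The usual inference step would compare `llr` with the maxima obtained from many maps simulated under the null hypothesis.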

Kulldorff's spatial scan statistic for aggregated area maps searches for clusters of cases without specifying their size (number of areas) or geographic location in advance. Their statistical significance is tested while adjusting for the multiple testing inherent in such a procedure. However, as this work shows, the adjustment is not applied evenly across all possible cluster sizes. We propose a modification to the usual inference test of the spatial scan statistic, incorporating additional information about the size of the most likely cluster found.

Objective

We propose a modification to the usual inference test of the spatial scan statistic, incorporating additional information about the size of the most likely cluster found.

Submitted by elamb on
Description

The spatial scan statistic proposed by Kulldorff has been widely used in spatial disease surveillance and other spatial cluster detection applications. One version of the scan statistic was developed for an inhomogeneous Poisson process. However, the underlying Poisson process may not model the data properly. In particular, for diseases with very low prevalence, the number of cases may be very low, and excess zeros may bias the inferences.

Lambert introduced the zero-inflated Poisson (ZIP) regression model to account for excess zeros in counts of manufacturing defects. The model has since been applied in numerous settings. Count data, like contingency tables, often contain cells having zero counts. If a given cell has a positive probability associated with it, a zero count is called a sampling zero. However, a zero for a cell in which it is theoretically impossible to have observations is called a structural zero.
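
The ZIP distribution mixes the two kinds of zero: with probability p an observation is a structural zero, and otherwise it is drawn from a Poisson distribution with mean lambda, which can itself yield a sampling zero. A minimal sketch of the probability mass function follows; the function and parameter names are illustrative.

```python
from math import exp, factorial


def zip_pmf(y, lam, p):
    """P(Y = y) for a zero-inflated Poisson with Poisson mean `lam` and
    structural-zero probability `p`."""
    poisson = exp(-lam) * lam**y / factorial(y)
    structural = p if y == 0 else 0.0
    return structural + (1 - p) * poisson


# With lam = 2 and p = 0.3, zeros are far more likely than under a plain
# Poisson: 0.3 + 0.7 * exp(-2) versus exp(-2).
prob_zero_zip = zip_pmf(0, 2.0, 0.3)
prob_zero_poisson = zip_pmf(0, 2.0, 0.0)
```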

Objective

The scan statistic is widely used in spatial cluster detection applications of inhomogeneous Poisson processes. However, real data may depart substantially from the underlying Poisson process. One possible departure is an excess of zeros. Some studies point out that, when applied to zero-inflated data, the spatial scan statistic may produce biased inferences. In particular, Gómez-Rubio and López-Quílez argue that Kulldorff’s scan statistic may not be suitable for very rare diseases. In this work we develop a closed-form scan statistic for cluster detection of spatial count data with excess zeros.

Submitted by elamb on
Description

Ordering-based approaches [1,2] and quadtrees [3] have been introduced recently to detect multiple spatial clusters in point event datasets. The Autonomous Leaves Graph (ALG) [4] is an efficient graph-based data structure for handling communication between cells in discrete domains. This adaptive data structure has compared favorably with common tree-based data structures (quadtrees). An additional feature of the ALG data structure is the total ordering of the component cells through a modified adaptive Hilbert curve, which sequentially links the cells (the orange curve in the example of Figure 1).
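
To illustrate how a Hilbert curve imposes a total order on cells, the sketch below computes the standard Hilbert index on a uniform n-by-n grid, where n is a power of two. This is a deliberate simplification: the ALG uses a modified adaptive curve over cells of varying size, which is not reproduced here, and the function name is an assumption.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve filling an n-by-n grid.
    n must be a power of two; the result lies in 0 .. n*n - 1."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Order the 16 cells of a 4-by-4 grid along the curve.
ordered_cells = sorted(
    ((x, y) for x in range(4) for y in range(4)),
    key=lambda cell: hilbert_index(4, *cell),
)
```

Consecutive cells in `ordered_cells` are always grid neighbors, which is the locality property that makes such an ordering useful for growing candidate clusters along the curve.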

Objective

To detect multiple significant spatial clusters of disease in case-control point event data using the Autonomous Leaves Graph and the spatial scan statistic.

Submitted by elamb on
Description

Data obtained through public health surveillance systems are used to detect and locate clusters of disease cases in space and time, which may indicate the occurrence of an outbreak or an epidemic. We present a methodology based on adaptive likelihood ratios to compare the null hypothesis (no outbreaks) against the alternative hypothesis (presence of an emerging disease cluster).

Objective

Disease surveillance is based on methodologies to detect outbreaks as soon as possible, given an acceptable false alarm rate. We present an adaptive likelihood ratio method based on the properties of the martingale structure which allows the determination of an upper limit for the false alarm rate.
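
The link between the martingale structure and the false alarm bound can be sketched for a single series of Poisson counts. If the outbreak rate used at each step is estimated from past observations only, every likelihood ratio factor has conditional expectation 1 under the null hypothesis, so the running product is a non-negative martingale and the probability that it ever reaches 1/alpha is at most alpha. The code below is a simplified, purely temporal illustration of that idea, not the spatio-temporal method of this work; the function name, the known `baseline` rate, and the running-mean estimator are assumptions.

```python
from math import log


def adaptive_lr_alarm(counts, baseline, alpha=0.05):
    """Return the first index at which the adaptive likelihood ratio reaches
    1/alpha, or None if it never does.

    counts:   observed counts, one per time step
    baseline: assumed known Poisson rate under the null hypothesis
    alpha:    upper limit on the false alarm probability
    """
    log_threshold = log(1.0 / alpha)
    log_lr = 0.0
    past_total = 0.0
    for t, x in enumerate(counts):
        # Outbreak rate estimated from PAST data only, never below baseline.
        est = max(baseline, past_total / t) if t > 0 else baseline
        # Log of the Poisson likelihood ratio f_est(x) / f_baseline(x).
        log_lr += x * log(est / baseline) - (est - baseline)
        if log_lr >= log_threshold:
            return t
        past_total += x
    return None
```

For example, with a baseline of 5 the series 5, 5, 20, 20, 20, 20 raises an alarm at index 3, once the elevated count at index 2 has entered the estimate, while a series that stays at 5 never does.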

Submitted by elamb on
Description

The ability to rapidly detect any substantial change in disease incidence is critical for a timely public health response and, consequently, for reducing undue morbidity and mortality. Unlike testing methods (1, 2), modeling for spatio-temporal disease surveillance is relatively recent and is a very active area of statistical research (3). Models describing the behavior of diseases in space and time allow covariate effects to be estimated and provide better insight into etiology, spread, prediction and control. Most spatio-temporal models have been developed for retrospective analyses of complete data sets (4). However, data in public health registries accumulate over time, and sequential analysis of all the data collected so far is key to early detection of disease outbreaks. When the analysis of spatially aggregated data on multiple diseases is of interest, multivariate models that account for correlations across both diseases and locations may describe the data better and improve understanding of disease dynamics.

Objective

This study develops statistical methodology for on-line surveillance of small-area disease data in the form of counts. As surveillance systems often focus on more than one disease within a predefined area, we extend the surveillance procedure to the analysis of multiple diseases. The multivariate approach allows correlation across diseases to be included and, consequently, increases the outbreak detection capability of the methodology.

Submitted by elamb on
Description

The spatial scan statistic [1] detects significant spatial clusters of disease by maximizing a likelihood ratio statistic F(S) over a large set of spatial regions, typically constrained by shape. The fast localized scan [2] enables scalable detection of irregular clusters by searching over proximity-constrained subsets of locations, using the linear-time subset scanning (LTSS) property to efficiently search over all subsets of each location and its k - 1 nearest neighbors. However, for a fixed neighborhood size k, each of the 2^k subsets is considered equally likely, and thus the fast localized scan does not take into account the spatial attributes of a subset. Hence, we wish to extend the fast localized scan by incorporating soft constraints that give preference to spatially compact clusters while still considering all subsets within a given neighborhood.
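
The LTSS property can be sketched as follows: for score functions that satisfy it, the highest-scoring subset of a neighborhood consists of the top-j locations when they are ranked by a priority such as the ratio of count to baseline, so only k subsets need to be scored instead of 2^k. The example assumes the expectation-based Poisson statistic as F(S) and uses hypothetical function names; it shows the unconstrained search only and does not include the soft compactness constraints proposed here.

```python
from math import log


def ebp_score(count, baseline):
    """Expectation-based Poisson log-likelihood ratio for aggregate count and baseline."""
    if count <= baseline:
        return 0.0
    return count * log(count / baseline) + baseline - count


def ltss_best_subset(counts, baselines):
    """Best-scoring subset of one neighborhood, found by scoring only the
    top-j locations ranked by count/baseline rather than all 2^k subsets."""
    order = sorted(
        range(len(counts)), key=lambda i: counts[i] / baselines[i], reverse=True
    )
    best_score, best_subset = 0.0, []
    total_count = total_baseline = 0.0
    for j, i in enumerate(order, start=1):
        total_count += counts[i]
        total_baseline += baselines[i]
        score = ebp_score(total_count, total_baseline)
        if score > best_score:
            best_score, best_subset = score, sorted(order[:j])
    return best_score, best_subset


# Hypothetical neighborhood of k = 5 locations.
score, subset = ltss_best_subset([12, 3, 9, 5, 20], [5, 4, 8, 6, 7])
```

In the fast localized scan this search would be repeated for the neighborhood of every location, and the overall maximum retained.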

Objective

We present a new method for efficiently and accurately detecting irregularly-shaped outbreaks by incorporating "soft" constraints, rewarding spatial compactness and penalizing sparse regions.

Submitted by elamb on
Description

PyConTextKit is a web-based platform that extracts entities from clinical text and provides relevant metadata (for example, whether the entity is negated or hypothetical) using simple lexical clues occurring in the window of text surrounding the entity. The system provides a flexible framework for clinical text mining, which in turn expedites the development of new resources and simplifies the resulting analysis process. PyConTextKit is an extension of an existing Python implementation of the ConText algorithm, which has been used successfully to identify patients with an acute pulmonary embolism and to identify patients with findings consistent with seven syndromes. Public health practitioners are beginning to have access to clinical symptoms, findings, and diagnoses from the electronic medical record (EMR). Making use of these data is difficult because much of the information is in the form of free text. Natural language processing techniques can be leveraged to make sense of this text, but such techniques often require technical expertise. PyConTextKit provides a web-based interface that makes it easier for the user to perform concept identification for surveillance.
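
The idea of using lexical clues in a window of text can be shown with a toy negation check. This sketch is not the PyConTextKit or ConText API: the trigger list, the window size, and the function name are assumptions, and it omits ConText features such as post-entity triggers and scope termination.

```python
import re

# Small illustrative subset of negation triggers (assumed, not the ConText lexicon).
NEGATION_TRIGGERS = {"no", "not", "denies", "without"}


def is_negated(text, entity, window=5):
    """Return True if a negation trigger occurs within `window` tokens before
    the first mention of `entity`, False if not, and None if the entity is absent."""
    tokens = re.findall(r"[a-z']+", text.lower())
    target = entity.lower().split()
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            preceding = tokens[max(0, i - window):i]
            return any(tok in NEGATION_TRIGGERS for tok in preceding)
    return None
```

Under these assumptions, "Patient denies chest pain." yields True for the entity "chest pain", whereas "Patient reports chest pain." yields False.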

Objective

We describe the development of a web-based application, PyConTextKit, to support text mining of clinical reports for public health surveillance.

Submitted by elamb on