Data Analytics

Description

A major goal of biosurveillance is the timely detection of an infectious disease outbreak. Once a disease has been identified, another very important goal is to find all known cases of the disease to assist public health investigators. Natural language processing (NLP) systems may be able to assist in identifying epidemiological variables and decrease time-consuming manual review of records.

 

Objective

To identify epidemiologically important factors, such as infectious disease exposure history, travel, or other specific variables, from unstructured data using NLP methods.
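
As a rough illustration of pulling such variables out of free text, the sketch below uses simple keyword patterns. It is not the method evaluated in this work. The pattern lists and the function name `extract_factors` are hypothetical, and a usable system would also need negation handling, misspelling tolerance, and a curated lexicon.

```python
import re

# Hypothetical keyword patterns, chosen only for illustration.
PATTERNS = {
    "travel": re.compile(
        r"\b(travel(?:ed|led|ing|ling)?|returned from|trip to)\b", re.IGNORECASE
    ),
    "exposure": re.compile(
        r"\b(exposed to|exposure to|contact with|sick contact)\b", re.IGNORECASE
    ),
}


def extract_factors(note: str) -> dict[str, list[str]]:
    """Return the matched phrases for each factor mentioned in a free-text note."""
    found = {}
    for factor, pattern in PATTERNS.items():
        matches = [m.group(0).lower() for m in pattern.finditer(note)]
        if matches:
            found[factor] = matches
    return found


if __name__ == "__main__":
    # Made-up note text, not taken from any real record.
    print(extract_factors("Pt returned from trip to Brazil, exposed to sick contact."))
```

A rule-based pass like this is easy to audit, which matters when the output is meant to reduce manual record review rather than replace it.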

Submitted by elamb on
Description

The North Carolina Disease Event Tracking and Epidemiologic Collection Tool (NC DETECT) serves public health users across NC at the local, regional and state levels, providing early event detection and situational awareness capabilities. At the state level, our primary users are in the General Communicable Disease Control Branch of the NC Division of Public Health. NC DETECT receives 10 different data feeds daily including emergency department visits, emergency medical service runs, poison center calls, veterinary laboratory test results, and wildlife treatment.

To meet our users’ needs with limited staff, NC DETECT uses business intelligence tools to acquire and process its multiple, disparate data sources and to report findings to its numerous end users. Business intelligence can be described as a broad category of application programs and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions.

 

Objective

We report here on how NC DETECT uses business intelligence tools to automate both data capture and reporting in order to run a comprehensive surveillance system with limited resources.

Submitted by elamb on
Description

There has been much recent interest in using disease signatures to better recognize disease outbreaks. Conversely, the metrics used to describe these signatures can also be used to better characterize the outbreaks. Recent work at the New York City Department of Health has shown the ability to identify characteristic age-specific patterns during influenza outbreaks. One issue that remains is how to implement a search for such patterns using prospective outbreak detection tools such as SaTScan.

A potential approach to this problem arises from another currently active research area: the simultaneous use of multiple data streams. One form of this is to disaggregate a data stream with respect to a third variable such as age. Two drawbacks to this approach are that the categories used to make the streams have to be defined a priori and that relationships between the streams cannot be exploited. Furthermore, the resulting description is less rich, as it describes outbreaks in a few non-overlapping age-specific streams. It would be desirable to look for age-specific patterns with the age groupings implicitly defined.

 

Objective

This paper presents an implementation of a citywide SaTScan analysis that uses age as a one-dimensional spatial variable. The analysis identifies age-specific clusters of respiratory and fever/flu syndromes in the New York City Emergency Department data.
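
The idea of treating age as a one-dimensional "location" can be sketched as a scan over contiguous age intervals, scoring each interval with a Kulldorff-style Poisson log likelihood ratio. This is an illustrative stand-in, not SaTScan itself. The function names and the `max_width` parameter are assumptions, and the Monte Carlo significance testing that a scan statistic normally requires is omitted.

```python
import math


def poisson_llr(c: float, e: float, total: float) -> float:
    """Poisson log likelihood ratio for an excess of cases inside a window."""
    if c <= e or e <= 0:
        return 0.0  # only excesses are of interest
    inside = c * math.log(c / e)
    outside = 0.0
    if total > c:
        outside = (total - c) * math.log((total - c) / (total - e))
    return inside + outside


def best_age_window(observed, expected, max_width=20):
    """Scan every contiguous age interval up to max_width years wide.

    observed[i] and expected[i] are counts for age i; expected is assumed
    to be scaled so that it sums to the same total as observed.
    Returns (llr, (first_age, last_age)), or (0.0, None) if no excess exists.
    """
    total = sum(observed)
    best = (0.0, None)
    for start in range(len(observed)):
        c = e = 0.0
        for end in range(start, min(start + max_width, len(observed))):
            c += observed[end]
            e += expected[end]
            llr = poisson_llr(c, e, total)
            if llr > best[0]:
                best = (llr, (start, end))
    return best
```

Because the interval boundaries are found by the scan rather than fixed in advance, the age groupings are defined implicitly by the data.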

Submitted by elamb on
Description

A common task for U.S. Department of Agriculture (USDA) food safety analysts is to estimate the risk of observing positive outcomes of microbial tests of food samples collected at slaughter and food processing establishments. The resulting risk estimates can be used, among other criteria, to drive the allocation of investigative resources at the USDA Food Safety and Inspection Service (FSIS). The Activity From Demographics and Links (AFDL) algorithm is a computationally efficient method for estimating the activity of unlabeled entities in a graph from the patterns of connectivity of known active entities and from their demographic profiles. It has been used successfully in social network analysis and intelligence applications.

To test its utility in the food safety context, we treat a co-occurrence of the same strain of bacteria (in particular, a specific serotype of Salmonella) in samples taken at different establishments at roughly the same time as a link in the graph spanning all USDA-controlled establishments. Given the historical patterns of linkage and the distribution of currently observed microbial positives (which make the corresponding establishments “active” in AFDL terminology), we aim to predict which of the remaining establishments are also likely to report positive test results. This definition of a link produces uncertain data, because co-occurrences of specific test results at different establishments may be purely coincidental and our analysis does not attempt to distinguish them from truly correlated instances. Even so, we expect that combining this inherently noisy data with demographic features of establishments will yield useful predictions of microbial events.
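
The link definition described above can be sketched as follows. This covers only the graph construction step, not AFDL itself. The 7-day window standing in for "roughly the same time", the function name `build_links`, and the record layout are all assumptions made for illustration.

```python
from datetime import date
from itertools import combinations


def build_links(positives, window_days=7):
    """Link establishments whose positives share a serotype within window_days.

    positives: sequence of (establishment_id, serotype, sample_date) tuples.
    Returns a dict mapping a sorted establishment pair to its link count.
    """
    links = {}
    for (est_a, sero_a, day_a), (est_b, sero_b, day_b) in combinations(positives, 2):
        if est_a == est_b or sero_a != sero_b:
            continue  # a link needs two different establishments and one serotype
        if abs((day_a - day_b).days) <= window_days:
            pair = tuple(sorted((est_a, est_b)))
            links[pair] = links.get(pair, 0) + 1
    return links


if __name__ == "__main__":
    # Made-up records; establishment IDs and dates are hypothetical.
    sample = [
        ("E1", "Typhimurium", date(2006, 3, 1)),
        ("E2", "Typhimurium", date(2006, 3, 4)),
        ("E3", "Typhimurium", date(2006, 4, 20)),
    ]
    print(build_links(sample))
```

Counting repeated co-occurrences per pair, rather than recording a link only once, keeps some information about how often two establishments coincide, which may help offset the noise from purely coincidental links.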

 

Objective

The objective of the research summarized in this paper is to evaluate the utility of the AFDL algorithm in predicting the likelihood of positive isolates from microbial testing of food samples collected at USDA-controlled establishments.

Submitted by elamb on
Description

Medical surveillance in the military can be improved through the use of clinical laboratory results collected within the Military Health System. This presentation describes an effort to establish Electronic Laboratory Reporting in the military using existing Health Level 7 (HL7) messages. HL7 data is being evaluated for data integrity, completeness, reliability and validity. In addition, initial efforts to evaluate, standardize, and use this data to support investigations of interest over the past year are presented.
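
A completeness check of the kind mentioned above might start from something like the sketch below, which reads the observation (OBX) segments of an HL7 version 2 message. It assumes the default `|` field separator and does no escape handling. The function names and the test codes in the sample are hypothetical and do not reflect the actual military data feed.

```python
def parse_obx_segments(message: str) -> list[dict]:
    """Extract selected fields from the OBX segments of an HL7 v2 message."""
    results = []
    # HL7 v2 separates segments with carriage returns; tolerate newlines too.
    for segment in message.replace("\n", "\r").split("\r"):
        fields = segment.split("|")
        if fields[0] != "OBX":
            continue
        fields += [""] * (9 - len(fields))  # pad segments with trailing fields omitted
        results.append({
            "test": fields[3],   # OBX-3 observation identifier
            "value": fields[5],  # OBX-5 observation value
            "units": fields[6],  # OBX-6 units
            "flag": fields[8],   # OBX-8 abnormal flags
        })
    return results


def completeness(results, field):
    """Fraction of parsed results with a non-empty value for the given field."""
    if not results:
        return 0.0
    return sum(1 for r in results if r[field]) / len(results)


if __name__ == "__main__":
    # Made-up message; the identifiers CULT and WBC are illustrative only.
    msg = (
        "MSH|^~\\&|LAB|FAC|||200701011200||ORU^R01|1|P|2.3\r"
        "OBX|1|ST|CULT^Culture||Salmonella species\r"
        "OBX|2|NM|WBC^White count|||10*3/uL"
    )
    print(completeness(parse_obx_segments(msg), "value"))
```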

 

Objective

This presentation describes the HL7 clinical lab results dataset and how it can be, and has been, used for medical surveillance in the military.

Submitted by elamb on
Description

SaTScan is a program often used for space-time cluster detection. In order to run SaTScan, the data must be in a pre-specified text format. Once the input files are in the correct format, the typical user opens SaTScan, chooses the appropriate options, and runs SaTScan. The output from SaTScan consists of one or more text files with statistical and geographical information about the clusters. Errors in SaTScan often require re-extraction of the data into the specified text format.

When running SaTScan many times per day, as is commonly done in surveillance, it can be cumbersome to create all of the necessary data sets and run SaTScan. This is also true for any kind of evaluation of systems that rely on SaTScan for surveillance. In addition, the lack of graphical output, such as a map of the areas identified in the cluster, detracts from the utility of otherwise excellent software.
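
To give a sense of the input preparation step, the sketch below aggregates visit records into lines of the form location, case count, date, which is the general shape of a SaTScan case file. It is written in Python rather than SAS and is not the interface described here. The function name is hypothetical, and the exact delimiter and date format should be checked against the SaTScan user guide for the version in use.

```python
import io
from collections import Counter
from datetime import date


def write_case_file(visits, out):
    """Write aggregated counts as '<location> <cases> <date>' lines.

    visits: iterable of (location_id, visit_date) pairs, where location_id
    contains no spaces and visit_date is a datetime.date.
    out: any text file-like object.
    """
    counts = Counter(visits)
    for (location, day), n in sorted(counts.items()):
        out.write(f"{location} {n} {day:%Y/%m/%d}\n")


if __name__ == "__main__":
    # Made-up visits keyed by a hypothetical ZIP code location ID.
    buffer = io.StringIO()
    write_case_file([("27514", date(2007, 1, 2)), ("27514", date(2007, 1, 2))], buffer)
    print(buffer.getvalue(), end="")
```

Generating the file from the source records on every run avoids hand-edited input files, which is where format errors tend to creep in.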

 

Objective

The purpose of this project was to create a SAS (SAS Institute, Cary, NC) interface for SaTScan that can create the necessary input files, run SaTScan directly from SAS (without using SaTScan’s GUI), and combine the output with geographic boundary files to produce a single-page output containing a map and statistics describing the clusters found by SaTScan.

Submitted by elamb on
Description

Analysis of time series data requires accurate calculation of a predicted value. Non-regression methods such as the Early Aberration Reporting System (EARS) CuSum are computationally simple, but most do not adjust for day of week or holiday. Alternatively, regression methods require larger counts, more computer resources, and possibly longer baseline periods of data. As increasing volumes of data are reported and analyzed, the predictive accuracy of simpler methods should be assessed and optimized.
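
For orientation, a non-regression method of this family can be sketched as a standardized deviation from a short moving baseline. The version below resembles the EARS C1 and C2 baselines but is not a reference implementation, and it is not necessarily one of the methods compared here. The function name, the `lag` parameter, and the floor on the standard deviation are assumptions.

```python
from statistics import mean, stdev


def ears_like_statistic(counts, t, baseline=7, lag=0, min_sd=1e-9):
    """Standardized deviation of counts[t] from a moving baseline.

    lag=0 uses the `baseline` days immediately before day t (C1-like).
    lag=2 leaves a two-day gap before day t (C2-like).
    min_sd is an assumed floor that avoids division by zero on a flat baseline.
    """
    end = t - lag
    start = end - baseline
    if start < 0:
        raise ValueError("not enough history for the requested baseline")
    window = counts[start:end]
    mu, sd = mean(window), stdev(window)
    return (counts[t] - mu) / max(sd, min_sd)


if __name__ == "__main__":
    # Made-up daily counts with a jump on the last day.
    daily = [10, 12, 11, 9, 10, 13, 12, 30]
    print(ears_like_statistic(daily, t=7))
```

Because the baseline is only a week of raw counts, a weekday is compared against a window that includes weekends, which is the missing day-of-week adjustment noted above.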

 

Objective

To compare the predictive accuracy of three non-regression methods in analysis of time series count data.

Submitted by elamb on
Description

The Centers for Disease Control and Prevention BioSense has developed chief complaint (CC) and ICD9 sub-syndrome classifiers for the major syndromes for early event detection and situational awareness. The prevalence of these sub-syndromes in the emergency department population and the performance of these CC classifiers have been little studied. Chart reviews have been used in the past to study this type of question, but the large number of cases to review would make the labor involved prohibitive. Therefore, we used an ICD9 code classifier for a syndrome as a surrogate for chart review to estimate the performance of a CC classifier.

 

Objective

To determine the prevalence of the sub-syndromes based on the ICD9 classifiers, and to determine the sensitivity, specificity, positive predictive value and negative predictive value of CC classifiers for the sub-syndromes associated with the respiratory and gastrointestinal syndromes using the ICD9 classifier as the criterion standard.
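
The measures named above follow directly from a two-by-two table in which the ICD9 classifier supplies the reference label and the CC classifier supplies the test label. The sketch below shows the arithmetic. The function name and the example counts are hypothetical.

```python
def classifier_metrics(tp, fp, fn, tn):
    """Performance of a CC classifier against a criterion standard.

    tp: CC positive, ICD9 positive    fp: CC positive, ICD9 negative
    fn: CC negative, ICD9 positive    tn: CC negative, ICD9 negative
    """
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "prevalence": ratio(tp + fn, tp + fp + fn + tn),
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    for name, value in classifier_metrics(tp=80, fp=20, fn=20, tn=880).items():
        print(f"{name}: {value:.3f}")
```

Positive and negative predictive values depend on prevalence, so they are only comparable across sub-syndromes with similar prevalence in the emergency department population.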

Submitted by elamb on
Description

Syndromic surveillance of emergency department (ED) visit data is often based on computer algorithms that assign patient chief complaints (CC) and ICD code data to syndromes. The triage nurse note (NN) has also been used for surveillance. Previously we developed an “NGram” classifier for syndromic surveillance of ED CC in Italian for detection of natural outbreaks and bioterrorism. The classifier is developed from a set of ED visits for which both the ICD diagnosis code and CC are available, by measuring the associations of text fragments within the CC (e.g. 3 characters for a “3-gram”) with a syndromic group of ICD codes. We found good correlation between the daily volumes identified by the ICD10 classifier and those estimated by NGrams. However, because the CC was limited to 23 options from a pick list, it might be possible to obtain results as good as or better than the NGram method using a simpler probabilistic approach. In addition to the CC, the Italian data included a free-text NN. We might be able to achieve improved performance by applying the n-gram method to the NN or to the CC supplemented by the NN.
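
The general idea of associating character n-grams with a syndrome can be sketched as below. This is a simplified stand-in, not the NGram classifier itself. The scoring rule (mean per-gram proportion of syndrome visits), the function names, and the sample complaints are assumptions.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Return the set of character n-grams in a whitespace-normalized string."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def train(visits, n=3):
    """Estimate, for each n-gram, the share of visits coded to the syndrome.

    visits: iterable of (chief_complaint, in_syndrome) pairs, where
    in_syndrome is True when the visit's ICD code falls in the syndrome group.
    """
    seen = defaultdict(int)
    hits = defaultdict(int)
    for cc, in_syndrome in visits:
        for gram in char_ngrams(cc, n):
            seen[gram] += 1
            hits[gram] += int(in_syndrome)
    return {gram: hits[gram] / seen[gram] for gram in seen}


def score(cc, model, n=3):
    """Mean association over the n-grams of cc that the model has seen."""
    probs = [model[g] for g in char_ngrams(cc, n) if g in model]
    return sum(probs) / len(probs) if probs else 0.0


if __name__ == "__main__":
    # Made-up Italian chief complaints with hypothetical syndrome labels.
    model = train([
        ("tosse e febbre", True),
        ("dolore toracico", False),
        ("febbre alta", True),
        ("trauma cranico", False),
    ])
    print(score("febbre", model), score("trauma", model))
```

Working on character fragments rather than whole words makes the approach tolerant of misspellings and abbreviations, which is one reason it may carry over to free-text nurse notes.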

 

Objective

Our objective was to compare the performance of the NGram CC classifier to two discrete classifiers based on probabilistic associations with the CC pick list items. Also, we wished to determine the performance of the NGram method applied to CC alone, NN alone, and CC plus NN.

Submitted by elamb on
Description

On 27 April 2005, a simulated bioterrorist event—the aerosolized release of Francisella tularensis in the men’s room of luxury box seats at a sports stadium—was used to exercise the disease surveillance capability of the National Capital Region (NCR). The objective of this exercise was to permit all of the health departments in the NCR to exercise inter-jurisdictional epidemiological investigations using an advanced disease surveillance system. Actual system data could not be used for the exercise as it both is proprietary and contains protected, though de-identified, health information about real people; nor is there much historical data describing how such an outbreak would manifest itself in normal syndromic data. Thus, it was essential to develop methods to generate virtual health care records that met specific requirements and represented both ‘normal’ endemic visits (the background) as well as outbreak-specific records (the injects).

 

Objective

This paper describes a flexible modeling and simulation process that can create realistic, virtual syndromic data for exercising electronic biosurveillance systems.
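
A much reduced sketch of the background-plus-injects idea is shown below. It generates daily counts rather than full virtual records, and it is not the modeling process described in this paper. The Poisson background, the function names, and the parameter values are assumptions.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson variate with Knuth's multiplication method (small lam only)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_counts(days, background_mean, injects, seed=0):
    """Return one (background, inject) pair of visit counts per day.

    injects: dict mapping a day index to the number of outbreak records to add.
    seed: fixed so that an exercise scenario can be regenerated exactly.
    """
    rng = random.Random(seed)
    return [(poisson(rng, background_mean), injects.get(day, 0)) for day in range(days)]


if __name__ == "__main__":
    # Hypothetical two-week scenario with injects on days 10 and 11.
    for day, (background, inject) in enumerate(simulate_counts(14, 20, {10: 5, 11: 12})):
        print(day, background, inject)
```

Keeping the background and inject counts separate lets exercise controllers see exactly which records belong to the simulated outbreak when reviewing participants' findings.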

Submitted by elamb on