Data Analytics

Description

Graph theory concepts are well established in epidemiology, with particular success as a description of agent-based modeling. An agent-based viewpoint leads to conclusions about the spatial distribution of links: infection is more likely among individuals in close proximity. In this analysis, we seek evidence of these temporal-spatial links through the properties of random geometric graphs.

Our investigation begins with the interpoint distance distribution (IDD) approaches referenced, which offer a promising way to detect outbreaks that are localized in both space and time. The observed distribution is compared, using a Mahalanobis-based metric, to an expected distribution derived from historical records.
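The comparison described above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the three distance bins, the cut points, the simulated uniform "historical" windows, and all function names are assumptions made for the example. The sketch summarizes each window of case locations by the share of interpoint distances falling in a near bin and a middle bin, estimates the expected profile and its covariance from historical windows, and scores a new window by its squared Mahalanobis distance from that expectation.

```python
import itertools
import math
import random


def interpoint_distances(points):
    """Euclidean distance between every pair of case locations."""
    return [math.dist(p, q) for p, q in itertools.combinations(points, 2)]


def bin_proportions(distances, cuts=(1.0, 3.0)):
    """Share of distances in the 'near' and 'middle' bins (cut points are assumed).

    The third ('far') bin is implied because the three shares sum to 1.
    """
    near_cut, mid_cut = cuts
    n = len(distances)
    near = sum(d < near_cut for d in distances) / n
    middle = sum(near_cut <= d < mid_cut for d in distances) / n
    return near, middle


def mean_and_cov(rows):
    """Mean vector and sample covariance (sxx, sxy, syy) of 2-D profiles."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(obs, mean, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = obs[0] - mean[0], obs[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


rng = random.Random(0)  # fixed seed so the simulated history is reproducible


def uniform_window(n):
    """Hypothetical window of n cases spread uniformly over a 10 x 10 area."""
    return [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(n)]


# Simulated stand-in for historical records: 30 windows of 20 cases each.
history = [uniform_window(20) for _ in range(30)]
profiles = [bin_proportions(interpoint_distances(w)) for w in history]
expected, cov = mean_and_cov(profiles)

# A window in which half of the cases form a tight spatial cluster.
cluster = [(5 + 0.05 * i, 5.0) for i in range(10)]
clustered_window = cluster + uniform_window(10)

history_scores = [mahalanobis_sq(p, expected, cov) for p in profiles]
clustered_score = mahalanobis_sq(
    bin_proportions(interpoint_distances(clustered_window)), expected, cov
)
```

In this toy setting the clustered window has an excess of short interpoint distances, so its score stands well apart from the historical scores. A uniform background is the easy case; it does not reflect the strongly varying patient density of a real catchment.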

Unfortunately, when applied to a complex data set such as that from Children’s Hospital Boston, the IDD provides inadequate power. Emergency Department chief complaints from 1/1/2000-12/31/2004 were used to identify patients with infectious respiratory illness based on a triage process.

As in most realistic catchments, the historic density of patients varies greatly over the catchment area.

Objective

This paper uses geometric random graph concepts to develop early detection algorithms for the real-time detection and localization of outbreaks.

Submitted by elamb on
Description

T-Cube is especially useful for rapidly retrieving responses to ad-hoc queries against large datasets of additive time series labeled using a set of categorical attributes. It can be used as a general tool to support any task requiring access to such data. From the application’s perspective it is transparent: it acts just like the database itself, only much faster to respond. The authors have put T-Cube into practical use as an enabling technology in applications requiring massive screening of multidimensional temporal data. These applications include two systems to support monitoring of food and agriculture safety and predictive analytics developed at the US Department of Agriculture and the Food and Drug Administration, as well as a system to monitor and forecast the health of a fleet of aircraft operated by the US Air Force.
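To make the idea of fast ad-hoc queries over additive time series concrete, here is a minimal sketch of precomputed aggregation. It is not T-Cube's actual internal structure, which the text above does not describe; it only illustrates why additivity allows a query to become a lookup. The attribute names, the record layout, and the `ANY` wildcard are assumptions made for the example.

```python
from itertools import product

ANY = "*"  # hypothetical wildcard meaning "all values of this attribute"


def build_cube(records, n_attrs, n_steps):
    """Precompute one summed time series per attribute pattern.

    records: iterable of (attribute_tuple, time_index, count).
    Because counts are additive, each record is simply added into every
    pattern that matches it (each attribute either kept or wildcarded).
    """
    cube = {}
    for attrs, t, count in records:
        for mask in product((False, True), repeat=n_attrs):
            key = tuple(ANY if wild else a for a, wild in zip(attrs, mask))
            series = cube.setdefault(key, [0] * n_steps)
            series[t] += count
    return cube


def query(cube, n_steps, *pattern):
    """Answer an ad-hoc query with a single dictionary lookup."""
    return cube.get(tuple(pattern), [0] * n_steps)


# Hypothetical records: (product category, region), day index, units sold.
records = [
    (("cold_remedy", "north"), 0, 3),
    (("cold_remedy", "south"), 0, 2),
    (("antidiarrheal", "north"), 1, 4),
    (("cold_remedy", "north"), 2, 1),
]
cube = build_cube(records, n_attrs=2, n_steps=3)

cold_all_regions = query(cube, 3, "cold_remedy", ANY)
north_all_products = query(cube, 3, ANY, "north")
everything = query(cube, 3, ANY, ANY)
```

This naive version stores every attribute pattern that occurs, so memory grows quickly with the number of attributes. A practical structure would need to limit what is cached.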

Objective

T-Cube, a data structure designed to efficiently represent large collections of temporal data, has been shown to benefit surveillance applications involving monitoring sales of over-the-counter medications and emergency department visits. In this paper we present efficiencies that can be realized in practical applications of T-Cube beyond its original areas of deployment, and we advocate its widespread use as a technology that makes manual ad-hoc lookups, as well as many kinds of complex automated analyses, feasible.

Submitted by elamb on
Description

One of the challenges facing developers and users of automated disease surveillance systems is being able to accurately evaluate the performance of their systems for the wide variety of public health threats that are possible. A variety of methods have been used in the past to create data sets for use in testing algorithm performance. Synthetic data have been created using agent-based simulations, in which records are generated from the hypothesized activity of individuals with contagious diseases. Such data are only as accurate as the social models and the many assumptions on which they rest. Real data containing elevated levels of respiratory and gastrointestinal activity have been used to evaluate the ability of algorithms to detect the elevated levels. Routine unvalidated outbreaks are typically not public health emergencies and may not represent signals of interest. Another approach is to use real background data and inject a variety of different types of synthetic cases representing various types of outbreaks on top of that background.
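The last approach, injecting synthetic cases on top of real background data, can be illustrated with a short sketch. The daily counts, the outbreak shape, and the function name are hypothetical values chosen for the example, not data from the work described here.

```python
def inject_outbreak(background, start, shape):
    """Return a copy of daily background counts with synthetic cases added.

    background: list of daily counts (the "real" baseline).
    start: index of the first outbreak day.
    shape: extra cases to add on each successive outbreak day.
    Days falling past the end of the background series are ignored.
    """
    injected = list(background)
    for offset, extra in enumerate(shape):
        day = start + offset
        if day < len(injected):
            injected[day] += extra
    return injected


# Hypothetical baseline counts and a short rise-and-fall outbreak signal.
background = [12, 9, 11, 10, 13, 8, 10]
outbreak_shape = [1, 3, 5, 2]

injected = inject_outbreak(background, start=2, shape=outbreak_shape)
# Days on which the evaluation knows an outbreak is present (ground truth).
outbreak_days = [i for i, (a, b) in enumerate(zip(background, injected)) if a != b]
```

Because the injected days are known, a detection algorithm run on `injected` can be scored for sensitivity and timeliness against `outbreak_days`.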

With the introduction of the American Health Information Community (AHIC) Minimum Data Set (MDS), the public health surveillance community should have the potential to obtain greater specificity for alerts generated in automated systems. The introduction of these additional data elements increases the complexity of algorithms using linked data elements. Creating synthetic data sets that accurately estimate relationships among chief complaint, pharmacy, laboratory and radiology is an added complexity in creating synthetic outbreaks for performance evaluation.

Objective

The objective of this presentation is to describe the need for synthetic data containing the elements of the AHIC MDS. Approaches for creating synthetic data with MDS data elements will be presented, and methods for ensuring that confidentiality is maintained will be discussed.

Submitted by elamb on
Description

Seasonal influenza accounts for a high proportion of outpatient morbidity during the winter months. However, influenza case counts are greatly underestimated because influenza frequently goes undiagnosed. Electronic medical record (EMR) systems provide a very large, complex data source for influenza surveillance at both the patient and population level. At the patient level, it is important to identify influenza patients for specimen collection, respiratory isolation for school-age children, prescription of an appropriate influenza drug, or to identify patients at risk for complications. At the population level, public health agencies monitor the tempo and spread of the influenza season for resource management, as well as to maintain situational awareness for avian influenza.

Objective

The objective of this work was to evaluate the utility of classification tree methods for syndromic surveillance case definition development using an EMR system as a data source.

Submitted by elamb on
Description

Syndromic surveillance systems use residential zip codes for spatial analysis to identify disease clusters. However, the use of emergency medical services can be influenced by geographic proximity, specialty services, and severity of illness. We evaluated zip codes reported to the Boston Public Health Commission’s syndromic surveillance system from 10 Boston emergency departments (EDs).

Objective

To examine the distribution of residential zip codes among patients in Boston EDs over a two-month period to better understand how this type of spatial analysis may affect the sensitivity of syndromic surveillance.

Submitted by elamb on
Description

Scientists have used many chief complaint (CC) classification techniques in biosurveillance, including keyword search, weighted keyword search, and naïve Bayes. These techniques may use CC-to-syndrome or CC-to-symptom-to-syndrome classification approaches. In the former approach, we classify a CC directly into syndrome categories. In the latter approach, we first classify a CC into symptom categories. Then, we use a syndrome definition, a combination of one or more symptoms, to determine whether or not a chief complaint belongs in a particular syndrome category. One approach to CC-to-symptom-to-syndrome classification uses manually weighted keyword search and Boolean operations to build syndrome classifiers. A limitation of this approach is that it does not address uncertainty in the data and the system is manually parameterized. A CC-to-symptom-to-syndrome approach that is both probabilistic and uses machine learning addresses these limitations.
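The manually weighted, Boolean variant of CC-to-symptom-to-syndrome classification described above might look like the sketch below. Every keyword, weight, threshold, and syndrome definition here is a made-up illustration, not taken from any real system, and the tokenization is deliberately simplistic.

```python
# Hypothetical keyword weights per symptom (hand-assigned, as in the manual approach).
SYMPTOM_KEYWORDS = {
    "fever": {"fever": 1.0, "febrile": 1.0, "temp": 0.5},
    "cough": {"cough": 1.0, "coughing": 1.0},
    "vomiting": {"vomiting": 1.0, "emesis": 1.0, "n/v": 0.8},
}

# Hypothetical syndrome definitions: Boolean combinations of symptoms.
SYNDROMES = {
    "respiratory": lambda s: "cough" in s,
    "influenza_like": lambda s: "fever" in s and "cough" in s,
    "gastrointestinal": lambda s: "vomiting" in s,
}


def symptoms_for(chief_complaint, threshold=0.75):
    """Step 1: map a chief complaint to symptoms by weighted keyword score."""
    tokens = chief_complaint.lower().replace(",", " ").split()
    found = set()
    for symptom, weights in SYMPTOM_KEYWORDS.items():
        score = sum(weights.get(tok, 0.0) for tok in tokens)
        if score >= threshold:
            found.add(symptom)
    return found


def syndromes_for(chief_complaint):
    """Step 2: apply each syndrome definition to the detected symptoms."""
    symptoms = symptoms_for(chief_complaint)
    return {name for name, rule in SYNDROMES.items() if rule(symptoms)}


flu_like = syndromes_for("Fever, cough x3 days")
weak_match = syndromes_for("temp")
gi = syndromes_for("N/V since am")
```

The hard threshold shows the limitation noted above: a complaint of just "temp" scores 0.5 and is dropped outright, whereas a probabilistic classifier could carry that uncertainty forward instead of discarding it.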

Objective

Design, build and evaluate a symptom-based probabilistic chief complaint classifier for the Real-time Outbreak and Disease Surveillance System.

Submitted by elamb on
Description

The 2005 Youth Risk Behavior Survey of 9th to 12th graders in Miami-Dade County public schools found that 69.7% of students tried alcohol, 28.3% tried marijuana, and 6.3% tried cocaine in their lifetime. Results also showed that Hispanics had a higher percentage of usage when compared to Blacks or Whites. The 2007 White House Office of National Drug Control Policy special report entitled “Hispanic Teens and Drugs” also concluded that Hispanics were at the highest risk for substance abuse. With the county’s 60% Hispanic population, this issue is of concern for the community. This is the first study to compare multiple sources of data to describe substance abuse among youth from areas such as healthcare utilization to criminal charges.

Submitted by elamb on
Description

Identifying potential biases and confounders that may affect data quality is an important consideration when evaluating surveillance systems. Predictable temporal trends are a key requirement for improving the specificity of outbreak detection. Identifying factors that affect the reliability of the temporal trends observed in the data may improve the ability to identify aberrations in those trends. During a retrospective study of a dataset of microbiology orders from the veterinary teaching hospital at The Ohio State University for 2003, we noticed regular intervals when increases in the number of culture orders were not accompanied by proportional increases in the number of isolates. These instances appeared to occur at intervals that coincided with the clinical rotation of senior veterinary students within the hospital.

Objective

This paper reports on a potential confounder discovered during an investigation of microbiology orders in a veterinary teaching hospital as a possible data source for outbreak detection.

Submitted by elamb on
Description

Numerous methods have been applied to the problem of modeling temporal properties of disease surveillance data; the ESSENCE system contains a widely used approach (1). STL (2) is a flexible, well-proven method for temporal modeling that decomposes the series into frequency components. A periodic component such as day-of-week (DW) can be exactly periodic or evolve through time. STL is based on loess (3), which can model a numeric response as a function of any explanatory variables. After the STL modeling of the counts, we will add patient address and produce a time-space model using both STL and more general loess methods.
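As a rough illustration of what an exactly periodic day-of-week component means, the sketch below splits a daily count series into an overall level, a fixed weekday effect, and a remainder. This is not STL: it uses plain weekday means with no loess smoothing, no trend term, and no evolving seasonality, and the counts are synthetic values invented for the example.

```python
def day_of_week_effects(counts, period=7):
    """Estimate a fixed periodic component as weekday mean minus overall mean."""
    overall = sum(counts) / len(counts)
    effects = []
    for d in range(period):
        same_day = counts[d::period]  # every observation on weekday d
        effects.append(sum(same_day) / len(same_day) - overall)
    return overall, effects


def remainder(counts, overall, effects):
    """What is left after removing the level and the weekday effect."""
    period = len(effects)
    return [c - overall - effects[i % period] for i, c in enumerate(counts)]


# Synthetic three-week series: level 10 plus a repeating weekday pattern.
pattern = [2, 1, 0, 0, -1, -1, -1]
counts = [10 + pattern[i % 7] for i in range(21)]

overall, effects = day_of_week_effects(counts)
resid = remainder(counts, overall, effects)
```

On this noise-free series the weekday effects recover the pattern exactly and the remainder is zero. On real counts the remainder is where aberrations would be sought, and a method such as STL would additionally let the weekday effect and trend change over time.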

Objective

Use the STL local-regression (loess) decomposition procedure and transformation to model the univariate time-series characteristics of chief-complaint daily counts as a first step toward time and spatial modeling. Develop visualization tools for model display and checking.

Submitted by elamb on
Description

Bordetella pertussis outbreaks cause morbidity in all age groups, but the infection is most dangerous for young infants. Pertussis is difficult to diagnose, especially in its early stages, and definitive test results are not available for several days. Because of the temporal and geographic variability of pertussis outbreaks, the delay in diagnostic test results, and the ramifications of incorrect management decisions at the point of care, pertussis represents a prototypical disease where real-time public health surveillance data might inform, guide and improve medical decision making. Previously, we showed that diagnostic accuracy for meningitis can be improved when information about recent, local disease incidence is accounted for. Here, we quantify the contribution of epidemiologic context to a clinical prediction model for pertussis using a state public health data stream.

Objective

To explore the integration of epidemiological context – current population-level disease incidence data – into a clinical prediction model for pertussis.

Submitted by elamb on