
Data Analytics

Description

HealthMap (www.healthmap.org) is a freely accessible, automated real-time system that monitors, organizes, integrates, filters, and maps online news about emerging diseases. The system performs geographic parsing (“geo-parsing”) of disease outbreaks by assigning incoming alerts to low-resolution geographic descriptions, such as country, with the help of a purpose-built gazetteer. However, the system is limited by the size of the gazetteer, which precludes high-resolution assignment of place. In this study, we use the prior knowledge encoded in the gazetteer to expand the capabilities of the geo-parsing system.

 

Objective

Discovering geographic references in text is a task that human readers perform using both their lexical and contextual knowledge. Automating this task for real-time surveillance of informal sources on epidemic intelligence therefore requires efforts beyond dictionary-based pattern matching. Here, we describe an automated approach to learning the particular context in which outbreak locations appear and by this means extending prior knowledge encoded in a gazetteer.
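For orientation, the dictionary-based pattern matching that this work aims to go beyond can be sketched as a longest-match lookup against a gazetteer. This is only an illustrative baseline, not HealthMap's implementation: the function name, the two-entry gazetteer, and the example sentence are all made up for the sketch.

```python
def gazetteer_lookup(text, gazetteer, max_words=3):
    """Return (phrase, place_code) pairs found by longest-match dictionary lookup."""
    tokens = text.replace(",", " ").replace(".", " ").split()
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in gazetteer:
                found.append((phrase, gazetteer[phrase]))
                i += n
                break
        else:
            i += 1  # no gazetteer entry starts at this token
    return found


# Hypothetical country-level gazetteer and a made-up alert headline.
gazetteer = {"sierra leone": "SL", "guinea": "GN"}
alert = "Cholera cases reported in Kenema, Sierra Leone."
matches = gazetteer_lookup(alert, gazetteer)
```

In this toy case only the country is matched; the town name is missed because it is not in the gazetteer, which is the kind of limitation a context-learning approach is meant to address.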

Submitted by elamb on
Description

The goal of this project is to compare automated syndromic surveillance queries using raw chief complaints to those pre-processed with the Emergency Medical Text Processor (EMT-P) system.

Description

In the last decade, time series analysis has become one of the most important tools of surveillance systems. Understanding the nature of temporal fluctuations is essential for developing outbreak detection algorithms, assessing aberrations, and controlling for seasonal variations. Typically, when time series methods are applied to health outcomes collected over an extended period of time, it is assumed that population profiles remain constant. In practice, such assumptions have rarely been tested. At best, the temporal analysis is stratified by age or other discriminating factors if heterogeneity is suspected. Any community can experience population changes in various forms. Long-term trends of inflow/outflow migration and rapid transient fluctuations associated with specific events are typical examples of changes in population profile. Seasonality, as an intrinsic property of infectious disease manifestation in a community, is typically attributed to periodic changes in the transmissibility of pathogens. To some extent, seasonal fluctuations in the incidence of infectious diseases could also be associated with changes in population profiles. The ability to detect and describe such changes would provide valuable clues about seasonally changing factors associated with an infection.

 

Objective

The objective of this communication is two-fold: 1) to introduce an analytical approach for assessing temporal changes in the surveillance reporting with respect to population profile; and 2) to demonstrate the utility of this method using laboratory-confirmed cases for four reportable enteric infections (cryptosporidiosis, giardiasis, shigellosis, and salmonellosis) recorded by the Massachusetts Department of Public Health over the last 12 years. This new approach for assessing seasonal changes is based on comparison of gender-specific single-year age distributions, which constitute population profiles.
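As a rough illustration of what comparing gender-specific single-year age distributions could look like, the sketch below builds a normalized age profile for one sex and measures how far apart two profiles are. The total variation distance is an assumption chosen for simplicity, not the comparison method used in the study, and the case records are made up.

```python
from collections import Counter


def age_profile(cases, sex, max_age=100):
    """Proportion of cases at each single year of age (0..max_age) for one sex."""
    counts = Counter(min(age, max_age) for age, s in cases if s == sex)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * (max_age + 1)
    return [counts.get(a, 0) / total for a in range(max_age + 1)]


def total_variation(p, q):
    """Half the L1 distance between two profiles: 0 = identical, 1 = disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Made-up (age, sex) case records for two seasons.
summer = [(2, "F"), (3, "F"), (3, "F"), (30, "F"), (4, "M")]
winter = [(30, "F"), (31, "F"), (3, "F"), (65, "F"), (40, "M")]

distance = total_variation(age_profile(summer, "F"), age_profile(winter, "F"))
```

A distance near zero would suggest the two seasons draw cases from a similar female age profile, while a larger value points to a shift in who is being reported.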

Description

Public health officials are now receiving more data than ever in electronic formats, and also stand to benefit more than ever from ongoing advances in the medical and epidemiological sciences. At the same time, this growing body of knowledge as well as volatile world events present an increasingly complex set of threats to population health. As a consequence, public health officials are finding that they need to ask many more, and more complex, questions of their data in order to keep sight of the state of the public’s health. Most current disease surveillance systems enable users to ask many different questions of health data, but are limited in that users can only extract results one question, or query, at a time.



Objective

Develop an Automated Data Query tool to allow public health officials to easily extract batches of raw medical encounter data using custom queries that the officials themselves set up. Additionally, the tool shall be capable of running anomaly detection algorithms against the raw data and returning the statistics. Users shall be able to perform their own analyses on the data and/or the statistical results after using the tool to collect the information efficiently. The tool will help them spot trends of interest that may be specific to their own jurisdictions.
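The idea of running a batch of user-defined queries in one pass and then applying a detection statistic to each result can be sketched as follows. This is a minimal illustration, not the tool itself: the record fields (`day`, `complaint`, `age`), the query names, and the simple z-score rule standing in for an anomaly detection algorithm are all assumptions.

```python
from collections import Counter
from statistics import mean, stdev


def run_batch(records, queries):
    """Run every named query in a single pass; return {name: Counter(day -> count)}."""
    results = {name: Counter() for name in queries}
    for rec in records:
        for name, predicate in queries.items():
            if predicate(rec):
                results[name][rec["day"]] += 1
    return results


def z_score(daily_counts, baseline_days, target_day):
    """Standardized excess of the target day's count over the baseline days."""
    baseline = [daily_counts.get(d, 0) for d in baseline_days]
    sd = stdev(baseline)
    if sd == 0:
        return 0.0
    return (daily_counts.get(target_day, 0) - mean(baseline)) / sd


# Made-up encounter records: fever visits per day plus one injury visit per day.
records = []
for day, n_fever in enumerate([2, 3, 2, 3, 8], start=1):
    records += [{"day": day, "complaint": "fever and cough", "age": 40}
                for _ in range(n_fever)]
    records.append({"day": day, "complaint": "ankle injury", "age": 7})

# Hypothetical custom queries a user might set up.
queries = {
    "fever": lambda r: "fever" in r["complaint"],
    "pediatric": lambda r: r["age"] < 18,
}

results = run_batch(records, queries)
fever_z = z_score(results["fever"], baseline_days=[1, 2, 3, 4], target_day=5)
```

Both the raw daily counts in `results` and the derived statistic remain available, so a user could carry either into their own analysis.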

Description

Evidence suggests that transmission within the workplace contributes significantly to the magnitude of a pandemic flu epidemic. A significant number of large organizations have a pandemic plan in place which may help in controlling this manner of transmission. These plans typically include telecommuting and other measures to reduce the need to physically commute to the workplace. Good data are needed in order to obtain valid results from simulation models and to be able to assess the effect of reductions in commuting.

 

Objective

The objective of this study was to explore employment and commuting data from different sources, using statistical techniques together with input from geographical experts. The resulting information is intended to help modelers improve the employment and commuting component of their models, to reveal potential issues with these data, and to identify problem areas where further investigation is needed.

Description



SaTScan is freely available software that uses the scan statistic to detect clusters in space, time, or space-time. SaTScan uses Monte Carlo hypothesis testing to produce a p-value for the null hypothesis that no clusters are present. Monte Carlo hypothesis testing can be a powerful tool when asymptotic theoretical distributions are inconvenient or impossible to derive; the main drawback of this approach is that precision for small p-values can only be obtained by greatly increasing the number of Monte Carlo replications, which is both computer-intensive and time-consuming. Depending on the type of analysis being done, the number of geographical areas included, the amount of historical data, and the number of Monte Carlo replications, SaTScan can take anywhere from seconds to hours to run. In daily surveillance of many syndromes, we need to limit the time it takes to generate each p-value while still retaining enough precision in the p-value to determine how unusual a cluster is. Since the type of analysis done and the geographic regions being used cannot be changed in most cases, we focus here on reducing the number of Monte Carlo replicates needed.

 

Objective

Our goal was to increase the precision of the p-value produced from SaTScan while reducing the amount of CPU time needed by decreasing the number of Monte Carlo replicates.
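To make the precision limitation concrete, the sketch below computes a standard rank-based Monte Carlo p-value, (1 + number of replicates at least as extreme as the observed value) / (1 + number of replicates). It does not reproduce SaTScan or the replicate-reduction method studied here: the null statistic is a toy stand-in (largest area count under uniform allocation), and all function names and parameter values are assumptions.

```python
import random


def monte_carlo_p_value(observed_stat, simulate_null_stat, n_replicates, rng):
    """Rank-based Monte Carlo p-value: (1 + #replicates >= observed) / (1 + n_replicates)."""
    exceed = sum(
        1 for _ in range(n_replicates) if simulate_null_stat(rng) >= observed_stat
    )
    return (exceed + 1) / (n_replicates + 1)


def max_count_under_null(rng, n_areas=20, n_cases=100):
    """Toy null statistic (not the scan statistic): largest area count when
    cases are allocated to areas uniformly at random."""
    counts = [0] * n_areas
    for _ in range(n_cases):
        counts[rng.randrange(n_areas)] += 1
    return max(counts)


rng = random.Random(0)
p_999 = monte_carlo_p_value(15, max_count_under_null, 999, rng)

# With N replicates the p-value can never be smaller than 1 / (N + 1).
smallest_with_999 = 1 / (999 + 1)
smallest_with_99 = 1 / (99 + 1)
```

Because the smallest attainable p-value is 1 / (N + 1), distinguishing a cluster at the 0.001 level from one at the 0.01 level by this formula alone requires roughly ten times as many replicates.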

Description

Many cities in the US and the Centers for Disease Control and Prevention have deployed biosurveillance systems to monitor regional health status. Biosurveillance systems rely on algorithms that analyze data in the temporal domain (e.g., CUSUM) and/or the spatial domain (e.g., SaTScan). Spatial domain-based algorithms often require population information to normalize the counts (e.g., emergency department visits) within a geographic region. This paper presents a new algorithm, Ellipse-based Clustering Analysis (ECA), that analyzes data in both the temporal and spatial domains, using time series analysis to identify zip codes with abnormal counts and pattern recognition methods to identify spatial clusters.

 

Objective

This paper describes a new clustering algorithm, ECA, which uses a time series algorithm to identify zip codes with abnormal counts and a pattern recognition method to identify elliptical spatial clusters. Using ellipses could help detect elongated clusters resulting from wind dispersion of bio-agents. We applied ECA to over-the-counter medicine sales. The pilot study demonstrated the potential use of the algorithm in detecting clustered outbreak regions that could be associated with an aerosol release of bio-agents.
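One simple way to summarize a set of flagged zip codes as an ellipse is to take the principal axes of the covariance of their centroids. The abstract does not specify ECA's pattern recognition method, so the sketch below is an assumed stand-in for illustration only, and the centroid coordinates are made up.

```python
import math


def covariance_ellipse(points):
    """Return (center, semi_major, semi_minor, angle_radians) of the
    one-standard-deviation covariance ellipse of 2-D points."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points) / n
    syy = sum((y - cy) ** 2 for _, y in points) / n
    sxy = sum((x - cx) * (y - cy) for x, y in points) / n
    # Eigenvalues of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam_major = half_trace + spread
    lam_minor = max(half_trace - spread, 0.0)
    # Orientation of the major axis, measured from the x-axis.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (cx, cy), math.sqrt(lam_major), math.sqrt(lam_minor), angle


# Made-up centroids of zip codes already flagged by a temporal algorithm,
# strung out roughly along the x-axis as a wind-driven plume might be.
flagged_centroids = [(0, 0), (1, 0.1), (2, -0.1), (3, 0.1), (4, 0)]
center, semi_major, semi_minor, angle = covariance_ellipse(flagged_centroids)
elongation = semi_major / semi_minor
```

A large `elongation` ratio indicates a stretched cluster, and `angle` gives its direction, which could then be compared against wind data.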

Description

The effectiveness of public health interventions during a disease outbreak depends on rapid, accurate characterization of the initial outbreak and spread of the pathogen. Computer-based simulation using mathematical models provides a means to characterize both and enables practitioners to test intervention strategies. While compartmental differential equation models can be used to represent epidemics, they are unsuitable for early time simulations (first few days) when a small number of people are infected (and even fewer symptomatic), nor are they capable of representing spatial disease spread. Numerous models for disease propagation have been explored, including national scale network models for influenza and social network-based and probabilistic models for smallpox. To be useful in a public health context, a model for disease propagation should be efficient (e.g., simulating several weeks of real time in an hour) and flexible enough to simultaneously represent multiple diseases and attack scenarios.

 

Objective

This paper describes biologically-based mathematical models and efficient methods for early epoch simulation of disease outbreaks and bioterror attacks.
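The point that deterministic compartmental models break down when only a handful of people are infected can be illustrated with a small stochastic simulation. The chain-binomial SIR model below is a generic textbook-style stand-in, not the models described in this paper, and the population size and rate parameters are arbitrary assumptions.

```python
import random


def chain_binomial_sir(s, i, r, beta, gamma, days, rng):
    """Simulate a discrete-time stochastic SIR epidemic; return daily (S, I, R)."""
    n = s + i + r
    history = [(s, i, r)]
    for _ in range(days):
        # Probability that a given susceptible is infected today by at least
        # one of the i infectious people.
        p_infect = 1.0 - (1.0 - beta / n) ** i
        new_infections = sum(1 for _ in range(s) if rng.random() < p_infect)
        new_recoveries = sum(1 for _ in range(i) if rng.random() < gamma)
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


rng = random.Random(42)
runs = [
    chain_binomial_sir(499, 1, 0, beta=0.5, gamma=0.25, days=60, rng=rng)
    for _ in range(100)
]
# Count runs in which fewer than 10 people were ever infected.
small_outbreaks = sum(1 for h in runs if 500 - h[-1][0] < 10)
```

Starting from a single case, some runs fade out after a few infections while others grow into large outbreaks. A deterministic differential equation model with the same parameters would produce only one trajectory and cannot represent that early variability.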

Description

An important problem in biosurveillance is the early detection and characterization of outdoor aerosol releases of B. anthracis. The Bayesian Aerosol Release Detector (BARD) is a system for simulating, detecting and characterizing such releases. BARD integrates the analysis of medical surveillance data and meteorological data. The existing version of BARD does not account for the fact that many people might be exposed at a location other than their residence due to mobility. Incorporation of a mobility model in biosurveillance has been investigated by several other researchers. In this paper, we describe a refined version of the BARD simulation algorithm which incorporates a model of work-related mobility and report the results of an experiment to measure the effect of this refinement.

 

Objective 

To refine the simulation algorithm used in BARD so that it takes work-related mobility into account, and to compare the refined simulator with the existing one.

Description

Syndromic surveillance is focused upon organizing data into categories to detect medium to large scale clusters of illness. Detection often requires that a critical threshold be surpassed. Data mining searches through data to identify records containing keywords. New Hampshire has combined data mining with syndromic surveillance since January 2003 to improve detection capacity.

 

Objective

1. Understand the principles behind the use of syndromic surveillance and data mining.
2. Understand how New Hampshire's unique approach combining data mining with syndromic surveillance has enhanced disease surveillance efforts.
3. Describe the steps and code necessary to implement and enhance data mining.
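Keyword-based data mining of free-text records, as described above, can be sketched with a case-insensitive whole-word search. This is not New Hampshire's code: the keyword list, record format, and example complaints are hypothetical.

```python
import re


def find_keyword_records(records, keywords):
    """Return (record_id, matched_keywords) for every record whose free text
    contains at least one keyword as a whole word, ignoring case."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for rec_id, text in records:
        matched = sorted({m.lower() for m in pattern.findall(text)})
        if matched:
            hits.append((rec_id, matched))
    return hits


# Hypothetical keywords of interest and made-up chief complaints.
keywords = ["anthrax", "botulism", "meningitis"]
records = [
    (1, "Fever, stiff neck, r/o MENINGITIS"),
    (2, "ankle sprain"),
    (3, "possible botulism exposure; botulism?"),
]
hits = find_keyword_records(records, keywords)
```

Unlike a syndrome count that must exceed a threshold, every single matching record is returned here, so one mention of a keyword is enough to bring a record to an analyst's attention.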
