
Data Analytics

Description

In Montreal, notifiable diseases are reported to the Public Health Department (PHD). Of 44,250 disease notifications received in 2009, up to 25% had potential address errors. These can be introduced during transcription, handwriting interpretation and typing at various stages of the process, from patients, labs and/or physicians, and at the PHD. Reports received by the PHD are entered manually (initial entry) into a database. Archive personnel attempt to correct omissions by calling reporting laboratories or physicians. Investigators verify real addresses with patients or physicians for investigated episodes (40–60%).

The Dracones qualite (DQ) address verification algorithm compares the number, street and postal code against the 2009 Canada Post database. If the reported address is not consistent with a valid address in the Canada Post database, DQ suggests a valid alternative address.
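The kind of check DQ performs can be sketched as follows. This is a minimal illustration, not the actual DQ implementation: the tiny reference table stands in for the 2009 Canada Post database, and the `difflib`-based similarity matching is an assumption about how an alternative address might be suggested.

```python
# Illustrative DQ-style check: compare a reported address against a
# reference of valid (number, street, postal code) records and, on
# mismatch, suggest the closest valid alternative. The reference table
# and the matching heuristic are assumptions, not the actual DQ code.
import difflib

# Stand-in for the 2009 Canada Post reference database.
VALID_ADDRESSES = [
    ("1025", "RUE SAINT-DENIS", "H2X3J3"),
    ("1025", "RUE SAINT-URBAIN", "H2Z1Y6"),
    ("3840", "RUE SAINT-URBAIN", "H2W1T8"),
]

def verify(number, street, postal_code):
    """Return (is_valid, suggestion); suggestion is None for valid addresses."""
    record = (number, street.upper(), postal_code.replace(" ", "").upper())
    if record in VALID_ADDRESSES:
        return True, None
    # Suggest the valid record whose concatenated form is most similar.
    key = " ".join(record)
    candidates = {" ".join(r): r for r in VALID_ADDRESSES}
    best = difflib.get_close_matches(key, candidates, n=1, cutoff=0.0)
    return False, candidates[best[0]]

print(verify("1025", "rue Saint-Denis", "H2X 3J3"))  # consistent with reference
print(verify("1025", "rue Sain-Denis", "H2X 3J3"))   # typo, suggestion returned
```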

 

Objective

To (1) validate DQ, which was developed to improve data quality for public health mapping, and (2) identify the origin of address errors.

Submitted by hparton on
Description

Real-world public health data often present numerous challenges: limited background data, data dropouts, noise, and human error. The data from an emergency department (ED) in Urbana, IL include a diagnosis field with multiple terms and notes separated by semicolons; there are over 7,000 distinct terms, excluding the notes. Because the data begin in April 2009, there is not yet adequate background data to use some of the regression-based alerting algorithms. Values for some days are missing, so we also needed an algorithm that would tolerate data dropouts.
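Parsing a semicolon-separated diagnosis field of this kind might look like the sketch below. The sample records and the heuristic for separating short coded terms from free-text notes (here, a word-count threshold) are illustrative assumptions, not the system's actual cleaning rules.

```python
# Sketch of normalizing an ED diagnosis field where terms and free-text
# notes share one string, separated by semicolons. Records and the
# note-dropping heuristic below are invented for illustration.
from collections import Counter

records = [
    "FEVER; COUGH; pt states symptoms began 3 days ago",
    "COUGH; INFLUENZA LIKE ILLNESS",
    "fever; VOMITING",
]

def extract_terms(diagnosis_field, max_words=3):
    """Split on semicolons, trim, uppercase, and drop long free-text notes."""
    terms = []
    for part in diagnosis_field.split(";"):
        part = part.strip().upper()
        if part and len(part.split()) <= max_words:
            terms.append(part)
    return terms

term_counts = Counter(t for r in records for t in extract_terms(r))
print(term_counts.most_common())
```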

INDICATOR is a workflow-based biosurveillance system developed at the National Center for Supercomputing Applications. One of the fundamental concepts of INDICATOR is that the burden of cleaning and processing incoming data should be on the software, rather than on the health care providers.

 

Objective

This paper compares different approaches to classification and anomaly detection using data from an ED.

Submitted by hparton on
Description

Medication adherence studies typically use pharmacy-dispensing data to infer drug exposures. These studies often require calculations reflecting the intensity and duration of drug exposure. The typical approach to estimating duration of drug exposure is to use dispensing dates and day supply. Often, pharmacy databases have random and/or systematic errors causing improbable calculations. These errors become particularly problematic when estimating medication duration for drugs with complicated dosing schedules. Experts recommend cleaning data or removing erroneous data before analysis, but do not provide instructional guidelines. We developed an algorithmic approach to improve estimation of drug-course duration, dosing and medication possession ratios (MPRs). This study compares estimated MPRs produced by the standard method with MPRs produced by the algorithmic approach. Methotrexate was chosen as the first drug to implement the algorithm because of its widespread use for rheumatoid arthritis (RA) and its complex dosing schedules.
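An MPR calculation with a simple cleaning step can be sketched as below. The dispensing records, the plausible day-supply range, and the imputation rule (replacing an implausible value with the median of the patient's other fills) are illustrative assumptions, not the published algorithm.

```python
# Sketch of a medication possession ratio (MPR) with cleaning of
# improbable day-supply values before summing. Bounds and the median
# imputation rule are assumptions for illustration.
from statistics import median

def mpr(dispensings, observation_days, plausible=(7, 90)):
    """dispensings: list of day-supply values for one patient's fills."""
    lo, hi = plausible
    clean = [d for d in dispensings if lo <= d <= hi]
    fill = median(clean) if clean else 0
    adjusted = [d if lo <= d <= hi else fill for d in dispensings]
    return min(sum(adjusted) / observation_days, 1.0)

# A 999-day entry (likely a data-entry error) is replaced before summing;
# the standard method would instead cap the inflated ratio at 1.0.
print(mpr([30, 30, 999, 30], observation_days=180))
```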

Submitted by teresa.hamby@d… on
Description

Consider the most likely disease cluster produced by any given method, such as SaTScan, for the detection and inference of spatial clusters in a map divided into areas. If this cluster is found to be statistically significant, what can be said of the external areas adjacent to the cluster? Do we have enough information to exclude them from a health program of prevention? Do all the areas inside the cluster have the same importance from a practitioner's perspective? How can we quantitatively assess the risk of those regions, given that the information we have (case counts) is also subject to variation in our statistical modeling? A few papers have tackled these questions recently; one produces confidence intervals for the risk in every area, which are compared with the risks inside the most likely cluster. There is a growing demand for interactive software for the visualization of spatial clusters. A technique was developed to visualize relative risk and statistical significance simultaneously.

Objective

Given an aggregated-area map with disease case data, we propose a criterion to measure the plausibility of each area in the map being part of a possible localized anomaly.

Submitted by uysz on
Description

The multivariate Bayesian scan statistic (MBSS) enables timely detection and characterization of emerging events by integrating multiple data streams. MBSS can model and differentiate between multiple event types: it uses Bayes’ Theorem to compute the posterior probability that each event type Ek has affected each space-time region S. Results are visualized using a ‘posterior probability map’ showing the total probability that each location has been affected. Although the original MBSS method assumes a uniform prior over circular regions, and thus loses power to detect elongated and irregular clusters, our Fast Subset Sums (FSS) method assumes a hierarchical prior, which assigns non-zero prior probabilities to every subset of locations, substantially improving detection power and accuracy for irregular regions.
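The posterior computation MBSS performs can be illustrated with a toy example: multiply a data likelihood by a region prior for each (event type, region) hypothesis, normalize against the null, and sum posteriors per location to form the probability map. The likelihood and prior values below are made-up placeholders, not outputs of the actual MBSS models.

```python
# Toy sketch of the MBSS posterior probability map. Each hypothesis is a
# (event type, region) pair; posteriors are obtained via Bayes' Theorem
# and summed per location. All numeric inputs are invented placeholders.
def posterior_map(likelihoods, priors, null_likelihood, null_prior):
    """likelihoods/priors: dicts keyed by (event_type, region_tuple)."""
    joint = {k: likelihoods[k] * priors[k] for k in likelihoods}
    total = sum(joint.values()) + null_likelihood * null_prior
    post = {k: v / total for k, v in joint.items()}
    # Total probability that each location has been affected.
    location_prob = {}
    for (event, region), p in post.items():
        for loc in region:
            location_prob[loc] = location_prob.get(loc, 0.0) + p
    return post, location_prob

likelihoods = {("flu", ("A", "B")): 4.0, ("flu", ("B",)): 2.0}
priors = {("flu", ("A", "B")): 0.05, ("flu", ("B",)): 0.05}
post, loc_map = posterior_map(likelihoods, priors,
                              null_likelihood=1.0, null_prior=0.9)
print(loc_map)  # location B accumulates probability from both regions
```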

Objective

We propose a new, computationally efficient Bayesian method for detection and visualization of irregularly shaped clusters. This Generalized Fast Subset Sums (GFSS) method extends our recently proposed MBSS and FSS approaches, and substantially improves timeliness and accuracy of event detection.

Submitted by teresa.hamby@d… on
Description

It has been suggested that changes in various socioeconomic, environmental and biological factors have been drivers of emerging and reemerging infectious disease, although few have assessed these relationships on a global scale. Understanding these associations could help build better forecasting models, and therefore identify high-priority regions for public health and surveillance implementation. Although infectious disease surveillance and research have tended to be concentrated in wealthier, developed countries in North America, Europe and Australia, it is developing countries that have been predicted to be the next hotspots for emerging infectious diseases.

Objective

To evaluate the association between socioeconomic factors and infectious disease outbreaks, to develop a prediction model for where future outbreaks would be most likely to occur worldwide, and to identify priority countries for surveillance capacity building.

Submitted by teresa.hamby@d… on
Description

Accurately assigning causes or contributing causes to deaths remains a universal challenge, especially in the elderly with underlying disease. Cause of death statistics commonly record the underlying cause of death, and influenza deaths in winter are often attributed to underlying circulatory disorders. Estimating the number of deaths attributable to influenza is, therefore, usually performed using statistical models. These regression models (usually linear or Poisson regression) are flexible and can be built to incorporate, in addition to influenza virus activity, trends such as surveillance data on other viruses, bacteria, pure seasonal trends and temperature trends.
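The attributable-mortality idea can be sketched with the simpler of the two model families mentioned, linear regression: fit weekly deaths on an influenza-activity covariate, then sum the model's influenza term. The weekly series below are invented, and a real model would also include seasonal terms, temperature, and other pathogens.

```python
# Minimal sketch of regression-based attributable mortality using
# ordinary least squares on one covariate. Data are invented; real
# models include additional trend and pathogen terms.
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
         sum((xi - mx) ** 2 for xi in x)
    return my - b1 * mx, b1

flu_activity = [0, 0, 5, 20, 40, 20, 5, 0]       # e.g. weekly % positive tests
deaths       = [100, 98, 110, 140, 180, 142, 108, 101]

b0, b1 = fit_line(flu_activity, deaths)
attributable = b1 * sum(flu_activity)            # deaths explained by flu term
print(round(b1, 2), round(attributable, 1))
```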

 

Objective

Mortality exhibits clear seasonality mainly caused by an increase in deaths in the elderly in winter. As there may be substantial hidden mortality for a number of common pathogens, we estimated the number of elderly deaths attributable to common seasonal viruses and bacteria for which robust weekly laboratory surveillance data were available.

Submitted by hparton on
Description

Our laboratory previously established the value of over-the-counter (OTC) sales data for the early detection of disease outbreaks. We found that thermometer sales (TS) increased significantly and early during influenza (flu) season. Recently, the 2009 H1N1 outbreak has highlighted the need for developing methods that not only detect an outbreak but also estimate incidence so that public-health decision makers can allocate appropriate resources in response to an outbreak. Although a few studies have tried to estimate the H1N1 incidence in the 2009 outbreak, these were done months afterward and were based on data that are either not easy to collect or not available in a timely fashion (for example, surveys or confirmed laboratory cases).

Here, we explore the hypothesis that OTC sales data can also be used to predict disease activity. Towards that end, we developed a model to predict the number of Emergency Department (ED) flu cases in a region based on TS. We obtain sales information from the National Retail Data Monitor (NRDM) project, which collects daily sales data for 18 OTC categories across the US.
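One simple form such a model could take is a proportional fit: estimate a per-region scaling factor from a past season in which both thermometer sales and ED flu counts are known, then apply it to new sales data. The weekly series and the no-intercept proportional model are illustrative assumptions; the abstract does not specify the model's actual form.

```python
# Sketch of predicting ED flu cases from thermometer sales (TS) via a
# least-squares proportional fit. Historical counts are invented.
def fit_scale(ts_history, ed_history):
    """Least-squares scale k for ed ≈ k * ts (no intercept)."""
    return sum(t * e for t, e in zip(ts_history, ed_history)) / \
           sum(t * t for t in ts_history)

# Hypothetical weekly counts from a previous flu season.
ts_history = [50, 80, 200, 400, 300, 120]
ed_history = [10, 18, 45, 95, 70, 26]

k = fit_scale(ts_history, ed_history)
predicted_ed = [round(k * ts) for ts in [60, 250]]  # new weeks' TS
print(predicted_ed)
```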

 

Objective

We developed a model that predicts the incidence of flu cases presenting to the ED in a given region based on TS.

Submitted by hparton on
Description

The spatial scan statistic detects significant spatial clusters of disease by maximizing a likelihood ratio statistic over a large set of spatial regions. Several recent approaches have extended spatial scan to multiple data streams. Burkom aggregates actual and expected counts across streams and applies the univariate scan statistic, thus assuming a constant risk for the affected streams. Kulldorff et al. separately apply the univariate statistic to each stream and then aggregate scores across streams, thus assuming independent risks for each affected stream. Neill proposes a ‘fast subset scan’ approach, which maximizes the scan statistic over proximity-constrained subsets of locations, improving the timeliness of detection for irregularly shaped clusters. In the univariate event detection setting, many commonly used scan statistics satisfy the ‘linear-time subset scanning’ (LTSS) property, enabling exact and efficient detection of the highest-scoring space-time clusters.
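The LTSS property can be illustrated for the expectation-based Poisson scan statistic: sorting locations by the ratio of observed count to baseline guarantees that the highest-scoring subset is one of the n "prefix" subsets of that ordering, so only n subsets need evaluation instead of 2^n. The counts and baselines below are invented; this sketch omits the space-time and proximity-constraint machinery of the full method.

```python
# Sketch of linear-time subset scanning (LTSS) with the expectation-based
# Poisson statistic. Sort by c_i/b_i, then score only prefix subsets.
import math

def poisson_score(c, b):
    """Expectation-based Poisson log-likelihood ratio for totals c, b."""
    return c * math.log(c / b) + b - c if c > b else 0.0

def fast_subset_scan(counts, baselines):
    order = sorted(range(len(counts)),
                   key=lambda i: counts[i] / baselines[i], reverse=True)
    best_score, best_subset = 0.0, []
    c = b = 0.0
    for k, i in enumerate(order):          # evaluate the n prefix subsets
        c += counts[i]
        b += baselines[i]
        s = poisson_score(c, b)
        if s > best_score:
            best_score, best_subset = s, sorted(order[:k + 1])
    return best_score, best_subset

counts    = [20, 5, 30, 4]   # invented observed counts per location
baselines = [10, 6, 12, 5]   # invented expected counts
score, subset = fast_subset_scan(counts, baselines)
print(subset)  # the locations with elevated counts
```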

Objective

We extend the recently proposed ‘fast subset scan’ framework from univariate to multivariate data, enabling computationally efficient detection of irregular space-time clusters even when the numbers of spatial locations and data streams are large. These fast algorithms enable us to perform a detailed empirical comparison of two variants of the multivariate spatial scan statistic, demonstrating the tradeoffs between detection power and characterization accuracy.

Submitted by teresa.hamby@d… on
Description

Syndromic surveillance typically involves collecting time-stamped transactional data, such as patient triage or examination records or pharmacy sales. Such records usually span multiple categorical features, such as location, age group, gender, symptoms, chief complaints, drug category and so on. The key analytic objective is to identify potential disease clusters in recently observed data (for example, from the last week) as compared with a baseline (for example, derived from data observed over the previous few months). In real-world scenarios, a disease outbreak can affect any subset of categorical dimensions and any subset of values along each categorical dimension. As evaluating all possible outbreak hypotheses can be computationally challenging, popular state-of-the-art algorithms either limit the scope of the search to exclusively conjunctive definitions or focus only on detecting spatially co-located clusters. Further, it is common to see multiple disease outbreaks happening simultaneously and affecting overlapping subsets of dimensions and values. Most such algorithms focus on finding just the single most significant anomalous cluster corresponding to a possible disease outbreak, ignoring the possibility of a concurrent emergence of additional clusters.
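To see why the hypothesis space explodes, consider scanning just one categorical dimension: every subset of its values is a candidate cluster. The toy example below enumerates all subsets exhaustively, which is feasible only because the dimension is tiny; algorithms like DAD exist precisely because exhaustive enumeration across multiple dimensions does not scale. The counts and the expectation-based Poisson score are invented for illustration.

```python
# Toy illustration of subset-of-values outbreak hypotheses along one
# categorical dimension (age group), scored against baseline counts.
# Exhaustive enumeration is shown only for intuition; it is exponential
# in the number of values. All counts are invented.
from itertools import combinations
import math

recent   = {"child": 40, "adult": 22, "senior": 35}
baseline = {"child": 20, "adult": 21, "senior": 18}

def score(values):
    c = sum(recent[v] for v in values)
    b = sum(baseline[v] for v in values)
    return c * math.log(c / b) + b - c if c > b else 0.0

best = max(
    (frozenset(s) for r in range(1, len(recent) + 1)
     for s in combinations(recent, r)),
    key=score,
)
print(sorted(best))  # the subset of values with the most anomalous counts
```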

 

Objective

We present Disjunctive Anomaly Detection (DAD), a novel algorithm to detect multiple overlapping anomalous clusters in large sets of categorical time series data. We compare the performance of DAD and What’s Strange About Recent Events on disease surveillance data from the Sri Lanka Ministry of Health.

Submitted by hparton on