
Data Analytics

Description

In Montreal, notifiable diseases are reported to the Public Health Department (PHD). Of 44,250 disease notifications received in 2009, up to 25% had potential address errors. These can be introduced during transcription, handwriting interpretation and typing at various stages of the process, from patients, labs and/or physicians, and at the PHD. Reports received by the PHD are entered manually (initial entry) into a database. Archive personnel attempt to correct omissions by calling reporting laboratories or physicians. Investigators verify real addresses with patients or physicians for investigated episodes (40–60%).

The Dracones qualite (DQ) address verification algorithm compares the number, street and postal code against the 2009 Canada Post database. If the reported address is not consistent with a valid address in the Canada Post database, DQ suggests a valid alternative address.
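The kind of check DQ performs can be sketched as follows. This is a minimal illustration, not the actual DQ implementation: the tiny reference table stands in for the 2009 Canada Post database, and the `difflib`-based similarity matching is an assumption about how an alternative address might be suggested.

```python
# Illustrative DQ-style check: compare a reported address against a
# reference of valid (number, street, postal code) records and, on
# mismatch, suggest the closest valid alternative. The reference table
# and the matching heuristic are assumptions, not the actual DQ code.
import difflib

# Stand-in for the 2009 Canada Post reference database.
VALID_ADDRESSES = [
    ("1025", "RUE SAINT-DENIS", "H2X3J3"),
    ("1025", "RUE SAINT-URBAIN", "H2Z1Y6"),
    ("3840", "RUE SAINT-URBAIN", "H2W1T8"),
]

def verify(number, street, postal_code):
    """Return (is_valid, suggestion); suggestion is None for valid addresses."""
    record = (number, street.upper(), postal_code.replace(" ", "").upper())
    if record in VALID_ADDRESSES:
        return True, None
    # Suggest the valid record whose concatenated form is most similar.
    key = " ".join(record)
    candidates = {" ".join(r): r for r in VALID_ADDRESSES}
    best = difflib.get_close_matches(key, candidates, n=1, cutoff=0.0)
    return False, candidates[best[0]]

print(verify("1025", "rue Saint-Denis", "H2X 3J3"))  # consistent with reference
print(verify("1025", "rue Sain-Denis", "H2X 3J3"))   # typo, suggestion returned
```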

 

Objective

To (1) validate DQ, which was developed to improve data quality for public health mapping, and (2) identify the origin of address errors.

Submitted by hparton on
Description

Real-world public health data often present numerous challenges: limited background data, data dropouts, noise, and human error. The data from an emergency department (ED) in Urbana, IL include a diagnosis field with multiple terms and notes separated by semicolons; there are over 7,000 distinct terms, excluding the notes. Because the data begin in April 2009, there is not yet adequate background data to use some of the regression-based alerting algorithms. Values for some days are missing, so we also needed an algorithm that would tolerate data dropouts.
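Parsing a semicolon-separated diagnosis field of this kind might look like the sketch below. The sample records and the heuristic for separating short coded terms from free-text notes (here, a word-count threshold) are illustrative assumptions, not the system's actual cleaning rules.

```python
# Sketch of normalizing an ED diagnosis field where terms and free-text
# notes share one string, separated by semicolons. Records and the
# note-dropping heuristic below are invented for illustration.
from collections import Counter

records = [
    "FEVER; COUGH; pt states symptoms began 3 days ago",
    "COUGH; INFLUENZA LIKE ILLNESS",
    "fever; VOMITING",
]

def extract_terms(diagnosis_field, max_words=3):
    """Split on semicolons, trim, uppercase, and drop long free-text notes."""
    terms = []
    for part in diagnosis_field.split(";"):
        part = part.strip().upper()
        if part and len(part.split()) <= max_words:
            terms.append(part)
    return terms

term_counts = Counter(t for r in records for t in extract_terms(r))
print(term_counts.most_common())
```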

INDICATOR is a workflow-based biosurveillance system developed at the National Center for Supercomputing Applications. One of the fundamental concepts of INDICATOR is that the burden of cleaning and processing incoming data should be on the software, rather than on the health care providers.

 

Objective

This paper compares different approaches to classification and anomaly detection using data from an ED.

Submitted by hparton on
Description

Medication adherence studies typically use pharmacy-dispensing data to infer drug exposures. These studies often require calculations reflecting the intensity and duration of drug exposure. The typical approach to estimating duration of drug exposure is to use dispensing dates and day supply. Often, pharmacy databases have random and/or systematic errors causing improbable calculations. These errors become particularly problematic when estimating medication duration for drugs with complicated dosing schedules. Experts recommend cleaning data or removing erroneous data before analysis, but do not provide instructional guidelines. We developed an algorithmic approach to improve estimation of drug-course duration, dosing and medication possession ratios (MPRs). This study compares estimated MPRs produced by the standard method with MPRs produced by the algorithmic approach. Methotrexate was chosen as the first drug to implement the algorithm because of its widespread use for rheumatoid arthritis (RA) and its complex dosing schedules.
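An MPR calculation with a simple cleaning step can be sketched as below. The dispensing records, the plausible day-supply range, and the imputation rule (replacing an implausible value with the median of the patient's other fills) are illustrative assumptions, not the published algorithm.

```python
# Sketch of a medication possession ratio (MPR) with cleaning of
# improbable day-supply values before summing. Bounds and the median
# imputation rule are assumptions for illustration.
from statistics import median

def mpr(dispensings, observation_days, plausible=(7, 90)):
    """dispensings: list of day-supply values for one patient's fills."""
    lo, hi = plausible
    clean = [d for d in dispensings if lo <= d <= hi]
    fill = median(clean) if clean else 0
    adjusted = [d if lo <= d <= hi else fill for d in dispensings]
    return min(sum(adjusted) / observation_days, 1.0)

# A 999-day entry (likely a data-entry error) is replaced before summing;
# the standard method would instead cap the inflated ratio at 1.0.
print(mpr([30, 30, 999, 30], observation_days=180))
```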

Submitted by teresa.hamby@d… on
Description

Consider the most likely disease cluster produced by any given method, such as SaTScan, for the detection and inference of spatial clusters in a map divided into areas. If this cluster is found to be statistically significant, what can be said of the external areas adjacent to the cluster? Do we have enough information to exclude them from a health program of prevention? Do all the areas inside the cluster have the same importance from a practitioner's perspective? How can we quantitatively assess the risk of those regions, given that the information we have (case counts) is also subject to variation in our statistical modeling? A few papers have tackled these questions recently; one produces confidence intervals for the risk in every area, which are compared with the risks inside the most likely cluster. There is a growing demand for interactive software for the visualization of spatial clusters. A technique was developed to visualize relative risk and statistical significance simultaneously.

Objective

Given an aggregated-area map with disease case data, we propose a criterion to measure the plausibility of each area in the map being part of a possible localized anomaly.

Submitted by uysz on
Description

The multivariate Bayesian scan statistic (MBSS) enables timely detection and characterization of emerging events by integrating multiple data streams. MBSS can model and differentiate between multiple event types: it uses Bayes’ Theorem to compute the posterior probability that each event type Ek has affected each space-time region S. Results are visualized using a ‘posterior probability map’ showing the total probability that each location has been affected. Although the original MBSS method assumes a uniform prior over circular regions, and thus loses power to detect elongated and irregular clusters, our Fast Subset Sums (FSS) method assumes a hierarchical prior, which assigns non-zero prior probabilities to every subset of locations, substantially improving detection power and accuracy for irregular regions.
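The posterior computation MBSS performs can be illustrated with a toy example: multiply a data likelihood by a region prior for each (event type, region) hypothesis, normalize against the null, and sum posteriors per location to form the probability map. The likelihood and prior values below are made-up placeholders, not outputs of the actual MBSS models.

```python
# Toy sketch of the MBSS posterior probability map. Each hypothesis is a
# (event type, region) pair; posteriors are obtained via Bayes' Theorem
# and summed per location. All numeric inputs are invented placeholders.
def posterior_map(likelihoods, priors, null_likelihood, null_prior):
    """likelihoods/priors: dicts keyed by (event_type, region_tuple)."""
    joint = {k: likelihoods[k] * priors[k] for k in likelihoods}
    total = sum(joint.values()) + null_likelihood * null_prior
    post = {k: v / total for k, v in joint.items()}
    # Total probability that each location has been affected.
    location_prob = {}
    for (event, region), p in post.items():
        for loc in region:
            location_prob[loc] = location_prob.get(loc, 0.0) + p
    return post, location_prob

likelihoods = {("flu", ("A", "B")): 4.0, ("flu", ("B",)): 2.0}
priors = {("flu", ("A", "B")): 0.05, ("flu", ("B",)): 0.05}
post, loc_map = posterior_map(likelihoods, priors,
                              null_likelihood=1.0, null_prior=0.9)
print(loc_map)  # location B accumulates probability from both regions
```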

Objective

We propose a new, computationally efficient Bayesian method for detection and visualization of irregularly shaped clusters. This Generalized Fast Subset Sums (GFSS) method extends our recently proposed MBSS and FSS approaches, and substantially improves timeliness and accuracy of event detection.

Submitted by teresa.hamby@d… on
Description

It has been suggested that changes in various socioeconomic, environmental and biological factors have been drivers of emerging and reemerging infectious disease, although few have assessed these relationships on a global scale. Understanding these associations could help build better forecasting models, and therefore identify high-priority regions for public health and surveillance implementation. Although infectious disease surveillance and research have tended to be concentrated in wealthier, developed countries in North America, Europe and Australia, it is developing countries that have been predicted to be the next hotspots for emerging infectious diseases.

Objective

To evaluate the association between socioeconomic factors and infectious disease outbreaks, to develop a prediction model for where future outbreaks would be most likely to occur worldwide, and to identify priority countries for surveillance capacity building.

Submitted by teresa.hamby@d… on
Description

Accurately assigning causes or contributing causes to deaths remains a universal challenge, especially in the elderly with underlying disease. Cause of death statistics commonly record the underlying cause of death, and influenza deaths in winter are often attributed to underlying circulatory disorders. Estimating the number of deaths attributable to influenza is, therefore, usually performed using statistical models. These regression models (usually linear or Poisson regression) are flexible and can be built to incorporate, in addition to influenza virus activity, trends such as surveillance data on other viruses, bacteria, pure seasonal trends and temperature trends.
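The attributable-mortality idea can be sketched with the simpler of the two model families mentioned, linear regression: fit weekly deaths on an influenza-activity covariate, then sum the model's influenza term. The weekly series below are invented, and a real model would also include seasonal terms, temperature, and other pathogens.

```python
# Minimal sketch of regression-based attributable mortality using
# ordinary least squares on one covariate. Data are invented; real
# models include additional trend and pathogen terms.
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
         sum((xi - mx) ** 2 for xi in x)
    return my - b1 * mx, b1

flu_activity = [0, 0, 5, 20, 40, 20, 5, 0]       # e.g. weekly % positive tests
deaths       = [100, 98, 110, 140, 180, 142, 108, 101]

b0, b1 = fit_line(flu_activity, deaths)
attributable = b1 * sum(flu_activity)            # deaths explained by flu term
print(round(b1, 2), round(attributable, 1))
```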

 

Objective

Mortality exhibits clear seasonality mainly caused by an increase in deaths in the elderly in winter. As there may be substantial hidden mortality for a number of common pathogens, we estimated the number of elderly deaths attributable to common seasonal viruses and bacteria for which robust weekly laboratory surveillance data were available.

Submitted by hparton on
Description

Our laboratory previously established the value of over-the-counter (OTC) sales data for the early detection of disease outbreaks. We found that thermometer sales (TS) increased significantly and early during influenza (flu) season. Recently, the 2009 H1N1 outbreak has highlighted the need for developing methods that not only detect an outbreak but also estimate incidence so that public-health decision makers can allocate appropriate resources in response to an outbreak. Although a few studies have tried to estimate the H1N1 incidence in the 2009 outbreak, these were done months afterward and were based on data that are either not easy to collect or not available in a timely fashion (for example, surveys or confirmed laboratory cases).

Here, we explore the hypothesis that OTC sales data can also be used to predict disease activity. Towards that end, we developed a model to predict the number of Emergency Department (ED) flu cases in a region based on TS. We obtain sales information from the National Retail Data Monitor (NRDM) project, which collects daily sales data for 18 OTC categories across the US.
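One simple form such a model could take is a proportional fit: estimate a per-region scaling factor from a past season in which both thermometer sales and ED flu counts are known, then apply it to new sales data. The weekly series and the no-intercept proportional model are illustrative assumptions; the abstract does not specify the model's actual form.

```python
# Sketch of predicting ED flu cases from thermometer sales (TS) via a
# least-squares proportional fit. Historical counts are invented.
def fit_scale(ts_history, ed_history):
    """Least-squares scale k for ed ≈ k * ts (no intercept)."""
    return sum(t * e for t, e in zip(ts_history, ed_history)) / \
           sum(t * t for t in ts_history)

# Hypothetical weekly counts from a previous flu season.
ts_history = [50, 80, 200, 400, 300, 120]
ed_history = [10, 18, 45, 95, 70, 26]

k = fit_scale(ts_history, ed_history)
predicted_ed = [round(k * ts) for ts in [60, 250]]  # new weeks' TS
print(predicted_ed)
```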

 

Objective

We developed a model that predicts the incidence of flu cases presenting to the ED in a given region based on TS.

Submitted by hparton on
Description

The spatial scan statistic detects significant spatial clusters of disease by maximizing a likelihood ratio statistic over a large set of spatial regions. Several recent approaches have extended spatial scan to multiple data streams. Burkom aggregates actual and expected counts across streams and applies the univariate scan statistic, thus assuming a constant risk for the affected streams. Kulldorff et al. separately apply the univariate statistic to each stream and then aggregate scores across streams, thus assuming independent risks for each affected stream. Neill proposes a ‘fast subset scan’ approach, which maximizes the scan statistic over proximity-constrained subsets of locations, improving the timeliness of detection for irregularly shaped clusters. In the univariate event detection setting, many commonly used scan statistics satisfy the ‘linear-time subset scanning’ (LTSS) property, enabling exact and efficient detection of the highest-scoring space-time clusters.
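The LTSS property can be illustrated for the expectation-based Poisson scan statistic: sorting locations by the ratio of observed count to baseline guarantees that the highest-scoring subset is one of the n "prefix" subsets of that ordering, so only n subsets need evaluation instead of 2^n. The counts and baselines below are invented; this sketch omits the space-time and proximity-constraint machinery of the full method.

```python
# Sketch of linear-time subset scanning (LTSS) with the expectation-based
# Poisson statistic. Sort by c_i/b_i, then score only prefix subsets.
import math

def poisson_score(c, b):
    """Expectation-based Poisson log-likelihood ratio for totals c, b."""
    return c * math.log(c / b) + b - c if c > b else 0.0

def fast_subset_scan(counts, baselines):
    order = sorted(range(len(counts)),
                   key=lambda i: counts[i] / baselines[i], reverse=True)
    best_score, best_subset = 0.0, []
    c = b = 0.0
    for k, i in enumerate(order):          # evaluate the n prefix subsets
        c += counts[i]
        b += baselines[i]
        s = poisson_score(c, b)
        if s > best_score:
            best_score, best_subset = s, sorted(order[:k + 1])
    return best_score, best_subset

counts    = [20, 5, 30, 4]   # invented observed counts per location
baselines = [10, 6, 12, 5]   # invented expected counts
score, subset = fast_subset_scan(counts, baselines)
print(subset)  # the locations with elevated counts
```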

Objective

We extend the recently proposed ‘fast subset scan’ framework from univariate to multivariate data, enabling computationally efficient detection of irregular space-time clusters even when the numbers of spatial locations and data streams are large. These fast algorithms enable us to perform a detailed empirical comparison of two variants of the multivariate spatial scan statistic, demonstrating the tradeoffs between detection power and characterization accuracy.

Submitted by teresa.hamby@d… on
Description

Syndromic surveillance typically involves collecting time-stamped transactional data, such as patient triage or examination records or pharmacy sales. Such records usually span multiple categorical features, such as location, age group, gender, symptoms, chief complaints, drug category and so on. The key analytic objective is to identify potential disease clusters in recently observed data (for example, from the last week) as compared with a baseline (for example, derived from data observed over the previous few months). In real-world scenarios, a disease outbreak can affect any subset of categorical dimensions and any subset of values along each categorical dimension. As evaluating all possible outbreak hypotheses can be computationally challenging, popular state-of-the-art algorithms either limit the scope of the search to exclusively conjunctive definitions or focus only on detecting spatially co-located clusters. Further, it is common to see multiple disease outbreaks happening simultaneously and affecting overlapping subsets of dimensions and values. Most such algorithms focus on finding just the single most significant anomalous cluster corresponding to a possible disease outbreak, ignoring the possibility of a concurrent emergence of additional clusters.
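To see why the hypothesis space explodes, consider scanning just one categorical dimension: every subset of its values is a candidate cluster. The toy example below enumerates all subsets exhaustively, which is feasible only because the dimension is tiny; algorithms like DAD exist precisely because exhaustive enumeration across multiple dimensions does not scale. The counts and the expectation-based Poisson score are invented for illustration.

```python
# Toy illustration of subset-of-values outbreak hypotheses along one
# categorical dimension (age group), scored against baseline counts.
# Exhaustive enumeration is shown only for intuition; it is exponential
# in the number of values. All counts are invented.
from itertools import combinations
import math

recent   = {"child": 40, "adult": 22, "senior": 35}
baseline = {"child": 20, "adult": 21, "senior": 18}

def score(values):
    c = sum(recent[v] for v in values)
    b = sum(baseline[v] for v in values)
    return c * math.log(c / b) + b - c if c > b else 0.0

best = max(
    (frozenset(s) for r in range(1, len(recent) + 1)
     for s in combinations(recent, r)),
    key=score,
)
print(sorted(best))  # the subset of values with the most anomalous counts
```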

 

Objective

We present Disjunctive Anomaly Detection (DAD), a novel algorithm to detect multiple overlapping anomalous clusters in large sets of categorical time series data. We compare the performance of DAD and What’s Strange About Recent Events on disease surveillance data from the Sri Lanka Ministry of Health.

Submitted by hparton on