
Data Analytics

Description

Chief complaints are often represented textually, as a mixture of complex and context-dependent lexical symbols with little formal sentence structure. Although human experts usually comprehend this information in its proper context intuitively and effortlessly, the use of chief complaint data by computers is a challenge. Semantic approaches to text understanding are concerned with the meaning of terms and their relationships, derived from an explicit model rather than from their syntactic forms. Explicit representation of domain concepts, together with computer reasoning, enables a knowledgeable computer agent to identify those concepts in a given text and pinpoint relevant relationships when they make sense according to a formal model available to the agent.

Objective

This paper proposes a semantic approach to processing free-form text information such as chief complaints using formal knowledge representation and Description Logic reasoning. Our methods extract concepts and as much contextual information as is available in the text. The output is a computationally interpretable representation of this information using the Resource Description Framework (RDF) and the UMLS Metathesaurus.
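As a minimal sketch of such a pipeline (not the paper's implementation), the fragment below maps recognized complaint terms to UMLS concept identifiers and emits RDF triples with rdflib. The tiny concept table, the CUIs, and the namespace URIs are illustrative placeholders; a real system would query the Metathesaurus.

```python
# Sketch: map chief-complaint terms to UMLS concepts and emit RDF triples.
# The lookup table and namespaces are illustrative stand-ins only.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

UMLS = Namespace("http://example.org/umls/")  # hypothetical namespace
EX = Namespace("http://example.org/cc/")

# Stand-in concept dictionary; a real system would query the Metathesaurus.
CONCEPTS = {
    "chest pain": "C0008031",
    "shortness of breath": "C0013404",
}

def chief_complaint_to_rdf(cc_text: str) -> Graph:
    g = Graph()
    complaint = EX["complaint1"]
    g.add((complaint, RDF.type, EX.ChiefComplaint))
    g.add((complaint, RDFS.label, Literal(cc_text)))
    for term, cui in CONCEPTS.items():
        if term in cc_text.lower():
            g.add((complaint, EX.mentionsConcept, UMLS[cui]))
    return g

print(chief_complaint_to_rdf("Chest pain and shortness of breath")
      .serialize(format="turtle"))
```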

Description

Our purpose was to develop an ROC-style curve for public health surveillance, similar to those used in diagnostic testing. We developed syndromic surveillance algorithms with differing sensitivity and specificity for detecting the seasonal influenza-like illness (ILI) outbreak. For each algorithm we plotted days to detect the event against the number of false-positive alarms during the non-ILI season.
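A minimal sketch of how such a curve can be constructed, assuming a simple threshold detector and synthetic daily counts (the paper's algorithms and data are not reproduced here):

```python
# Sketch of an ROC-style surveillance curve: for each alert threshold,
# plot days-to-detection of the ILI onset against false alarms during
# the non-ILI season. All counts here are synthetic placeholders.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
baseline = rng.poisson(20, 200)                  # non-ILI season
outbreak = rng.poisson(20, 60) + np.arange(60)   # ramping ILI season
thresholds = range(22, 60, 4)

delays, false_alarms = [], []
for t in thresholds:
    hits = np.nonzero(outbreak > t)[0]
    delays.append(hits[0] + 1 if hits.size else len(outbreak))
    false_alarms.append(int((baseline > t).sum()))

plt.plot(false_alarms, delays, marker="o")
plt.xlabel("False alarms during non-ILI season")
plt.ylabel("Days to detect ILI onset")
plt.title("Timeliness vs. false-alarm trade-off")
plt.show()
```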

Description

Time series analysis is very common in syndromic surveillance. Large-scale biosurveillance systems typically perform thousands of time series queries per day: for example, monitoring nationwide over-the-counter (OTC) sales data may require separate time series analyses on tens of thousands of zip codes. More complex query types (e.g., queries over various combinations of patient age, gender, and other characteristics, or spatial scans performed over all potential disease clusters) may require millions of distinct queries. Commercial OLAP databases provide data cubes to handle such ad hoc queries, but these methods typically suffer from long build times (often hours), huge memory requirements (necessitating high-end database servers), and high maintenance costs. Additionally, data cubes typically require a second or more to respond to each complex query. This delay is an inconvenience for users who want to perform multiple queries in an online fashion; moreover, data cubes are far too slow for statistical analyses requiring millions of complex queries, which would take days of processing time.

Objective

We present T-Cube, a new tool for very fast retrieval and analysis of time series data. Using a novel method of data caching, T-Cube performs time series queries approximately 1,000 times faster than standard state-of-the-art data cube technologies. This speedup has two main benefits: it enables fast anomaly detection by simultaneous statistical analysis of many thousands of time series, and it allows public health users to perform many complex, ad hoc time series queries on the fly without inconvenient delays.
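The sketch below illustrates only the general caching idea, not T-Cube's actual data structure (which is a purpose-built tree of cached aggregates): per-key daily count vectors are precomputed once, so an arbitrary ad hoc query reduces to summing cached arrays instead of scanning raw records.

```python
# Minimal sketch of the caching idea behind fast time-series retrieval:
# precompute one daily-count vector per (zip, syndrome) key, so any
# ad hoc query becomes a vectorized sum rather than a raw-data scan.
from collections import defaultdict
import numpy as np

N_DAYS = 365

def build_cache(records):
    """records: iterable of (day_index, zipcode, syndrome) tuples."""
    cache = defaultdict(lambda: np.zeros(N_DAYS, dtype=np.int64))
    for day, zipcode, syndrome in records:
        cache[(zipcode, syndrome)][day] += 1
    return cache

def query(cache, zipcodes=None, syndromes=None):
    """Return the aggregate daily time series for any key combination."""
    total = np.zeros(N_DAYS, dtype=np.int64)
    for (z, s), series in cache.items():
        if (zipcodes is None or z in zipcodes) and \
           (syndromes is None or s in syndromes):
            total += series
    return total

cache = build_cache([(0, "15213", "ILI"), (0, "15213", "GI"),
                     (1, "15217", "ILI")])
print(query(cache, syndromes={"ILI"})[:5])
```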

Description

With the widespread deployment of near-real-time population health monitoring, there is increasing focus on spatial cluster detection for identifying disease outbreaks. These spatial epidemiologic methods rely on knowledge of patient location to detect unusual clusters. In hospital administrative data, patient location is collected as the home address, but use of this precise location raises privacy concerns. Regional locations, such as the center points of zip codes, have been deployed in many existing systems. However, this practice can distort the geographic properties of the raw data and affect subsequent spatial analyses. The impact of the location error introduced by centroid assignment on the statistical analyses underlying these systems requires study.


Objective

To investigate the impact of address precision (exact latitude and longitude versus the center points of zip codes) on spatial cluster detection.
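A small sketch of how this location error can be quantified, assuming hypothetical coordinates: it measures the displacement introduced when exact addresses are snapped to a zip code's center point.

```python
# Sketch: displacement introduced by replacing exact home coordinates
# with the zip-code centroid. All coordinates are illustrative.
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius, km
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# hypothetical exact patient locations sharing one zip code, plus its centroid
exact = np.array([[42.361, -71.057], [42.372, -71.039], [42.355, -71.071]])
centroid = np.array([42.363, -71.055])

errors = haversine_km(exact[:, 0], exact[:, 1], centroid[0], centroid[1])
print("mean displacement (km):", errors.mean().round(3))
```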

Description

Modern surveillance systems use statistical process control (SPC) charts such as Cumulative Sum and Exponentially Weighted Moving Average charts to monitor daily counts of quantities such as ICD-9 codes from ED visits, sales of medications, and doctors’ office visits. The working assumption is that such pre-clinical data contain an early signature of disease outbreaks, manifested as an increase in the count levels. However, direct application of SPC charts to the raw counts leads to unreliable performance. A popular statistical solution is to precondition the data before applying the charts by modeling or removing explainable patterns and then monitoring the residuals. Although the general idea is common practice, the specifics of how to identify the explainable components and how to account for them are domain-specific. We therefore present a set of modeling and data-driven tools that are useful for syndromic data.


Objective

SPC charts are widely used in disease surveillance. The charts are very effective when the monitored data meet the requirements of temporal independence, stationarity, and normality. When these assumptions are violated, however, SPC charts will either fail to detect special-cause variations or will alert frequently even in the absence of anomalies. Currently collected biosurveillance data contain predictable factors such as day-of-week effects, seasonal effects, holidays, autocorrelation, and global trends that cause the data to violate these assumptions. This work (1) describes a set of tools for identifying such explainable patterns and (2) examines several data preconditioning methods that account for these factors, yielding data better suited for monitoring by traditional SPC charts.
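As an illustration of the general preconditioning idea (not the paper's specific models), the sketch below removes a linear trend and day-of-week effects by regression and then applies an EWMA chart to the residuals; the smoothing constant, control limit, and synthetic data are all illustrative.

```python
# Sketch: precondition daily counts by regressing out trend and
# day-of-week effects, then monitor the residuals with an EWMA chart.
import numpy as np

rng = np.random.default_rng(1)
n = 28 * 10
t = np.arange(n)
dow = t % 7
counts = 50 + 0.05 * t + 8 * (dow < 5) + rng.poisson(5, n)  # synthetic

# regress counts on intercept, trend, and day-of-week dummies
X = np.column_stack([np.ones(n), t] +
                    [(dow == d).astype(float) for d in range(6)])
beta, *_ = np.linalg.lstsq(X, counts, rcond=None)
resid = counts - X @ beta

# EWMA on residuals: z_t = lam * x_t + (1 - lam) * z_{t-1}
lam = 0.3
ewma = np.zeros(n)
for i in range(1, n):
    ewma[i] = lam * resid[i] + (1 - lam) * ewma[i - 1]

# asymptotic EWMA standard deviation: sigma * sqrt(lam / (2 - lam))
sigma = resid.std(ddof=1) * np.sqrt(lam / (2 - lam))
print("alarm days:", np.nonzero(ewma > 3 * sigma)[0])
```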

Description

Modern biosurveillance relies on multiple sources of both prediagnostic and diagnostic data, updated daily, to discover disease outbreaks. Intrinsic to this effort are two assumptions: (1) the data being analyzed contain early indicators of a disease outbreak and (2) the outbreaks to be detected are not known a priori. However, in addition to outbreak indicators, syndromic data streams include such factors as day-of-week effects, seasonal effects, autocorrelation, and global trends. These explainable factors obscure unexplained outbreak events, and their presence in the data violates standard control-chart assumptions. Monitoring tools such as Shewhart, cumulative sum, and exponentially weighted moving average control charts will alert based largely on these explainable factors instead of on outbreaks. The goal of this paper is twofold: first, to describe a set of tools for identifying explainable patterns such as temporal dependence and, second, to survey and examine several data preconditioning methods that significantly reduce these explainable factors, yielding data better suited for monitoring using the popular control charts.
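A companion sketch, under the same caveats as the regression-plus-EWMA example above, applies a one-sided CUSUM to preconditioned residuals; the reference value k and decision interval h are illustrative choices.

```python
# Sketch: one-sided upper CUSUM on standardized, preconditioned residuals.
import numpy as np

def cusum(residuals, k=0.5, h=4.0):
    """Return alarm days for a one-sided upper CUSUM."""
    z = (residuals - residuals.mean()) / residuals.std(ddof=1)
    s, alarms = 0.0, []
    for i, zi in enumerate(z):
        s = max(0.0, s + zi - k)   # accumulate excess above reference value
        if s > h:
            alarms.append(i)
            s = 0.0                # reset after signaling
    return alarms

rng = np.random.default_rng(2)
resid = rng.normal(0, 1, 120)
resid[100:] += 1.5                 # injected mean shift to detect
print("alarm days:", cusum(resid))
```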

Description

To recognize outbreaks so that early interventions can be applied, BioSense uses a modification of the EARS C2 method, stratifying the days used to calculate the expected value by weekend versus weekday and including a rate-based method that accounts for total visits. These modifications produce lower residuals (observed minus expected counts), but their effect on sensitivity has not been studied.


Objective

To evaluate several variations of a commonly used control chart method for detecting injected signals in two BioSense System datasets.
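A sketch of a C2-style statistic with the weekend/weekday stratification described above (the rate-based variant is omitted); the baseline length, guard band, and alert threshold follow the usual C2 convention but should be read as illustrative.

```python
# Sketch: C2-style alerting where the baseline for each day uses only
# prior days of the same type (weekday or weekend), with a 2-day guard.
import numpy as np

def stratified_c2(counts, weekdays, baseline_days=7, guard=2, threshold=3.0):
    """counts: daily counts; weekdays: bool array, True = Mon-Fri."""
    alarms = []
    for t in range(len(counts)):
        same_type = [i for i in range(t - guard - 1, -1, -1)
                     if weekdays[i] == weekdays[t]][:baseline_days]
        if len(same_type) < baseline_days:
            continue                      # not enough history yet
        base = counts[np.array(same_type)]
        mu, sd = base.mean(), base.std(ddof=1)
        if sd > 0 and (counts[t] - mu) / sd > threshold:
            alarms.append(t)
    return alarms

rng = np.random.default_rng(3)
days = np.arange(180)
weekdays = (days % 7) < 5
counts = rng.poisson(np.where(weekdays, 40, 15)).astype(float)
counts[150] += 30                         # injected signal
print("alarm days:", stratified_c2(counts, weekdays))
```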

Description

Irregularly shaped cluster finders frequently end up with a solution consisting of a large zone z spreading through the map, which is merely a collection of the highest-valued regions rather than a geographically sound cluster. One way to mitigate this problem is to introduce penalty functions that limit the excessive freedom of shape of z. The compactness penalty K(z) is a function that reduces the scan value of irregularly shaped clusters based on their geometric shape. Another penalty is the cohesion function C(z), a measure of the absence of weak links: underpopulated regions within the cluster that disconnect it when removed. It has been noted that such weak links can be responsible for diminished power of detection in cluster finder algorithms. Methods using these penalty functions exhibit better performance. The geometric compactness penalty is not entirely satisfactory, however, because it tends to avoid potentially interesting irregularly shaped clusters, acting as a low-pass filter. The cohesion penalty, in turn, has slightly lower specificity.


Objective

We introduce a novel spatial scan algorithm for finding irregularly shaped disease clusters by simultaneously maximizing two objectives: the regularity of shape and the internal cohesion of the cluster.
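As an illustration, the sketch below uses one common geometric compactness definition, a Polsby-Popper-style ratio of the zone's area to that of the circle with the same perimeter, as a multiplicative penalty on a (stubbed) log-likelihood-ratio scan value. The paper's two-objective optimization itself is not shown.

```python
# Sketch: geometric compactness penalty K(z) applied to a scan value.
import math

def compactness(area, perimeter):
    """K(z) in (0, 1]: 1 for a circle, smaller for irregular shapes."""
    return 4 * math.pi * area / perimeter ** 2

def penalized_score(log_lr, area, perimeter):
    # multiplicative penalty on the scan value, as in compactness-corrected scans
    return log_lr * compactness(area, perimeter)

# a compact zone scores higher than an equally likely but sprawling one
print(penalized_score(10.0, area=4.0, perimeter=8.0))   # near-square zone
print(penalized_score(10.0, area=4.0, perimeter=20.0))  # elongated zone
```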

Description

We hypothesize that epidemics around their onset tend to affect primarily a well-defined subgroup of the overall population that is for some reason particularly susceptible. While the vulnerable cohort is often well described for many human diseases, this is not the case for instance when we wish to detect a novel computer virus. Clustering may be used to define the subgroups that will be tested for over-density of symptom occurrence. The clustering slowly changes in response to changes in the population.


Objective

This paper describes a method of detecting a slowly growing signal in a large population, based on clustering the population into subgroups that are more homogeneous in their susceptibility to the infectious agent.
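A minimal sketch of the subgroup idea, assuming hypothetical susceptibility features: cluster the population with k-means, then test each cluster for an excess of symptomatic members using a Poisson tail probability. Features, rates, and the injected signal are all illustrative.

```python
# Sketch: cluster on susceptibility-related features, then test each
# cluster for over-density of symptom occurrence.
import numpy as np
from scipy.stats import poisson
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
features = rng.normal(size=(1000, 3))          # susceptibility covariates
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)

baseline_rate = 0.02                           # expected symptom rate
symptomatic = rng.random(1000) < baseline_rate
# weak injected signal confined to one subgroup
mask = labels == 2
symptomatic[mask] |= rng.random(mask.sum()) < 0.05

for c in range(5):
    n = (labels == c).sum()
    k = symptomatic[labels == c].sum()
    p = poisson.sf(k - 1, n * baseline_rate)   # P(X >= k) under baseline
    print(f"cluster {c}: {k}/{n} symptomatic, p = {p:.4f}")
```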

Description

Free-text chief complaints (CCs), which may be recorded in different languages, are an important data source for syndromic surveillance systems. For automated syndromic surveillance, CCs must be classified into predefined syndromic categories to facilitate subsequent data aggregation and analysis. However, CCs in different languages pose technical challenges for the development of multilingual CC classifiers. We addressed these challenges by first developing an ontology-enhanced CC classifier that exploits semantic relations in the Unified Medical Language System (UMLS) to expand the knowledge of a rule-based CC classifier. Building on the ontology-enhanced English CC classifier, a translation module was incorporated to extract symptom-related information from Chinese CCs and translate it into English. This design enables the processing of CCs in both English and Chinese.

Objective  

This paper describes the effort to design and implement a chief complaint (CC) classification system that is capable of processing CCs in both English and Chinese.
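A toy sketch of the two-stage design, with tiny lookup tables standing in for the UMLS-based synonym expansion and the translation module; the keyword-to-syndrome mappings are illustrative, not the system's actual rules.

```python
# Sketch: translate Chinese symptom terms to English, then classify with
# a rule-based, synonym-expanded keyword table (stand-in for UMLS).
SYNONYMS = {                       # "UMLS-expanded" keyword -> syndrome
    "fever": "constitutional", "pyrexia": "constitutional",
    "cough": "respiratory", "dyspnea": "respiratory",
}
ZH_TO_EN = {"发烧": "fever", "咳嗽": "cough"}   # translation module stub

def classify_cc(cc: str) -> set[str]:
    # translate any known Chinese terms to English first
    for zh, en in ZH_TO_EN.items():
        cc = cc.replace(zh, en)
    tokens = cc.lower().replace(",", " ").split()
    return {SYNONYMS[t] for t in tokens if t in SYNONYMS}

print(classify_cc("fever and cough"))   # {'constitutional', 'respiratory'}
print(classify_cc("发烧, 咳嗽"))          # same result via translation
```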
