Text Mining

Description

Despite considerable effort since the turn of the century to develop Natural Language Processing (NLP) methods and tools for detecting negated terms in chief complaints, few standardised methods have emerged. Those that have (e.g., the NegEx algorithm) are confined to local implementations with customised solutions. Important reasons for this lack of progress include (a) limited shareable datasets for developing and testing methods, (b) jurisdictional data silos, and (c) the gap between resource-constrained public health practitioners and technical solution developers, typically university researchers and industry developers. To address these three problems, ISDS, funded by a grant from the Defense Threat Reduction Agency, organised a consultancy meeting at the University of Utah designed to bring together (a) representatives from public health departments, (b) university researchers focused on the development of computational methods for public health surveillance, (c) members of public-health-oriented non-governmental organisations, and (d) industry representatives, with the goal of developing a roadmap for the development of validated, standardised and portable resources (methods and datasets) for negation detection in clinical text used for public health surveillance.

Objective

This abstract describes an ISDS initiative to bring together public health practitioners and analytics solution developers from both academia and industry to define a roadmap for the development of algorithms, tools, and datasets to improve the capabilities of current text processing algorithms to identify negated terms (i.e. negation detection).
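As a concrete illustration of the kind of method the roadmap targets, a NegEx-style rule can be sketched in a few lines: a negation trigger opens a scope that negates target terms until a termination word ends it. The trigger and termination lists below are illustrative stand-ins, not the published NegEx lexicon.

```python
import re

# Sketch of a NegEx-style rule (an assumed simplification, not the published
# NegEx algorithm): a negation trigger opens a scope that marks subsequent
# target terms as negated, until a termination term or the end of the text.
NEGATION_TRIGGERS = {"no", "denies", "without", "not"}
TERMINATION_TERMS = {"but", "however", "although"}

def negated_terms(chief_complaint, targets):
    """Return the target terms that fall inside a negated scope."""
    tokens = re.findall(r"[a-z]+", chief_complaint.lower())
    negated, in_scope = set(), False
    for tok in tokens:
        if tok in NEGATION_TRIGGERS:
            in_scope = True          # open a negation scope
        elif tok in TERMINATION_TERMS:
            in_scope = False         # close the scope
        elif in_scope and tok in targets:
            negated.add(tok)
    return negated
```

For example, `negated_terms("denies fever but reports cough", {"fever", "cough"})` marks only `fever` as negated, because `but` terminates the scope opened by `denies`.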

Description

Ontologies representing knowledge from the public health and surveillance domains currently exist. However, they focus on infectious diseases (the Infectious Disease Ontology), reportable diseases (PHSkb, now retired), or internet surveillance of news text (the BioCaster ontology), or are commercial products (the OntoReason public health ontology). From the perspective of biosurveillance text mining, these ontologies do not adequately represent the kind of knowledge found in clinical reports. Our project aims to fill this gap by developing a stand-alone ontology for the public health/biosurveillance domain which (1) provides a starting point for standards development, (2) is straightforward for public health professionals to use for text analysis, and (3) can be easily plugged into existing syndromic surveillance systems.

 

Objective

To develop an application ontology - the extended syndromic surveillance ontology - to support text mining of ER and radiology reports for public health surveillance. The ontology encodes syndromes, diagnoses, symptoms, signs and radiology results relevant to syndromic surveillance (with a special focus on bioterrorism).
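To illustrate the "plug in" use case, here is a minimal sketch of how a surveillance pipeline might consult such an ontology to map report terms to syndrome classes. The three-concept table is invented for illustration; it is not the actual extended syndromic surveillance ontology.

```python
# Minimal sketch of consulting an application ontology from a syndromic
# surveillance pipeline: terms recognised in a report are mapped to concept
# classes and their associated syndromes. The tiny concept table below is
# illustrative only, not the extended syndromic surveillance ontology.
ONTOLOGY = {
    "cough": {"class": "Symptom", "syndromes": {"respiratory"}},
    "infiltrate": {"class": "RadiologyFinding", "syndromes": {"respiratory"}},
    "diarrhea": {"class": "Symptom", "syndromes": {"gastrointestinal"}},
}

def syndromes_for_report(terms):
    """Union of syndromes implied by the recognised terms in one report."""
    found = set()
    for term in terms:
        concept = ONTOLOGY.get(term.lower())
        if concept:
            found |= concept["syndromes"]
    return found
```

A downstream system can then count reports per syndrome without embedding any clinical vocabulary of its own.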

Description

Emerging event detection is the process of automatically identifying novel and emerging ideas from text with minimal human intervention. With the rise of social networks like Twitter, topic detection has begun leveraging measures of user influence to identify emerging events. Twitter's highly skewed follower/followee structure lends itself to an intuitive model of influence, yet in a context like the Emerging Infections Network (EIN), a sentinel surveillance listserv of over 1,400 infectious disease experts, developing a useful model of authority is less clear-cut. Who should we listen to on the EIN? To explore this, we annotated a body of important EIN discussions and tested how well three models of user authority performed in identifying those discussions. In previous work we proposed a process by which only posts on specific "important" topics are read, drastically reducing the number of posts that must be read: a set of "bellwether" users is found who act as indicators for "important" topics, and only posts relating to those topics are then read. That approach considers only the patterns of user participation, not the text of messages. Our text analysis approach follows that of Cataldi et al. [1], using the idea of semantic "energy" to identify emerging topics within Twitter posts. Authority is calculated via PageRank and used to weight each author's contribution to the semantic energy of all terms occurring within some time interval t_i. A decay parameter d defines the impact of prior time steps on the current interval.
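A simplified sketch of the authority-weighted energy computation might look as follows. The authority scores, posts, and the exact decay scheme are illustrative; this is not Cataldi et al.'s full formulation.

```python
# Simplified sketch of authority-weighted term "energy" in the spirit of
# Cataldi et al.: each author's PageRank score weights the terms they use in
# interval t_i, and a decay parameter d discounts energy carried over from
# earlier intervals. Scores and posts here are illustrative values.
def term_energy(posts, authority, prior_energy, d=0.5):
    """posts: list of (author, tokens); returns {term: energy} for t_i."""
    energy = {term: d * e for term, e in prior_energy.items()}  # decayed history
    for author, tokens in posts:
        weight = authority.get(author, 0.0)   # e.g. a PageRank score
        for tok in tokens:
            energy[tok] = energy.get(tok, 0.0) + weight
    return energy
```

Emerging topics would then be flagged where a term's energy rises sharply relative to its decayed history.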

Objective

To explore how different models of user influence or authority perform when detecting emerging events within a small-scale community of infectious disease experts.

Description

An expanded ambulatory health record, the Comprehensive Ambulatory Patient Encounter Record (CAPER), will provide multiple types of data for use in DoD ESSENCE. A new type of data not previously available is the Reason for Visit (ROV), a free-text field analogous to the Chief Complaint (CC). Intake personnel ask patients why they have come to the clinic and record their responses. Traditionally, the text should reflect the patient's actual statement; in reality, the staff often "translate" the statement and add jargon. Text parsing maps key words or phrases to specific syndromes; challenges exist given the vagaries of the English language and local idiomatic usage. Still, CC analysis by text parsing has been successful in civilian settings [1]. However, it was necessary to modify the parsing to reflect the characteristics of CAPER data and of the covered population. For example, consider the Shock/Coma syndrome. Loss of consciousness is relatively common in military settings due to prolonged standing, exertion in hot weather with dehydration, etc., whereas the main concern is shock/coma due to infectious causes. To reduce false-positive mappings, the parser now excludes terms such as syncope, fainting, electric shock, road march, parade formation, immunization, blood draw, diabetes, and hypoglycemic.
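The exclusion logic can be sketched roughly as follows; both term lists are illustrative excerpts drawn from the examples above, not the fielded parser's vocabulary.

```python
# Rough sketch of syndrome mapping with exclusion terms, as described for the
# CAPER Reason for Visit parser: a record maps to Shock/Coma only when a key
# phrase matches and no exclusion term is present. Both lists are
# illustrative excerpts, not the actual parser's term sets.
SHOCK_COMA_KEYWORDS = {"shock", "coma", "unconscious", "unresponsive"}
SHOCK_COMA_EXCLUSIONS = {"syncope", "fainting", "electric", "road march",
                         "parade", "immunization", "blood draw", "diabetes",
                         "hypoglycemic"}

def maps_to_shock_coma(reason_for_visit):
    """True if the free text maps to Shock/Coma after exclusion filtering."""
    text = reason_for_visit.lower()
    if any(excl in text for excl in SHOCK_COMA_EXCLUSIONS):
        return False          # benign military-setting cause; suppress mapping
    return any(kw in text for kw in SHOCK_COMA_KEYWORDS)
```

So "passed out after road march" is suppressed, while "found unresponsive" still maps to the syndrome.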

Objective

Rather than rely on diagnostic codes as the core data source for alert detection, this project sought to develop a Chief Complaint (CC) text parser to use in the U.S. Department of Defense (DoD) version of the Electronic Surveillance System for Early Notification of Community-Based Epidemics (ESSENCE), thereby providing an alternate evidence source. A secondary objective was to compare the diagnostic and CC data sources for complementarity.

Description

PyConTextKit is a web-based platform that extracts entities from clinical text and provides relevant metadata - for example, whether the entity is negated or hypothetical - using simple lexical clues occurring in the window of text surrounding the entity. The system provides a flexible framework for clinical text mining, which in turn expedites the development of new resources and simplifies the resulting analysis process. PyConTextKit is an extension of an existing Python implementation of the ConText algorithm, which has been used successfully to identify patients with an acute pulmonary embolism and to identify patients with findings consistent with seven syndromes. Public health practitioners are beginning to have access to clinical symptoms, findings, and diagnoses from the EMR. Making use of these data is difficult because much of it is free text. Natural language processing techniques can be leveraged to make sense of this text, but such techniques often require technical expertise. PyConTextKit provides a web-based interface that makes it easier for the user to perform concept identification for surveillance.
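The windowed-cue idea behind ConText can be sketched as follows; the cue lists and window size are illustrative, not the ConText lexicon or PyConTextKit's implementation.

```python
# Minimal sketch of windowed cue checking in the spirit of ConText: for each
# entity found in a report, scan a fixed window of preceding tokens for
# negation or hypothetical cues and attach the result as metadata. Cue lists
# and window size are illustrative, not the ConText lexicon.
NEGATION_CUES = {"no", "denies", "without", "negative"}
HYPOTHETICAL_CUES = {"if", "possible", "rule", "suspect"}

def annotate_entity(tokens, entity_index, window=4):
    """Return metadata flags for the entity at tokens[entity_index]."""
    start = max(0, entity_index - window)
    context = set(tokens[start:entity_index])   # preceding window only
    return {
        "entity": tokens[entity_index],
        "negated": bool(context & NEGATION_CUES),
        "hypothetical": bool(context & HYPOTHETICAL_CUES),
    }
```

For instance, in "patient denies chest pain" the entity "pain" is flagged negated, while in "possible pneumonia noted" the entity "pneumonia" is flagged hypothetical.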

 

Objective

We describe the development of a web-based application - PyConTextKit - to support text mining of clinical reports for public health surveillance.

Description

Event-based biosurveillance is the practice of monitoring diverse information sources to detect events pertaining to human, plant, and animal health. Online documents, such as news articles, newsletters, and (micro-)blog entries, are primary information sources for this practice. Document classification is an important step in filtering this information, and machine learning methods have been applied successfully to the task.
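As a toy example of such a filtering step, a multinomial Naive Bayes classifier can be trained to separate relevant from irrelevant documents; the training data and labels below are invented for illustration, and this is not any specific deployed system.

```python
import math
from collections import Counter, defaultdict

# Toy multinomial Naive Bayes for filtering biosurveillance documents into
# "relevant" vs "irrelevant". A minimal sketch of machine-learning document
# classification; training data and labels are made up for illustration.
def train(docs):
    """docs: list of (label, tokens). Returns label priors and term counts."""
    counts = defaultdict(Counter)
    labels = Counter()
    for label, tokens in docs:
        labels[label] += 1
        counts[label].update(tokens)
    return labels, counts

def classify(tokens, labels, counts, alpha=1.0):
    """Pick the label maximising log P(label) + sum log P(token|label)."""
    vocab = {t for c in counts.values() for t in c}
    total = sum(labels.values())
    best, best_lp = None, -math.inf
    for label in labels:
        lp = math.log(labels[label] / total)
        denom = sum(counts[label].values()) + alpha * len(vocab)
        for tok in tokens:
            lp += math.log((counts[label][tok] + alpha) / denom)  # Laplace smoothing
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

In practice the same structure scales to thousands of labelled news items, which is where the challenges surveyed in this review arise.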

 

Objective

The objective of this literature review is to identify current challenges in document classification for event-based biosurveillance and to consider the efforts needed and the research opportunities.

Description

Commonly used syndromic surveillance methods based on the spatial scan statistic first classify disease cases into broad, pre-existing symptom categories ("prodromes") such as respiratory or fever, then detect spatial clusters where the recent case count of some prodrome is unexpectedly high. Novel emerging infections may have very specific and anomalous symptoms which should be easy to detect even if the number of cases is small. However, typical spatial scan approaches may fail to detect a novel outbreak if the resulting cases are not classified to any known prodrome. Alternatively, detection may be delayed because cases are lumped into an overly broad prodrome, diluting the outbreak signal.
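The prodrome-bucketing step, and the failure mode it creates, can be sketched as follows; the keyword lists, regions, and alarm threshold are illustrative.

```python
from collections import Counter

# Sketch of the prodrome-based counting step that the semantic scan statistic
# generalises: cases are bucketed by (region, prodrome), and a region alarms
# when a recent count far exceeds its baseline. A case whose complaint fits
# no prodrome contributes nothing -- the failure mode described above.
# Keyword lists, regions, and the threshold are illustrative.
PRODROME_KEYWORDS = {"respiratory": {"cough", "wheezing"},
                     "fever": {"fever", "febrile"}}

def bucket_cases(cases):
    """cases: list of (region, complaint_tokens) -> Counter keyed by (region, prodrome)."""
    counts = Counter()
    for region, tokens in cases:
        for prodrome, keywords in PRODROME_KEYWORDS.items():
            if set(tokens) & keywords:
                counts[(region, prodrome)] += 1
        # anomalous complaints matching no prodrome are silently dropped
    return counts

def alarms(counts, baseline, ratio=2.0):
    """Flag (region, prodrome) pairs whose count exceeds ratio x baseline."""
    return [key for key, n in counts.items()
            if n >= ratio * baseline.get(key, 1.0)]
```

Note how a cluster of cases with a novel complaint (one matching no prodrome keyword) would never raise an alarm here, which is exactly the gap the semantic scan statistic is designed to close.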

 

Objective

We propose a new text-based spatial event detection method, the semantic scan statistic, which uses free-text data from Emergency Department chief complaints to detect, localize, and characterize newly emerging outbreaks of disease.

Objective

This analysis had two objectives. First, to apply text processing methods to free-text clinician notes extracted from the VA electronic medical record for automated detection of influenza-like illness (ILI). Second, to determine whether data from free-text clinical documents can enhance the predictive ability of case detection models based on coded data.

Description

Protecting U.S. animal populations requires constant monitoring of disease events and of conditions that might lead to disease emergence, both domestically and globally. Since 1999, the Center for Emerging Issues (CEI) has actively monitored global information sources to provide early detection, impact assessments, and increased awareness of emerging disease events and conditions. The importance of these activities was reinforced after September 11, 2001, and these processes are now part of the U.S. Department of Agriculture's response to Homeland Security Presidential Directive 9. Electronic information sources available through the Internet have recently changed the way animal health information is gathered, processed, and shared. To respond to these changes, CEI developed a dynamic system containing automated and semi-automated components that process information from various sources to identify, track, and evaluate emerging disease situations.

 

Objective

This paper describes a system of automatic and semiautomatic processes for data gathering, assessment, and event tracking used by the CEI to enhance monitoring of global animal health events and conditions.

Description

Case detection from chief complaints suffers from low to moderate sensitivity. Emergency Department (ED) reports contain detailed clinical information that could improve case detection ability and enhance outbreak characterization. We developed a text processing system called Topaz that could be used to answer questions from ED reports, such as: How many new patients have come to the ED with acute lower respiratory symptoms? Of the respiratory patients, how many had a productive cough or wheezing? How many of the respiratory patients have a past history of asthma?
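Assuming report-level extractions of the kind such a system might produce, the questions above reduce to simple filters over structured records; every record and field name here is hypothetical, for illustration only.

```python
# Hypothetical structured extractions from ED reports, of the kind a text
# processing system such as Topaz might emit. Records and field names are
# invented for illustration; they are not Topaz's actual output format.
reports = [
    {"id": 1, "findings": {"acute lower respiratory"}, "history": set()},
    {"id": 2, "findings": {"acute lower respiratory", "productive cough"},
     "history": {"asthma"}},
    {"id": 3, "findings": {"abdominal pain"}, "history": set()},
]

# The surveillance questions above become simple filters over the records.
respiratory = [r for r in reports if "acute lower respiratory" in r["findings"]]
with_cough = [r for r in respiratory if "productive cough" in r["findings"]]
with_asthma = [r for r in respiratory if "asthma" in r["history"]]
```

Once the free text has been reduced to such structured extractions, answering "how many respiratory patients had a productive cough?" is a one-line query.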

 

Objective

To evaluate how well a text processing system called Topaz can identify acute episodes of 55 clinical conditions described in ED notes.
