Cluster Detection

Description

The New York City Department of Health and Mental Hygiene (NYC DOHMH) collects data daily from 50 of 61 (82%) emergency departments (EDs) in NYC, representing 94% of all ED visits (average ~10,000 visits per day). The information collected includes the date and time of visit, age, sex, home zip code and chief complaint of each patient. Observations are assigned to syndromes based on the chief complaint field and are analyzed using SaTScan to identify statistically significant clusters of syndromes at the zip code and hospital level. SaTScan employs a circular spatial scan statistic, so clusters that are not circular may be more difficult to detect. FleXScan employs a flexible scan statistic based on an adjacency matrix design.

 

Objective

To use the NYC DOHMH's ED syndromic surveillance data to evaluate FleXScan’s flexible scan statistic and compare it to results from the SaTScan circular scan. A second objective is to improve cluster detection by improving the geographic characteristics of the input files.

Submitted by elamb on
Description

The space-time scan statistic is a powerful statistical tool for prospective disease surveillance. It searches over a set of spatio-temporal regions (each representing some spatial area S for the last k days), finding the most significant regions (S, k) by maximizing a likelihood ratio statistic, and computing p-values of these potential clusters by randomization.

The standard, "population-based" method assumes that, for each spatial location s_i on each day t, we have a population p_i^t and a count (observed number of cases) c_i^t. Then, under the null hypothesis of no clusters, we expect each count c_i^t to be proportional to its population p_i^t. We then search for regions (S, k) with disease rate (cases per unit population) significantly higher inside the region than outside. In the original space-time scan statistic, the populations are assumed to be given, and in, populations are estimated assuming independence of space and time.

Here we propose an alternative, "expectation-based" method, in which we infer the expected number of cases b_i^t in each spatial location, based on the time series of previous counts. In this case, under the null hypothesis of no clusters, we expect each count c_i^t to be equal to b_i^t, rather than proportional to population. We then search for regions (S, k) with counts that are significantly higher than expected.
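
The expectation-based idea can be illustrated with a minimal Python sketch. It assumes a Poisson model for the counts and a simple moving-average baseline; neither choice is stated in the abstract, and the function names are hypothetical. The score for a region (S, k) is the log-likelihood ratio C log(C/B) + B - C when the aggregate count C exceeds the aggregate baseline B, and 0 otherwise.

```python
import math


def moving_average_baseline(history, window=7):
    """Expected count for the next day: mean of the last `window` counts.

    Assumption: the abstract only says baselines are inferred from the
    time series of previous counts; a moving average is one simple choice.
    """
    recent = history[-window:]
    return sum(recent) / len(recent)


def ebp_log_likelihood_ratio(count, baseline):
    """Expectation-based Poisson score for aggregate count C and baseline B."""
    if baseline <= 0 or count <= baseline:
        return 0.0
    return count * math.log(count / baseline) + baseline - count


def score_region(counts, baselines, region, k):
    """Score region (S, k): locations in `region`, over the last k days.

    counts[i] and baselines[i] are per-day lists for location i,
    with the most recent day last.
    """
    total_count = sum(sum(counts[i][-k:]) for i in region)
    total_baseline = sum(sum(baselines[i][-k:]) for i in region)
    return ebp_log_likelihood_ratio(total_count, total_baseline)
```

A full scan would evaluate this score over every candidate region and duration, then obtain p-values by randomization as described above; that step is omitted here.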

 

Objective

This paper describes a new class of space-time scan statistics designed for rapid detection of emerging disease clusters. We evaluate these methods on the task of prospective disease surveillance, and show that our methods consistently outperform the standard space-time scan statistic approach.

Submitted by Sandra.Gonzale… on
Description

The increasing use of the Internet to arrange sexual encounters presents challenges to public health agencies formulating STD interventions, particularly in the context of anonymous encounters. These encounters complicate or break traditional interventions. In previous work [1], we examined a corpus of anonymous personal ads seeking sexual encounters from the classifieds website Craigslist and presented a way of linking multiple ads posted across time to a single author. The key observation of our approach is that some ads are simply reposts of older ads, often updated with only minor textual changes. Under the presumption that these ads, when not spam, originate from the same author, we can use efficient near-duplicate detection techniques to cluster ads within some threshold similarity. Linking ads in this way allows us to preserve the anonymity of authors while still extracting useful information on the frequency with which authors post ads, as well as the geographic regions in which they seek encounters. While this process detects many clusters, the lack of a true corpus of authorship-linked ads makes it difficult to validate and tune the parameters of our system. Fortunately, many ad authors provide an obfuscated telephone number in ad text (e.g., 867-5309 becomes 8sixseven5three oh nine) to bypass Craigslist filters, which prohibit including phone numbers in personal ads. By matching phone numbers of this type across all ads, we can create a corpus of ad clusters known to be written by a single author. This authorship corpus can then be used to evaluate and tune our existing near-duplicate detection system, and in the future identify features for more robust authorship attribution techniques.
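
The phone-number matching step can be sketched in Python as follows. This is an illustrative assumption about how such matching might work, not the authors' implementation: the word list, the separator characters, and the 7- or 10-digit length rule are all hypothetical choices.

```python
import re
from collections import defaultdict

# Assumed spelling variants; real ads may use many more obfuscations.
WORD_TO_DIGIT = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}
TOKEN = r"(?:\d|" + "|".join(WORD_TO_DIGIT) + r")"
TOKEN_RE = re.compile(TOKEN)
# A run of at least 7 digit tokens, optionally separated by spaces or
# punctuation. The lookbehind stops a run from starting inside a word,
# so the "one" in "phone" is not read as a digit.
RUN_RE = re.compile(r"(?<![a-z0-9])(?:" + TOKEN + r"[\s\-.()]*){7,}")


def extract_phone_numbers(text):
    """Return the set of 7- or 10-digit numbers found in obfuscated text."""
    numbers = set()
    for run in RUN_RE.finditer(text.lower()):
        tokens = TOKEN_RE.findall(run.group())
        digits = "".join(WORD_TO_DIGIT.get(tok, tok) for tok in tokens)
        if len(digits) in (7, 10):
            numbers.add(digits)
    return numbers


def group_ads_by_phone(ads):
    """Map each phone number to the ids of the ads that contain it.

    `ads` maps ad id -> ad text. Only numbers shared by two or more ads
    are kept, since a single ad cannot link anything.
    """
    groups = defaultdict(list)
    for ad_id, text in ads.items():
        for number in extract_phone_numbers(text):
            groups[number].append(ad_id)
    return {num: ids for num, ids in groups.items() if len(ids) > 1}
```

Each resulting group is a set of ads presumed to share one author, which can serve as reference clusters when tuning the similarity threshold of a near-duplicate detector.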

Objective:

This paper constructs an authorship-linked collection, or corpus, of anonymous, sex-seeking ads found on the classifieds website Craigslist. This corpus is then used to validate an authorship attribution approach based on identifying near-duplicate text in ad clusters, providing insight into how often anonymous individuals post sex-seeking ads and where they meet for encounters.

Submitted by Magou on
Description

TOA identifies clusters of patients arriving at a hospital ED within a short temporal interval. Past implementations have been restricted to records of patients with a specific type of complaint. The Florida Department of Health uses TOA at the county level for multiple subsyndromes (1). In 2011, NC DPH, CCHI and CDC collaborated to enhance and evaluate this capability for NC DETECT, using NC DETECT data in BioSense 1.0 (2). After this successful evaluation based on exposure complaints, discussions were held to determine the best approach for implementing the new algorithm in the NC DETECT production environment. NC DPH was particularly interested in determining whether TOA could be used to identify clusters of ED visits not filtered by any syndrome or sub-syndrome. In other words, can TOA detect a cluster of ED visits relating to a public health event, even if symptoms from that event are not characterized by a predefined syndrome grouping? Syndromes are continuously added to NC DETECT, but a syndrome cannot be created for every potential event of public health concern. This TOA approach is the first attempt to address this issue in NC DETECT. The initial goal is to identify clusters of related ED visits whose keywords, signs and/or symptoms are NOT all expressed by a traditional syndrome, e.g. rash, gastrointestinal, and flu-like illnesses. The goal instead is to identify clusters resulting from specific events or exposures regardless of how patients present, since such event concepts are too numerous to pre-classify.
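
The core idea of grouping arrivals by time can be sketched in Python as below. This is only a simplified illustration: it flags any sliding window that contains at least a fixed number of visits, whereas the TOA method used in NC DETECT is not specified in this abstract. The window length and visit threshold are hypothetical parameters.

```python
from datetime import datetime, timedelta


def arrival_clusters(arrivals, window=timedelta(minutes=30), min_visits=5):
    """Return candidate clusters as (first_arrival, last_arrival, n_visits).

    `arrivals` is a list of datetime objects for one hospital, with no
    syndrome filter applied. A window is reported whenever it holds at
    least `min_visits` arrivals; overlapping windows are not merged.
    """
    times = sorted(arrivals)
    clusters = []
    start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it spans at most `window`.
        while times[end] - times[start] > window:
            start += 1
        n_visits = end - start + 1
        if n_visits >= min_visits:
            clusters.append((times[start], times[end], n_visits))
    return clusters
```

In practice a flagged window would still need review of the chief complaints it contains to decide whether the visits are related.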

Objective:

To describe a collaboration among the Johns Hopkins Applied Physics Laboratory (JHU APL), the North Carolina Division of Public Health (NC DPH), and the UNC Department of Emergency Medicine Carolina Center for Health Informatics (CCHI) to implement time-of-arrival analysis (TOA) for hospital emergency department (ED) data in NC DETECT to identify clusters of ED visits for which there is no pre-defined syndrome or sub-syndrome.

 

Submitted by Magou on
Description

Irregularly shaped spatial disease clusters occur commonly in epidemiological studies, but their geographic delineation is poorly defined. Most current spatial scan software displays only one of the many possible cluster solutions with different shapes, from the most compact round cluster to the most irregularly shaped one, corresponding to varying degrees of penalization imposed on the freedom of shape. Even when a fairly complete set of solutions is available, the choice of the most appropriate parameter setting is left to the practitioner, whose decision is often subjective.

 

Objective

We propose a novel approach to the delineation of irregularly shaped disease clusters, treating it as a multi-objective optimization problem. We present a new insight into the geographic meaning of the cluster solution set, providing a quantitative approach to the problem of selecting the most appropriate solution among the many possible ones.

Submitted by elamb on
Description

The use of syndromic surveillance in Tulsa County began as an attempt to identify symptoms associated with Category A agents, namely anthrax. The underlying premise for adopting the system was the hope that an astute clinician, upon observing clusters of cases exhibiting certain symptoms, would rapidly notify the local health department so that an epidemiological investigation could be initiated. The system is also designed to send spatial and temporal alerts when cases of pre-defined syndromes are observed. Since 2002, when the system was first implemented, Tulsa Health Department has looked for other ways to integrate syndromic surveillance into its daily operations, and to expand its focus from an exclusive bioterrorism tool to one that is broader in scope. One such way has been to use the system to identify other syndromes and conditions. Collected emergency data have therefore been used to identify occurrences of animal bites, mental conditions, and other conditions. This paper addresses the use of syndromic surveillance for the identification of heat-related illnesses during the hot Oklahoma summer months.

 

Objective

This paper describes the application of syndromic surveillance methodologies to identify non-bioterrorism syndromes, particularly the incidence of heat-related syndromes during the hot Oklahoma summer months.

Submitted by elamb on
Description

Safe drinking water is essential for all communities. Intentional or unintentional contamination of drinking water requires water utilities and local public health to act quickly. The Water Security (WS) initiative of the U.S. Environmental Protection Agency is a multi-faceted approach involving water utilities and local public health officials (LPH) to identify, communicate, contain, and mitigate a drinking water contamination event. Components of WS include online water quality monitoring, enhanced security monitoring, consumer complaint surveillance, and innovative uses of public health surveillance data streams. LPH already use multiple surveillance data systems to recognize disease events in a timely manner. However, few of these systems are integrated or specifically designed for detection of drinking water contamination incidents.

 

Objective

This poster describes the integration of public health surveillance data as a component of an early warning system for detection of a drinking water contamination incident.

Submitted by elamb on
Description

In this study, we compare two methods of generating grid points to enable efficient geographic cluster detection when the original geographical data are prohibitively numerous. One method generates uniform grid points, and the other employs quad trees to generate non-uniform grid points. In both our simulated experiment and our analysis of real data, the spatial scan approach to cluster detection produces different results under the two grid generation schemes. Generally speaking, the quad tree scheme is more sensitive than the uniform scheme in detecting high-resolution spatial clusters. The quad tree grid point scheme may be a useful alternative to the uniform (and other) grid point generation schemes when it is important to set up a surveillance system sensitive to clusters at unspecified spatial resolutions. The quad tree grid scheme may also be useful in a number of other geographic surveillance applications.
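
The difference between the two schemes can be illustrated with a minimal Python sketch. It assumes that grid points are cell centers and that a quad tree cell is split whenever it holds more than a fixed number of data points; the abstract does not give these details, and the function names and the `capacity` and `max_depth` parameters are hypothetical.

```python
def uniform_grid(xmin, ymin, xmax, ymax, n):
    """Centers of an n-by-n uniform grid over the bounding box."""
    dx = (xmax - xmin) / n
    dy = (ymax - ymin) / n
    return [(xmin + (i + 0.5) * dx, ymin + (j + 0.5) * dy)
            for i in range(n) for j in range(n)]


def quadtree_grid(points, xmin, ymin, xmax, ymax, capacity=4, max_depth=8):
    """Centers of quad tree leaf cells; dense areas get finer cells.

    A cell is split into four quadrants while it contains more than
    `capacity` points and `max_depth` is not exhausted. Points exactly on
    the upper or right edge of the box are ignored in this sketch.
    """
    inside = [(x, y) for x, y in points
              if xmin <= x < xmax and ymin <= y < ymax]
    if len(inside) <= capacity or max_depth == 0:
        return [((xmin + xmax) / 2, (ymin + ymax) / 2)]
    xmid = (xmin + xmax) / 2
    ymid = (ymin + ymax) / 2
    quadrants = [
        (xmin, ymin, xmid, ymid), (xmid, ymin, xmax, ymid),
        (xmin, ymid, xmid, ymax), (xmid, ymid, xmax, ymax),
    ]
    centers = []
    for x0, y0, x1, y1 in quadrants:
        centers += quadtree_grid(inside, x0, y0, x1, y1,
                                 capacity, max_depth - 1)
    return centers
```

Because the quad tree places more grid points where the data are dense, scan windows centered on those points can resolve small clusters that a coarse uniform grid would average over.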

Submitted by elamb on
Description

Although the Electronic Surveillance System for the Early Notification of Community Based Epidemics (ESSENCE) provides tools to detect a significant alert regarding an unusual public health event, combining that information with other surveillance data, such as 911 calls, school absenteeism and poison control records, has proved to be more sensitive in detecting an outbreak. On Monday, June 16, the Florida Poison Information Network, which takes after-hours and weekend calls for the Miami-Dade County Health Department (MDCHD), contacted the Office of Epidemiology and Disease Control about five homeless persons who visited the same hospital simultaneously with gastrointestinal symptoms on Saturday, June 14. Poison control staff asked MDCHD to investigate further to determine whether it was an outbreak.

 

Objective

To illustrate how MDCHD used ESSENCE to track a gastrointestinal outbreak in a homeless shelter.

Submitted by elamb on
Description

There has been much recent interest in using disease signatures to better recognize disease outbreaks. Conversely, the metrics used to describe these signatures can also be used to better characterize the outbreaks. Recent work at the New York City Department of Health has shown the ability to identify characteristic age-specific patterns during influenza outbreaks. One issue that remains is how to implement a search for such patterns using prospective outbreak detection tools such as SaTScan.

A potential approach to this problem arises from another currently active research area: the simultaneous use of multiple data streams. One form of this is to disaggregate a data stream with respect to a third variable such as age. Two drawbacks to this approach are that the categories used to make the streams have to be defined a priori and that relationships between the streams cannot be exploited. Furthermore, the resulting description is less rich, as it describes outbreaks in a few non-overlapping age-specific streams. It would be desirable to look for age-specific patterns with the age groupings implicitly defined.

 

Objective

This paper presents an implementation of a citywide SaTScan analysis that uses age as a one-dimensional spatial variable. The resulting clusters identify age-specific clusters of respiratory and fever/flu syndromes in the New York City Emergency Department Data.
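
Treating age as a one-dimensional spatial variable means that candidate clusters are contiguous age intervals rather than circles on a map. The Python sketch below illustrates this with a Poisson likelihood ratio of the kind used in spatial scan statistics. It is not SaTScan itself, and the function names, the use of baseline visit counts to derive expected cases, and the `max_width` limit are assumptions made for illustration.

```python
import math


def poisson_llr(cases, expected, total_cases):
    """Log-likelihood ratio for a window with an excess of cases."""
    if expected <= 0 or cases <= expected:
        return 0.0
    llr = cases * math.log(cases / expected)
    if total_cases > cases:
        outside = total_cases - cases
        llr += outside * math.log(outside / (total_cases - expected))
    return llr


def best_age_window(cases, baseline, max_width=20):
    """Find the contiguous age interval with the highest score.

    cases[a] and baseline[a] are counts for age a (in years). Expected
    cases in a window are the citywide total scaled by the window's share
    of the baseline. Returns (score, (youngest_age, oldest_age)).
    """
    total_cases = sum(cases)
    total_baseline = sum(baseline)
    best_score, best_window = 0.0, None
    for lo in range(len(cases)):
        window_cases = 0
        window_baseline = 0
        for hi in range(lo, min(lo + max_width, len(cases))):
            window_cases += cases[hi]
            window_baseline += baseline[hi]
            expected = total_cases * window_baseline / total_baseline
            score = poisson_llr(window_cases, expected, total_cases)
            if score > best_score:
                best_score, best_window = score, (lo, hi)
    return best_score, best_window
```

The age groupings are defined implicitly by the best-scoring interval, so no age categories need to be fixed in advance. Significance testing by Monte Carlo replication is omitted from this sketch.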

Submitted by elamb on