Identifying Syndromic Fingerprints in Reason Fields in Emergency Department or Telehealth Records using N-grams for Similarity Analysis

Description

An N-gram is a subsequence of n items from a given sequence, where n is a positive integer (1, 2, …) and the items can be letters or words. N-gram models are widely used in statistical natural language processing [3]. In the syndromic surveillance context, N-grams can be used to cluster or classify natural language data. They can also help in the design of kernels for machine learning algorithms, such as support vector machines, that learn from text data. This work calculates the similarity percentages of emergency department (ED) or telehealth (TH) reasons to syndromic fingerprints using N-grams. We define “reasons similarity” as the percentage of matched N-grams between the reasons field of an ED or TH record and the fingerprint of a syndrome. The fingerprint of a syndrome is a list of frequent N-grams related to that syndrome. It is constructed by collecting a large sample of classified reasons data for a particular syndrome, calculating all of the N-grams for this set, and then selecting the most frequent N-grams to form a profile, or fingerprint. N-gram generation can require extensive processing time, especially for large files, but this issue has been addressed by using parallel computation.
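The fingerprint construction and similarity calculation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes word-level N-grams, whitespace tokenization, and a similarity denominator equal to the number of distinct N-grams in the record; the abstract specifies none of these. The function names (`word_ngrams`, `build_fingerprint`, `reasons_similarity`), the `top_k` cutoff, and the sample reasons are hypothetical and made up for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word-level n-grams of text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_fingerprint(reasons, n, top_k):
    """Count n-grams over all classified reasons and keep the top_k most frequent.

    The top_k cutoff is an assumption; the abstract only says that the
    most frequent N-grams are selected.
    """
    counts = Counter()
    for reason in reasons:
        counts.update(word_ngrams(reason, n))
    return {gram for gram, _ in counts.most_common(top_k)}


def reasons_similarity(reason, fingerprint, n):
    """Percentage of the record's distinct n-grams found in the fingerprint.

    Using the record's n-gram count as the denominator is an assumption.
    """
    grams = set(word_ngrams(reason, n))
    if not grams:
        return 0.0
    return 100.0 * len(grams & fingerprint) / len(grams)


# Made-up reasons standing in for a classified sample of one syndrome.
respiratory_reasons = [
    "cough and fever",
    "fever and cough for three days",
    "shortness of breath and cough",
    "sore throat and fever",
]

fingerprint = build_fingerprint(respiratory_reasons, n=2, top_k=2)
score = reasons_similarity("headache and fever", fingerprint, n=2)
print(sorted(fingerprint), score)
```

Character-level N-grams could be substituted by replacing the tokenization step, which may be more tolerant of the misspellings common in free-text reasons fields. Because each record's N-grams are computed independently, the counting loop in `build_fingerprint` could also be split across worker processes and the partial counts summed.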

Objective

The objective of this work is to identify syndromic fingerprints in reasons for entering an emergency department (ED) or calling telehealth (TH). It also demonstrates that these fingerprints are valuable for classification.

Submitted by elamb on