
Ngram classifiers

Description

Syndromic surveillance of emergency department (ED) visit data is often based on computer algorithms which assign patient chief complaints (CC) and ICD code data to syndromes. The triage nurse note (NN) has also been used for surveillance. Previously we developed an “NGram” classifier for syndromic surveillance of ED CC in Italian for detection of natural outbreaks and bioterrorism. The classifier is developed from a set of ED visits for which both the ICD diagnosis code and CC are available by measuring the associations of text fragments within the CC (e.g. 3 characters for a “3-gram”) with a syndromic group of ICD codes. We found good correlation between the daily volumes measured by the ICD10 classifier and those estimated by the NGram classifier. However, because the CC was limited to 23 options from a pick list, it might be possible to obtain results as good as or better than those of the NGram method using a simpler probabilistic approach. Also, in addition to the CC, the Italian data included a free-text NN. We might be able to achieve improved performance by applying the NGram method to the NN or to the CC supplemented by the NN.

 

Objective

Our objective was to compare the performance of the NGram CC classifier to two discrete classifiers based on probabilistic associations with the CC pick list items. Also, we wished to determine the performance of the NGram method applied to CC alone, NN alone, and CC plus NN.
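
A probabilistic association with a pick list item can be illustrated by estimating, for each item, the fraction of training visits that the ICD classifier placed in the syndrome. This is a minimal sketch under assumptions: the abstract does not specify how the two discrete classifiers were built, and the function name and the pick list items in the example are hypothetical.

```python
from collections import defaultdict

def train_picklist_probs(visits):
    """Estimate P(syndrome | pick list item) from labelled training visits.

    visits: iterable of (picklist_item, in_syndrome) pairs, where in_syndrome
    is the label assigned by the ICD-based classifier (assumed input format).
    """
    counts = defaultdict(lambda: [0, 0])  # item -> [syndrome visits, all visits]
    for item, in_syndrome in visits:
        counts[item][1] += 1
        if in_syndrome:
            counts[item][0] += 1
    return {item: pos / total for item, (pos, total) in counts.items()}

# Hypothetical toy data; these are not the 23 items of the actual pick list.
training = [
    ("abdominal pain", True),
    ("abdominal pain", True),
    ("abdominal pain", False),
    ("trauma", False),
]
probs = train_picklist_probs(training)
```

With only 23 possible CC values, such a lookup table is small enough to inspect by hand, which may be what makes a discrete approach attractive compared with n-grams.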

Submitted by elamb on
Description

Syndromic surveillance of emergency department (ED) visit data is often based on computer algorithms which assign patient chief complaints (CC) to syndromes. ICD9 code data may also be used to develop visit classifiers for syndromic surveillance, but the ICD9 code is generally not available immediately, which limits its utility. However, ICD9 has two advantages: ICD9 classifiers can be created rapidly and precisely as a subset of existing ICD9 codes, and ICD9 codes are independent of the spoken language. If a classifier based on ICD9 codes could be used to automatically create the code for a chief-complaint assignment algorithm, then CC algorithms could be created and updated more rapidly and with less labor. They could also be created in multiple spoken languages. We developed a method for doing this based on an “ngram” text processing program adapted from business research technology (AT&T Labs). The method applies the ICD9 classifier to a training set of ED visits for which both the CC and ICD9 code are known. A computerized method is used to automatically generate a collection of CC substrings with associated probabilities, and then to generate a CC classifier program. The method includes specialized selection techniques and model pruning to automatically create a compact and efficient classifier.
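
The idea of generating CC substrings with associated probabilities can be sketched as follows. This is an illustrative simplification, not the AT&T-derived program itself: the selection techniques and model pruning are omitted, and the function names and the averaging rule in `score` are assumptions made for the example.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return the set of overlapping character n-grams in a chief complaint."""
    s = text.lower().strip()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def train_ngram_probs(visits, n=3):
    """Estimate P(syndrome | n-gram present) from labelled training visits.

    visits: iterable of (chief_complaint, in_syndrome) pairs, where
    in_syndrome is the label assigned by the ICD9 classifier.
    """
    total = Counter()
    positive = Counter()
    for cc, in_syndrome in visits:
        for gram in char_ngrams(cc, n):
            total[gram] += 1
            if in_syndrome:
                positive[gram] += 1
    return {gram: positive[gram] / total[gram] for gram in total}

def score(cc, probs, n=3):
    """Assumed scoring rule: mean probability over the n-grams seen in training."""
    known = [probs[g] for g in char_ngrams(cc, n) if g in probs]
    return sum(known) / len(known) if known else 0.0

# Hypothetical toy training set for a gastrointestinal syndrome.
training = [("vomiting", True), ("abd pain", True),
            ("chest pain", False), ("cough", False)]
probs = train_ngram_probs(training)
```

Because the training labels come from ICD9 codes rather than from hand-written keyword lists, the same procedure can in principle be run on chief complaints in any language.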

 

Objective

Our objective was to determine how closely the performance of an ngram CC classifier for the gastrointestinal syndrome matched the performance of the ICD9 classifier.

Submitted by elamb on
Description

Previously we used an “N-Gram” classifier for syndromic surveillance of emergency department (ED) chief complaints (CC) in English for bioterrorism. The classifier is trained on a set of ED visits for which both the ICD diagnosis code and CC are available by measuring the associations of text fragments within the CC (e.g. 3 characters for a “3-gram”) with a syndromic group of ICD codes. Because the ICD system is language independent, the technique has the potential advantage of rapid automated deployment in multiple languages. Our objective was to apply the N-Gram method to a training set of Turkish ED data to create a Turkish CC classifier for the respiratory syndrome (RESP) and determine its performance in a test set.

 

Objective

To determine how closely the performance of an ngram CC classifier for the RESP syndrome matched the performance of the ICD9 classifier.

Submitted by elamb on
Description

Previously we developed an “Ngram” classifier for syndromic surveillance of emergency department (ED) chief complaints (CC) in Turkish for bioterrorism. The classifier is developed from a set of ED visits for which both the ICD diagnosis code and CC are available. A computer program calculates the associations of text fragments within the CC (e.g. 3 characters for a “3-gram”) with a syndromic group of ICD codes. The program then generates an algorithm which can be deployed to evaluate chief complaint data in real time. However, the Ngram method differs from most other classifiers in that it assigns a probability that each visit falls within the syndrome rather than ruling the visit “in” or “out” of the syndrome. It is possible to dichotomize visits “in” or “out” using Ngrams by choosing a cut-off sensitivity for the Ngrams used, but this affects the specificity of the method. The effect of this trade-off is best measured by a receiver operating characteristic (ROC) curve.
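
The trade-off described above can be made concrete with a short sketch. It assumes, for illustration only, that the cut-off is applied to each visit's classification probability and that the ICD-based label serves as the reference standard; the function name and the sample values are hypothetical.

```python
def sensitivity_specificity(probabilities, truth, cutoff):
    """Dichotomize visits at `cutoff` and compare with the reference labels."""
    tp = fn = tn = fp = 0
    for p, is_case in zip(probabilities, truth):
        flagged = p >= cutoff  # visit is ruled "in" the syndrome
        if is_case and flagged:
            tp += 1
        elif is_case:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Hypothetical classifier outputs and ICD-based reference labels.
probabilities = [0.9, 0.4, 0.6, 0.2, 0.1]
truth = [True, True, False, False, False]
for cutoff in (0.3, 0.5, 0.7):
    print(cutoff, sensitivity_specificity(probabilities, truth, cutoff))
```

In this toy data, raising the cut-off from 0.3 to 0.7 lowers sensitivity from 1.0 to 0.5 while raising specificity from 2/3 to 1.0.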

 

Objective

Our objective was to determine the sensitivity and specificity of the Ngram CC classifier for individual ED visits. We also wished to compare these results with those obtained when we substituted anglicized characters for 6 problematic Turkish characters.
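
The character substitution could look like the sketch below. The abstract does not name the six characters, so the mapping is an assumption: it covers the six Turkish letters that have no plain ASCII equivalent (ç, ğ, ı, ö, ş, ü) and their uppercase forms. The function name is hypothetical.

```python
# Assumed mapping: the six Turkish letters outside ASCII, lower and upper case.
TURKISH_TO_ASCII = str.maketrans({
    "ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u",
    "Ç": "C", "Ğ": "G", "İ": "I", "Ö": "O", "Ş": "S", "Ü": "U",
})

def anglicize(chief_complaint):
    """Replace Turkish-specific letters with their closest ASCII letters."""
    return chief_complaint.translate(TURKISH_TO_ASCII)

print(anglicize("öksürük"))  # Turkish for "cough"
```

Applying the same substitution to both the training and the test chief complaints keeps the n-grams comparable when data entry is inconsistent about these letters.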

Submitted by elamb on
Description

A number of different methods are currently used to classify patients into syndromic groups based on the patient’s chief complaint (CC). We previously reported results using an “Ngram” text processing program for building classifiers (adapted from business research technology at AT&T Labs). The method applies the ICD9 classifier to a training set of ED visits for which both the CC and ICD9 code are known. A computerized method is used to automatically generate a collection of CC substrings (or Ngrams), with associated probabilities, from the training data. We then generate a CC classifier from the collection of Ngrams and use it to find a classification probability for each patient. Previously, we presented data showing good correlation between daily volumes as measured by the Ngram and ICD9 classifiers.
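
One way to compare daily volumes from the two classifiers is sketched below. It rests on assumptions not stated in the abstract: the Ngram daily volume is taken as the sum of that day's classification probabilities, and agreement is measured with the Pearson correlation coefficient. The function names and the toy records are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def daily_volumes(visits):
    """Aggregate visits into per-day volumes.

    visits: iterable of (day, ngram_probability, icd_in_syndrome) records.
    Returns (days, estimated_volumes, icd_volumes), sorted by day.
    """
    estimated = defaultdict(float)
    observed = defaultdict(int)
    for day, prob, in_syndrome in visits:
        estimated[day] += prob            # assumed: expected count from probabilities
        observed[day] += int(in_syndrome)
    days = sorted(estimated)
    return days, [estimated[d] for d in days], [observed[d] for d in days]

def pearson_r(xs, ys):
    """Pearson correlation; assumes both series have non-zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical records: (day label, Ngram probability, ICD9 label).
records = [
    ("day1", 0.9, True), ("day1", 0.1, False),
    ("day2", 0.8, True), ("day2", 0.7, True), ("day2", 0.5, False),
    ("day3", 0.2, False),
]
days, ngram_volume, icd_volume = daily_volumes(records)
r = pearson_r(ngram_volume, icd_volume)
```

A high correlation of daily volumes does not by itself show that individual visits are classified correctly, which is why visit-level sensitivity and specificity are examined separately.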

 

Objective

Our objective was to determine the optimized values for the sensitivity and specificity of the Ngram CC classifier for individual visits using a ROC curve analysis. Points on the ROC curve correspond to different classification probability cutoffs.
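
The correspondence between ROC points and probability cutoffs can be sketched as follows. The abstract does not say which criterion defines the "optimized" point; Youden's J (sensitivity + specificity - 1) is used here purely as an assumed example, and the function names and sample values are hypothetical.

```python
def roc_points(probabilities, truth):
    """Return (cutoff, sensitivity, specificity) for each distinct cutoff.

    Assumes the reference labels contain at least one case and one non-case.
    """
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    points = []
    for cutoff in sorted(set(probabilities), reverse=True):
        tp = sum(1 for p, t in zip(probabilities, truth) if t and p >= cutoff)
        fp = sum(1 for p, t in zip(probabilities, truth) if not t and p >= cutoff)
        points.append((cutoff, tp / n_pos, 1 - fp / n_neg))
    return points

def best_by_youden(points):
    """Pick the point maximizing sensitivity + specificity - 1 (assumed criterion)."""
    return max(points, key=lambda pt: pt[1] + pt[2] - 1)

# Hypothetical classification probabilities and ICD9-based reference labels.
probabilities = [0.9, 0.4, 0.6, 0.2, 0.1]
truth = [True, True, False, False, False]
points = roc_points(probabilities, truth)
best_cutoff, best_sens, best_spec = best_by_youden(points)
```

Other criteria, such as fixing a minimum acceptable sensitivity, would select a different point on the same curve.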

Submitted by elamb on
Description

 

Syndromic surveillance of emergency department (ED) visit data is often based on computerized classifiers which assign patient chief complaints (CC) to syndromes. These classifiers may need to be updated periodically to account for changes over time in the way the CC is recorded or because of the addition of new data sources. Little information is available as to whether more frequent updates would actually improve classifier performance significantly. It can be burdensome to update classifiers which are developed and maintained manually. We had available to us an automated method for creating classifiers that allowed us to address this question more easily. The “Ngram” method, described previously, creates a CC classifier automatically based on a training set of patient visits for which both the CC and ICD9 are available. This method measures the associations of text fragments within the CC (e.g. 3 characters for a “3-gram”) with a syndromic group of ICD9 codes. It then automatically creates a new CC classifier based on these associations. The CC classifier thus created can then be deployed for daily syndromic surveillance.

Objective

Our objective was to determine if performance of the Ngram classifier for the GI syndrome was improved significantly by updating the classifier more frequently.

Submitted by elamb on