Web Scraping

Description

Timely surveillance of disease outbreak events of public health concern currently requires detailed and time-consuming manual analysis by experts. Recently, in addition to traditional information sources, the World Wide Web has offered a new modality for surveillance, but the massive collection of multilingual texts that must be processed in real time presents an enormous challenge.

Objective

In this paper, we present a summary of the BioCaster system architecture for Web rumour surveillance, the rationale for the choices made in the system design, and an empirical evaluation of topic classification accuracy on a gold-standard set of English and Vietnamese news.
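As an illustration of what a per-language accuracy evaluation against gold-standard labels might look like, here is a minimal Python sketch. It is not the BioCaster evaluation code; the function name `accuracy_by_language`, the label names, and the four toy examples are all assumptions made up for this example.

```python
from collections import defaultdict


def accuracy_by_language(examples):
    """Return {language: accuracy} for (language, gold, predicted) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, gold, predicted in examples:
        total[language] += 1
        if gold == predicted:
            correct[language] += 1
    return {language: correct[language] / total[language] for language in total}


# Toy, invented data: (language code, gold label, predicted label).
sample = [
    ("en", "relevant", "relevant"),
    ("en", "irrelevant", "relevant"),
    ("vi", "relevant", "relevant"),
    ("vi", "irrelevant", "irrelevant"),
]

scores = accuracy_by_language(sample)
print(scores)
```

Reporting accuracy separately per language, rather than pooled, keeps a weaker result in one language from being hidden by a stronger one in the other.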

Submitted by elamb on
Description

Most countries do not report national notifiable disease data in a machine-readable format. Data are often in the form of a file that contains text, tables and graphs summarizing weekly or monthly disease counts. This presents a problem when information is needed for more data-intensive approaches to epidemiology, biosurveillance and public health. While most nations likely store incident data in a machine-readable format, governments are often hesitant to share data openly for a variety of reasons that include technical, political, economic, and motivational issues [1]. A survey conducted by LANL of notifiable disease data reporting in over fifty countries identified only a few websites that report data in a machine-readable format. The majority (>70%) produce reports as PDF files on a regular basis. The bulk of the PDF reports present data in a structured tabular format, while some report in natural language. The structure and format of PDF reports change often, which adds to the complexity of identifying and parsing the desired data. Not all websites publish in English, and typos and clerical errors are common. LANL has developed a tool, Epi Archive, to collect global notifiable disease data automatically and continuously and to make it uniform and readily accessible.
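To make the parsing problem concrete, the following Python sketch turns the text of a small tabular report into uniform, machine-readable records. It is only an illustration under stated assumptions, not Epi Archive's implementation: the table layout, the `RAW_TABLE` sample, the field names, the "-" marker for a missing count, and the country name "Exampleland" are all hypothetical, and the text is assumed to have already been extracted from a PDF by some other step.

```python
import json
import re

# Hypothetical text, as it might look after extraction from a PDF table.
RAW_TABLE = """\
Disease      Week   Cases
Measles      12     1,204
Dengue fever 12     87
Cholera      12     -
"""

# Disease name (may contain spaces), week number, then a count or "-".
ROW_PATTERN = re.compile(
    r"^(?P<disease>.+?)\s+(?P<week>\d+)\s+(?P<cases>[\d,]+|-)\s*$"
)


def parse_rows(raw_text, country):
    """Convert table lines into a list of uniform dictionaries."""
    records = []
    for line in raw_text.splitlines()[1:]:  # skip the header line
        match = ROW_PATTERN.match(line)
        if match is None:
            continue  # ignore lines that do not look like data rows
        cases_text = match.group("cases")
        # Treat "-" as a missing value; strip thousands separators otherwise.
        cases = None if cases_text == "-" else int(cases_text.replace(",", ""))
        records.append(
            {
                "country": country,
                "disease": match.group("disease").strip(),
                "week": int(match.group("week")),
                "cases": cases,
            }
        )
    return records


records = parse_rows(RAW_TABLE, "Exampleland")
print(json.dumps(records, indent=2))
```

Because report layouts change often, a real collector would likely need one such pattern per source and a check that flags reports in which few or no lines match.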

Objective

LANL has built software that automatically collects global notifiable disease data, synthesizes the data, and makes it available to humans and computers within the Biosurveillance Ecosystem (BSVE) as a novel data stream. These data have many applications, including improving the prediction and early warning of disease events.

Submitted by elamb on