Eliciting Disease Data from Wikipedia Articles

Description

Traditional disease surveillance systems suffer from several disadvantages, including reporting lags and antiquated technology, which have driven a shift toward internet-based disease surveillance systems. Recently, Wikipedia access logs (e.g., McIver 2014, Generous 2014) have been shown to be effective in this arena. Much richer Wikipedia data are available, though, including the full article content and edit histories.

We study two different aspects of Wikipedia content as it relates to unfolding disease events: 1) we demonstrate how to capture case, death, and hospitalization counts from the article text, and 2) we show there are valuable time series data present in the tables found in certain articles.
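As a rough illustration of the first aspect, the sketch below pulls counts out of free text with a regular expression. It is not the method used in this work, which the abstract does not specify. The pattern, the `extract_counts` function, the keyword list, and the sample sentence are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical pattern (an assumption, not the authors' method): a number,
# optionally with thousands separators, followed within at most three
# intervening words by a keyword such as "cases" or "deaths".
COUNT_PATTERN = re.compile(
    r"(?P<number>\d{1,3}(?:,\d{3})+|\d+)\s+"
    r"(?:\w+\s+){0,3}?"
    r"(?P<kind>cases?|deaths?|hospitalizations?)",
    re.IGNORECASE,
)

# Map the singular keyword to a normalized plural label.
KIND_LABELS = {
    "case": "cases",
    "death": "deaths",
    "hospitalization": "hospitalizations",
}


def extract_counts(text):
    """Return (kind, count) pairs found in the text, in order of appearance."""
    results = []
    for match in COUNT_PATTERN.finditer(text):
        # Drop thousands separators before converting to an integer.
        count = int(match.group("number").replace(",", ""))
        # Normalize "Cases"/"case" and similar to a single label.
        kind = match.group("kind").lower().rstrip("s")
        results.append((KIND_LABELS[kind], count))
    return results


# Made-up sentence in the style of an outbreak article (not real data).
sample = (
    "As of 1 June, 1,234 suspected cases and 56 deaths had been reported; "
    "78 hospitalizations were recorded."
)
print(extract_counts(sample))
```

A pattern this simple will produce false positives (for example, a year followed by the word "cases"), so a practical system would need more context than a single regular expression provides.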

We argue that Wikipedia data can be used not only for disease surveillance but also as a centralized repository for collecting disease-related data in near real time.

Objective

To improve traditional outbreak surveillance systems by utilizing the content of Wikipedia articles.

Submitted by teresa.hamby@d… on