In a recent article published in Scientific Reports, researchers explored the use of machine learning (ML) approaches and digital traces from social media to develop and test an early alert indicator and trend forecasting model for pandemic situations in Germany.

Study: Development of an early alert model for pandemic situations in Germany. Image Credit: Corona Borealis Studio/Shutterstock.com


In early 2020, when the first severe acute respiratory syndrome coronavirus type 2 (SARS-CoV-2) outbreak occurred in China, healthcare systems of several countries were not ready to handle the ensuing pandemic. 

Measures to prevent the onward spread of the virus were either not taken or taken too late due to the lack of an early warning system (EWS), which resulted in three million positive cases of coronavirus disease 2019 (COVID-19) worldwide. The unprecedented COVID-19 pandemic raised the urgent need to increase the preparedness of global healthcare systems.

Responding to this, the Artificial Intelligence Tools for Outbreak Detection and Response (AIOLOS), a French-German collaboration, tested several ML modeling approaches to support the development of an EWS utilizing Google Trends and Twitter data on COVID-19 symptoms to forecast up-trends in conventional surveillance data, such as reports from healthcare facilities or public health agencies.

The challenge with such systems is the lack of fully automated, digital data recorded in real time for analysis and prompt countermeasures during a pandemic.

About the study

Thus, in the present study, researchers used digital traces, particularly from Google Trends and Twitter, as a source of COVID-19-associated information, since information spreads faster through these channels than through traditional ones (e.g., newspapers).

They used ontology, text mining, and statistical analysis to create a COVID-19 symptom corpus. Next, they used a log-linear regression model to examine the relationship between digital traces and surveillance data and developed pandemic trend-forecasting Random Forest and LSTM models. 
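As a rough illustration of a log-linear fit, the sketch below regresses the logarithm of case counts on a single digital-trace signal using ordinary least squares. The function name, the variable names, and the numbers are hypothetical, and the exact model specification used in the study is not described here, so this should be read as one possible form of such a model and not as the authors' implementation.

```python
import math

def fit_log_linear(x, y):
    """Least-squares fit of log(y) = intercept + slope * x (y must be positive)."""
    log_y = [math.log(v) for v in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_ly = sum(log_y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (li - mean_ly) for xi, li in zip(x, log_y))
    slope = sxy / sxx
    intercept = mean_ly - slope * mean_x
    return intercept, slope

# Hypothetical weekly search-interest scores and synthetic case counts
# generated from log(cases) = 1.0 + 0.05 * search_interest.
search_interest = [10, 20, 30, 40, 50]
cases = [math.exp(1.0 + 0.05 * s) for s in search_interest]

intercept, slope = fit_log_linear(search_interest, cases)
print(round(intercept, 3), round(slope, 3))
```

A positive slope in such a fit would suggest that case counts grow multiplicatively as the digital signal rises.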

They defined the true-positive rates (TPR), false-positive rates (FPR), and false-negative rates (FNR) of the up-trends in surveillance data in agreement with a previous study by Kogan et al., who used a Bayesian model for anticipating COVID-19 infection up-trends in the United States of America (USA) a week ahead.

To evaluate the trend decomposition, the researchers used the Seasonal and Trend decomposition using Loess (STL) method, where the "STL forecast" function allowed them to extend the time series data from a given interval to a future time point.

Applying this to the training data, which covered a specific period, helped to extrapolate the data to predict the trend component for a future period. They focused on the top 20 symptoms and conducted the STL decomposition on the extrapolated data for each symptom.

Further, they used correlation analysis to compare the extrapolated trend with the trend component extracted from the entire dataset.
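The comparison step can be sketched with a plain Pearson correlation between two trend series. The STL decomposition itself is not reproduced here; the two lists below are hypothetical stand-ins for an extrapolated trend and a trend extracted from the full dataset, and the choice of Pearson's coefficient is an assumption, since the article does not name the correlation measure.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical trend components for one symptom.
extrapolated_trend = [1.0, 2.0, 3.0, 4.0, 5.0]
full_trend = [1.1, 2.0, 2.9, 4.2, 5.1]

r = pearson(extrapolated_trend, full_trend)
print(round(r, 3))
```

A value close to one would indicate that the extrapolated trend tracks the trend obtained from the complete data.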

The researchers also examined whether the frequency of certain COVID-19 symptoms increased in digital sources such as Google Trends and Twitter before similar increases appeared in established surveillance data.

To this end, they examined 168 symptoms from Google Trends and 204 from Twitter and calculated their respective sensitivity, precision, and F1 scores.

Sensitivity measures the proportion of actual positives that are correctly identified, precision measures the proportion of true positives among all positive predictions, and the F1 score is the harmonic mean of sensitivity and precision.
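These three metrics follow directly from counts of true positives, false positives, and false negatives. The sketch below uses hypothetical counts of up-trend events; the function name is illustrative and not taken from the study.

```python
def classification_scores(tp, fp, fn):
    """Return (sensitivity, precision, F1) from event counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = sensitivity + precision
    f1 = 2 * sensitivity * precision / denom if denom else 0.0
    return sensitivity, precision, f1

# Hypothetical example: 6 up-trends detected correctly,
# 2 false alarms, and 4 up-trends missed.
sensitivity, precision, f1 = classification_scores(tp=6, fp=2, fn=4)
print(sensitivity, precision, round(f1, 3))
```

Because F1 is a harmonic mean, it stays low whenever either sensitivity or precision is low.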

The researchers used the hypergeometric test to identify the 20 most significant terms related to the disease on Google Trends and Twitter between February 2020 and February 2022.
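A one-sided hypergeometric test asks how likely it is to see at least the observed overlap between two sets by chance. The sketch below computes that upper-tail probability with the standard library only. The counts are hypothetical, and how the authors defined the population and the draws for each term is not described here, so the framing in the comments is an assumption.

```python
from math import comb

def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for a hypergeometric random variable X."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / total

# Hypothetical setup: 1000 posts in total, 100 of them about the disease,
# 50 posts mention a given term, and 20 of those are disease-related.
p_value = hypergeom_upper_tail(k=20, population=1000, successes=100, draws=50)
print(p_value)
```

With about 5 overlapping posts expected by chance in this example, an overlap of 20 gives a very small P value, so the term would rank as strongly disease-related.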

They also investigated whether combining multiple symptoms using the harmonic mean P-value (HMP) method could improve the accuracy of detecting increases in disease surveillance data.
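The raw harmonic mean P-value is the weighted harmonic mean of the individual P values. The sketch below computes only this raw statistic; any calibration step and the weights the authors may have used are not covered, and the input values are hypothetical.

```python
def harmonic_mean_p(p_values, weights=None):
    """Raw (uncalibrated) weighted harmonic mean of P values."""
    if not p_values:
        raise ValueError("need at least one P value")
    if weights is None:
        weights = [1.0 / len(p_values)] * len(p_values)  # equal weights
    if any(p <= 0 or p > 1 for p in p_values):
        raise ValueError("P values must lie in (0, 1]")
    return sum(weights) / sum(w / p for w, p in zip(weights, p_values))

# Hypothetical per-symptom P values for an up-trend in one week.
combined = harmonic_mean_p([0.5, 0.25])
print(round(combined, 4))
```

The harmonic mean is dominated by the smallest P values, so a few strongly signaling symptoms can drive the combined result.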

Furthermore, the researchers used a sliding window approach involving data analysis within a specific time frame to build an ML classifier to predict future trends in confirmed COVID-19 cases and hospitalizations.

They set the forecast horizon to 14 days ahead. They used a nine-fold time series cross-validation scheme to tune the hyperparameters of the Random Forest and LSTM models during the training procedure. 
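The sliding window and the time series cross-validation can be sketched as two small helpers. The window length, the expanding-window split scheme, and both function names are assumptions made for illustration; the article states only the 14-day horizon and the nine folds. In the study the target would be an up-trend label, whereas the toy example simply reuses the series values.

```python
def sliding_windows(series, window, horizon):
    """Pair each run of `window` values with the value `horizon` steps after it."""
    samples = []
    for start in range(len(series) - window - horizon + 1):
        end = start + window
        samples.append((series[start:end], series[end + horizon - 1]))
    return samples

def expanding_window_splits(n_samples, n_folds):
    """Time-ordered folds: training data always precede the test data."""
    fold_size = n_samples // (n_folds + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested folds")
    splits = []
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        test_end = n_samples if k == n_folds else train_end + fold_size
        splits.append((list(range(train_end)), list(range(train_end, test_end))))
    return splits

# Toy series with a short window and horizon so the result is easy to inspect.
samples = sliding_windows(list(range(10)), window=3, horizon=2)
print(samples[0])   # ([0, 1, 2], 4)

# Nine folds over 100 hypothetical samples.
splits = expanding_window_splits(n_samples=100, n_folds=9)
print(len(splits), splits[0][1][:3])
```

Keeping every test fold strictly after its training data avoids leaking future information into hyperparameter tuning.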

Finally, the team used the Shapley Additive Explanations (SHAP) method to understand the influence of individual Google search and Twitter terms on the LSTM's predictions of up-trends. The analysis involved calculating the mean absolute SHAP values for different predictive symptoms.

They created bar plots in which the symptoms were ranked in descending order of their mean absolute SHAP values.
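The ranking behind such a plot can be sketched without any SHAP library, given a matrix of per-sample SHAP values. The numbers below are made up for illustration and are not results from the study; the function name is hypothetical.

```python
def rank_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value, largest first."""
    n_rows = len(shap_values)
    means = {
        name: sum(abs(row[j]) for row in shap_values) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(means.items(), key=lambda item: item[1], reverse=True)

# Hypothetical SHAP values: one row per prediction, one column per symptom.
features = ["headache", "dry cough", "nausea"]
shap_values = [
    [0.2, -0.5, 0.1],
    [-0.4, 0.3, 0.0],
    [0.3, -0.4, -0.2],
]

ranking = rank_by_mean_abs_shap(shap_values, features)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Taking absolute values before averaging means that a symptom pushing predictions strongly in either direction ranks as influential.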

Symptoms with higher mean absolute SHAP values were considered more influential in predicting up-trends in confirmed COVID-19 cases and hospitalizations. Examples include hypoxemia, headache, muscle pain, dry cough, and nausea.


The researchers identified 162 symptoms related to COVID-19 and their 249 synonyms. Symptoms with adjusted P values below the 5% significance level were considered statistically significant.

They ranked the symptom terms by their frequency of occurrence, which yielded the top five symptom terms in the COVID-19-related literature.

These were "pneumonia," "fever, pyrexia," "cough," "inflammation," and "shortness of breath, dyspnea, breathing difficulty, difficulty breathing, breathlessness, labored respiration." Furthermore, the top 20 symptoms accounted for 61.4% of the total co-occurrences of all identified symptoms.

The researchers found that the STL decomposition algorithm was robust, with correlations between the extrapolated and full-data trends close to one.

High F1 scores for symptoms such as stuffy nose, joint pain, malaise, runny nose, and skin rash indicated a strong association with increases in confirmed cases. Symptoms with low F1 scores included multiple organ failure, rubor, and vomiting. Some symptoms, such as delirium, lethargy, and poor feeding, indicated the severity of COVID-19, including hospitalization and deaths.

Since different symptoms had high F1 scores in Google Trends and Twitter, it is important to consider multiple digital sources when analyzing symptom-level trends.

Overall, certain symptoms observed in digital traces can serve as early warning indicators for COVID-19 and detect the onset of pandemics ahead of classical surveillance data.

The researchers found that Google Trends had an F1 score of 0.5, while Twitter had an F1 score of 0.47 when tracking confirmed cases. Scores were lower for hospitalizations and deaths, at approximately 0.38 or below.

They noted that digital traces were unreliable for predicting deaths, but combining them was a promising way of detecting incident cases and hospitalization.

The LSTM model, using the combination of Google Trends and Twitter, showed better prediction performance, achieving F1 scores of 0.98 and 0.97 for up-trend forecasting of confirmed COVID-19 cases and hospitalizations, respectively, in Germany, with a larger forecast horizon of 14 days. It also predicted down-trends, with F1 scores of 0.91 and 0.96 for confirmed cases and hospitalizations, respectively.


Early alert indicator and trend forecasting models for COVID-19 have been developed previously in other countries. However, since socio-economic and cultural backgrounds vary between countries, the researchers developed an EWS specific to Germany.

The study demonstrated that combining Google Trends and Twitter data enabled accurate forecasting of COVID-19 trends two weeks (14 days) ahead of standard surveillance systems.

In the future, similar systematic tracking of digital traces could complement the assessment of established surveillance data, as well as data and text mining of news articles, to enable a prompt reaction to future pandemic situations that may arise in Germany.
