Lemmas and Named Entities analysis in major media outlets regarding Switzerland and Covid-19


What was in the media in Switzerland at the beginning of the COVID-19 pandemic? What were they talking about, money or ICU beds? Well, if you want to find the answer, it’s in this dataset, waiting to be dug out. Help yourself!

This dataset, generated with an ad-hoc parser and NLP pipeline, reports the frequency of lemmas and named entities in news articles (in German, French, Italian and English) regarding Switzerland and COVID-19.

The analysis of large bodies of grey literature via text mining and computational linguistics is an increasingly common approach to understanding the large-scale trends of specific topics. We used Factiva, a news monitoring and search engine developed and owned by Dow Jones, to gather and download all the news articles published between January 2020 and May 2021 on COVID-19 and Switzerland.

Due to Factiva’s copyright policy, it is not possible to share the original dataset with the exports of the articles’ text; however, we can share the results of our work on the corpus. All the information needed to reproduce the results is provided.

Factiva allows a very granular definition of queries and has access to the full text of articles published by the world's major media outlets. The query was defined as follows (each clause of the query is followed by a line explaining it):

((coronavirus or Wuhan virus or corvid19 or corvid 19 or covid19 or covid 19 or ncov or novel coronavirus or sars) and (atleast3 coronavirus or atleast3 wuhan or atleast3 corvid* or atleast3 covid* or atleast3 ncov or atleast3 novel or atleast3 corona*))

Keywords for covid19; must appear at least 3 times in the text

and ns=(gsars or gout)

Subject is “novel coronaviruses” or “outbreaks and epidemics” and “general news”

and la=X

Language is X (DE, FR, IT, EN)

and rst=tmnb

Restrict to TMNB (major news and business publications)

and wc>300

At least 300 words

and date from 20191001 to 20212005

Date interval (1 October 2019 to 20 May 2021)

and re=SWITZ

Region is Switzerland
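
Since the same query is run once per language, the clauses listed above can be assembled programmatically. The following is only a minimal sketch: the names KEYWORDS, FIXED_CLAUSES, LANGUAGES and build_query are hypothetical, the clause strings are copied verbatim from the query above, and joining them with " and " is an assumption about how the full query string is composed. It does not call Factiva; it only builds the text to paste into the search form.

```python
# Keyword clause, copied verbatim from the query described above.
KEYWORDS = (
    "((coronavirus or Wuhan virus or corvid19 or corvid 19 or covid19 or covid 19 "
    "or ncov or novel coronavirus or sars) and (atleast3 coronavirus or atleast3 wuhan "
    "or atleast3 corvid* or atleast3 covid* or atleast3 ncov or atleast3 novel "
    "or atleast3 corona*))"
)

# Remaining clauses, also verbatim; only the language placeholder varies.
FIXED_CLAUSES = [
    "ns=(gsars or gout)",
    "la={language}",
    "rst=tmnb",
    "wc>300",
    "date from 20191001 to 20212005",
    "re=SWITZ",
]

# Language codes as listed in the description.
LANGUAGES = ["DE", "FR", "IT", "EN"]


def build_query(language: str) -> str:
    """Return the full query text for one language (hypothetical helper)."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    clauses = [KEYWORDS] + [c.format(language=language) for c in FIXED_CLAUSES]
    # Assumption: clauses are chained with the 'and' operator, as in the listing.
    return " and ".join(clauses)


# One query string per language.
queries = {lang: build_query(lang) for lang in LANGUAGES}
```

Keeping the clauses in a list makes it easy to see that the four queries differ only in the la= field.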

Some details that characterize the query are worth specifying.
The query is not limited to articles published by Swiss media; it covers all articles regarding Switzerland. The reason is simple: a Swiss user googling “Schweiz Coronavirus” or “Coronavirus Ticino” can easily find and read articles published by foreign media outlets (for example, German or Italian ones) on that topic. If the objective is to capture and describe the information trends to which people are exposed, this approach is more appropriate than limiting the analysis to articles published by Swiss media.
Factiva’s field “NS” is a descriptor for the content of the article. “gsars” is defined in Factiva’s documentation as “All news on Severe Acute Respiratory Syndrome”, and “gout” as “The widespread occurrence of an infectious disease affecting many people or animals in a given population at the same time”; however, the way these descriptors are assigned to articles is not specified in the documentation.

Finally, the query has been restricted to articles of more than 300 words from major news and business publications. Duplicate detection is performed by Factiva. Given the very large number of articles published on COVID-19, this (admittedly arbitrary) restriction makes it possible to retrieve a corpus that is both meaningful and manageable.

The file metadata.xlsx contains information about the articles retrieved (search strategy and number of articles).

This work is part of the PubliCo research project.

Here’s the dataset:

https://zenodo.org/record/4792778#.YqDsE6hBwdU