A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
A group of researchers at Facebook investigated whether computational language models inherently contain knowledge from the source they were trained on.[1] To test this hypothesis, they conducted preliminary experiments using the FEVER dataset, which "consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from".[2]
The typical fact-checking pipeline involves storing an external knowledge base, retrieving evidence from it that is relevant to the given claim, and then verifying the claim based on the retrieved evidence. The authors replace this traditional pipeline with a language model, specifically BERT. This approach is computationally cheaper, utilizes widely available technology, and increases the versatility of language models.
Exploratory experiments involved selecting random claims from the FEVER dataset and masking them. For example, the claim "Thomas Jefferson founded the University of Virginia after retiring" would be masked as "Thomas Jefferson founded the University of <MASK> after retiring". The masked claim was given to the language model, which predicted the masked word, and humans then judged each prediction as either supporting or refuting the claim. On a sample of 50 claims, this yielded an average accuracy of 55%, encouraging the researchers to try a computational approach. The full experiments used named-entity recognition (NER) to identify the last named entity in a claim; this is based on the observation that factuality hinges upon the correctness of the two entities and the relationship between them, not on how the claim is phrased. That entity was then masked, the claim was passed to the BERT language model to predict the missing word, and the result was automatically classified as support, refute, or not enough information (NEI).
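The masking and classification steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the entity list stands in for the output of an NER system, `predict` is a hypothetical callable wrapping a masked language model, and the exact-match decision rule is an assumption made only for this example.

```python
from typing import Callable, List, Optional, Tuple

MASK = "<MASK>"


def mask_last_entity(claim: str, entities: List[str]) -> Tuple[str, Optional[str]]:
    """Replace the entity that occurs last in the claim with MASK.

    `entities` is assumed to come from an NER step (hypothetical here).
    Returns the masked claim and the entity that was removed.
    """
    best, best_pos = None, -1
    for ent in entities:
        pos = claim.rfind(ent)
        if pos > best_pos:
            best, best_pos = ent, pos
    if best is None:
        return claim, None
    masked = claim[:best_pos] + MASK + claim[best_pos + len(best):]
    return masked, best


def classify(claim: str, entities: List[str], predict: Callable[[str], str]) -> str:
    """Label a claim by comparing the model's guess with the masked entity.

    Assumption: an exact (case-insensitive) match means support, any other
    prediction means refute, and a claim without entities is NEI.
    """
    masked, gold = mask_last_entity(claim, entities)
    if gold is None:
        return "NEI"
    guess = predict(masked)
    return "SUPPORTS" if guess.strip().lower() == gold.lower() else "REFUTES"


# Stub predictor standing in for a masked language model such as BERT.
def stub_predict(masked_claim: str) -> str:
    return "Virginia"


claim = "Thomas Jefferson founded the University of Virginia after retiring"
label = classify(claim, ["Thomas Jefferson", "Virginia"], stub_predict)
```

In practice `stub_predict` would be replaced by a call to a real masked language model; the stub only keeps the example self-contained.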
These experiments yielded an accuracy of 49%, which is comparable to the FEVER baseline of 48.8% (although it falls short of the traditional state-of-the-art model's 68.21%). Additionally, the average F1 score of 0.58 for identifying the support claim indicates the model was unable to distinguish between the refute and NEI classes. An analysis of the predicted tokens also reveals some insights about the nature of using a language model for fact-checking: for a claim such as "Tim Roth was born in <MASK>", the model should have predicted 1961 but instead predicted London. BERT is trained on Wikipedia, and is therefore subject to its stylistic patterns (on Wikipedia, dates of birth are typically found in parentheses, whereas locations are likely to be presented in a claim format). This indicates that the pretraining of the model determines the way in which it should be 'queried' for information. While their approach is not comparable to the state of the art, the researchers conclude that it has strong potential for improvement, and can lead to stronger and more efficient solutions for generating evidence and masking claims.
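For readers unfamiliar with the metric, a per-class F1 score can be computed as 2·TP / (2·TP + FP + FN). The sketch below shows the calculation on a handful of made-up labels; the data and the function name `per_class_f1` are illustrative only and are not taken from the paper.

```python
from typing import Dict, Sequence


def per_class_f1(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Return the F1 score of each label, F1 = 2*TP / (2*TP + FP + FN)."""
    scores: Dict[str, float] = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Made-up labels: the classifier never outputs NEI, so that class scores 0.
gold = ["SUPPORTS", "SUPPORTS", "REFUTES", "NEI"]
pred = ["SUPPORTS", "REFUTES", "REFUTES", "REFUTES"]
scores = per_class_f1(gold, pred)
```

A model that confuses two classes, as in this toy data, can still reach a moderate score on the remaining class while scoring poorly on the confused ones.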
COVID-19 research in Wikipedia[3] by Giovanni Colavizza from the University of Amsterdam (available as a pre-print on bioRxiv) investigates how editors on Wikipedia find, select, and integrate scientific information on COVID-19 into Wikipedia articles. Given the surge of new scientific publications on COVID-19 (more than 20,000 new scientific articles on the topic have been published since the beginning of 2020), how do editors keep up with the amount of information, while at the same time ensuring high quality?
For this, the author assembles a corpus representing research on COVID-19 from several publicly available resources such as Pubmed, bioRxiv, WHO, etc., comprising more than 60,000 publications in total. To determine whether these publications have been integrated into Wikipedia, the author uses data from Altmetric which matches citations in Wikipedia articles with known identifiers of publications such as DOI.
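Identifier-based matching of this kind can be sketched as below. This is a simplified illustration, not Altmetric's actual procedure: it assumes both sides are available as plain lists of DOI strings, the DOIs shown are made up, and the helper names are hypothetical. It relies only on the fact that DOIs are case-insensitive and are often written with a `doi:` or `https://doi.org/` prefix.

```python
import re
from typing import Iterable

# Common ways a DOI is prefixed in citations.
_DOI_PREFIX = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def normalize_doi(raw: str) -> str:
    """Strip URL or scheme prefixes and lower-case the DOI."""
    return _DOI_PREFIX.sub("", raw.strip()).lower()


def cited_fraction(corpus_dois: Iterable[str], wikipedia_dois: Iterable[str]) -> float:
    """Share of corpus publications whose DOI appears among Wikipedia citations."""
    corpus = {normalize_doi(d) for d in corpus_dois}
    cited = {normalize_doi(d) for d in wikipedia_dois}
    if not corpus:
        return 0.0
    return len(corpus & cited) / len(corpus)


# Made-up identifiers for illustration only.
corpus = ["10.1000/A1", "https://doi.org/10.1000/b2", "doi:10.1000/c3", "10.1000/d4"]
wikipedia = ["10.1000/a1", "10.1000/zz"]
fraction = cited_fraction(corpus, wikipedia)
```

Normalizing before comparing matters because the same publication may be cited with differently formatted identifiers in different places.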
Using this approach, the study draws a detailed picture of the editorial work around COVID-19 in Wikipedia. First, editors seem to have been able to cope with the rapidly growing literature. Slightly more than 3% of publications are cited at least once in Wikipedia. Despite a more than 10-fold increase in the number of publications in 2020, this fraction has remained remarkably stable for publications from recent years in comparison to those from, say, 20 years ago. Second, editors are citing a largely representative sample of the literature in terms of topic diversity. Clustering publications into 7 different topics using an LDA topic model reveals that the coverage of topics in Wikipedia reflects the overall imbalance in the scientific literature (with most research on coronaviruses as well as public health and epidemics). Third, editors seem to follow the same inclusion standards for publications in 2020 as before (see also WP:MEDRS), relying on research that is not only visible and impactful (e.g. mentions in news and blog posts) but also appears in peer-reviewed and specialized journals (e.g. The Lancet), and avoiding pre-prints; this is revealed through different regression models.
One of the main limitations of this study is that it only considers the citations to scientific publications in Wikipedia articles. Thus, directions for future work include taking into account the content of Wikipedia articles, studying the edit and discussion histories of the respective pages, and comparing coverage with expert reviews on COVID-19.
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
From the abstract:[4]
"Using 973,940 revisions from 134,337 editors to 4,238 articles, this study examines the dynamics of the English Wikipedia’s response to the coronavirus pandemic through the first five months of 2020 as a 'quantitative portrait' describing the emergent collaborative behavior at three levels of analysis: article revision, editor contributions, and network dynamics. Across multiple data sources, quantitative methods, and levels of analysis, we find four consistent themes characterizing Wikipedia’s unique large-scale, high-tempo, and temporary online collaborations: external events as drivers of activity, spillovers of activity, complex patterns of editor engagement, and the shadows of the future."
From the abstract:[5]
"We study how the coronavirus disease 2019 (COVID-19) pandemic, alongside the severe mobility restrictions that ensued, has impacted information access on Wikipedia [...]. A longitudinal analysis that combines pageview statistics for 12 Wikipedia language editions with mobility reports published by Apple and Google reveals a massive increase in access volume, accompanied by a stark shift in topical interests. Health- and entertainment- related topics are found to have gained, and sports- and transportation- related topics, to have lost attention. Interestingly, while the interest in health-related topics was transient, that in entertainment topics is lingering and even increasing. These changes began at the time when mobility was restricted and are most pronounced for language editions associated with countries, in which the most severe mobility restrictions were implemented, indicating that the interest shift might be caused by people's spending more time at home."
From the abstract:[6]
"Our results show that public attention, quantified as users activity on Reddit and active searches on Wikipedia pages, is mainly driven by media coverage and declines rapidly, while news exposure and COVID-19 incidence remain high. Furthermore, by using an unsupervised, dynamical topic modeling approach, we show that while the attention dedicated to different topics by media and online users are in good accordance, interesting deviations emerge in their temporal patterns."
From the abstract:[7]
"Pandemics, even more than other medical problems, require swift integration of knowledge. When caused by a new virus, understanding the underlying biology may help finding solutions. In a setting where there are a large number of loosely related projects and initiatives, we need common ground, also known as a “commons”. Wikidata, a public knowledge graph aligned with Wikipedia, is such a commons and uses unique identifiers to link knowledge in other knowledge bases [...] we describe the process of aligning resources on the genomes and proteomes of the SARS-CoV-2 virus and related viruses as well as how Shape Expressions can be defined for Wikidata to model the knowledge, helping others studying the SARS-CoV-2 pandemic. How this model can be used to make data between various resources interoperable, is demonstrated by integrating data from NCBI Taxonomy, NCBI Genes, UniProt, and WikiPathways. Based on that model, a set of automated applications or bots were written for regular updates of these sources in Wikidata and added to a platform for automatically running these updates."
From the abstract:[8]
"... SWAT, a system that identifies the salient Wikipedia entities occurring in an input document. SWAT consists of several modules that are able to detect and classify on-the-fly Wikipedia entities as salient or not, based on a large number of syntactic, semantic and latent features properly extracted via a supervised process which has been trained over millions of examples drawn from the New York Times corpus. [...]SWAT improves known solutions over all publicly available datasets. We release SWAT via an API that we describe and comment in the paper in order to ease its use in other software."
See also Online demo
From the abstract:[9]
"... we propose an original framework, based on the Wikipedia Comment corpus, with comment-level abuse annotations of different types. The major contribution concerns the reconstruction of conversations, by comparison to existing corpora, which focus only on isolated messages (i.e. taken out of their conversational context). This large corpus of more than 380k annotated messages opens perspectives for online abuse detection and especially for context-based approaches. We also propose, in addition to this corpus, a complete benchmarking platform to stimulate and fairly compare scientific works around the problem of content abuse detection, trying to avoid the recurring problem of result replication. Finally, we apply two classification methods to our dataset to demonstrate its potential."
From the abstract:[10]
"we introduce the Shinra 5-Language Categorization Dataset (SHINRA-5LDS), a large multi-lingual and multi-labeled set of annotated Wikipedia articles in Japanese, English, French, German, and Farsi using Extended Named Entity (ENE) tag set."
From the abstract:[11]
"Given a Knowledge Graph, a knowledge corpus, and a fact (triple statement), the goal of fact-checking is to decide whether the fact or knowledge is correct or not. Existing approaches extensively used several structural features of the input Knowledge Graph to address the mentioned problem. [...] Our approach considers finding evidence from Wikipedia and structured information from Wikidata, which helps in determining the validity of the input facts. [...] The similarity of input fact with elements of relevant Wikipedia pages has been used as unstructured features. The experiments with a dataset consisting of nine relations of Wikidata has established the advantage of combining unstructured features with structured features for the given task."