A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
"What is Trending on Wikipedia? Capturing Trends and Language Biases Across Wikipedia Editions" by Volodymyr Miz, Joëlle Hanna, Nicolas Aspert, Benjamin Ricaud, and Pierre Vandergheynst of EPFL, published at WikiWorkshop as part of The Web Conference 2020, examines what topics trend on Wikipedia (i.e. attract high numbers of pageviews) and how these trending topics vary by language.[1] Specifically, the authors study aggregate pageview data from September - December of 2018 for English, French, and Russian Wikipedia. In the paper, trending topics are defined as clusters of articles that are linked together and all receive a spike in pageviews over a given period of time. Eight high-level topics are identified that encapsulate most of the trending articles (football, sports other than football, politics, movies, music, conflicts, religion, science, and video games). Articles are mapped to these high-level topics through a classifier trained over article extracts in which the labeled data comes from a set of articles that were labeled via heuristics such as the phrase "(album)" being in the article title indicating music.
The authors find a number of topics that span language communities in popularity, as well as topics that are much more locally popular (e.g., specific to the United States or France or Russia). Singular events (e.g., a hurricane that has a specific Wikipedia article) often lead to tens of related pages (e.g., about past hurricanes or scientific descriptions) receiving correlated spikes. This trend has been especially apparent during the current pandemic, as pages adjacent to the main pandemic article, such as those on social distancing, past pandemics, or regions around the world, have also received large spikes in traffic. The authors discuss how these trending topics relate to the motivations of Wikipedia readers, geography, culture, and artifacts such as featured articles or Google doodles.
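To make the definition of a trending topic concrete (a cluster of linked articles that all receive a pageview spike), here is a rough Python sketch. It is not the method used in the paper: the spike rule (maximum exceeds `factor` times the median), the function names, and the treatment of the whole series as a single time window are all simplifying assumptions.

```python
from collections import deque
from statistics import median


def has_spike(views, factor=3.0):
    """Assumed spike rule: the peak exceeds `factor` times the series median."""
    baseline = max(median(views), 1)  # avoid a zero baseline for rarely viewed pages
    return max(views) > factor * baseline


def trending_clusters(pageviews, links, factor=3.0, min_size=2):
    """Group spiking articles into connected components of the link graph.

    pageviews: dict mapping title -> list of daily view counts
    links:     dict mapping title -> iterable of linked titles
    """
    spiking = {title for title, views in pageviews.items() if has_spike(views, factor)}

    # Undirected adjacency restricted to spiking articles.
    neighbours = {title: set() for title in spiking}
    for src, targets in links.items():
        for dst in targets:
            if src in spiking and dst in spiking:
                neighbours[src].add(dst)
                neighbours[dst].add(src)

    clusters, seen = [], set()
    for start in sorted(spiking):
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search over spiking neighbours
            node = queue.popleft()
            component.add(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        if len(component) >= min_size:
            clusters.append(component)
    return clusters
```

Under these assumptions, a hurricane article and the related pages it links to would form one cluster if they spike together, while an isolated spiking page would be dropped by `min_size`.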
It is always exciting to see work that explicitly compares language editions of Wikipedia. Highlighting these similarities and differences, as well as developing methods to study Wikipedia across languages, are valuable contributions. While it is interesting to explore differences in interest across languages, these types of analyses can also help recommend which types of articles are worth translating into a given language, and will hopefully be developed further with such applications in mind. The authors note that Wikidata shows promise for improving their approach to labeling articles with topics. It should be noted that Wikimedia has also recently developed approaches to identifying the topics associated with an article that have greater coverage (i.e. ~60 topics instead of 8) and are based on the WikiProject taxonomy. This has been expanded experimentally to Wikidata as well.
For more details, see:
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
From the abstract:[2]
"we introduce a novel testbed for natural language generation: automatically bringing inappropriately subjective text into a neutral point of view ("neutralizing" biased text). We also offer the first parallel corpus of biased language. The corpus contains 180,000 sentence pairs and originates from Wikipedia edits that removed various framings, presuppositions, and attitudes from biased sentences. Last, we propose two strong encoder-decoder [algorithm] baselines for the task [of 'neutralizing' biased text]."
Among the example changes the authors quote from their corpus:
Original | New (NPOV) version |
---|---|
A new downtown is being developed which will bring back... | A new downtown is being developed which [...] its promoters hope will bring back... |
Jewish forces overcome Arab militants. | Jewish forces overcome Arab forces. |
A lead programmer usually spends his career mired in obscurity. | Lead programmers often spend their careers mired in obscurity. |
As example output for one of their algorithms, the authors present the change from
to
The authors construct an RNN (Recurrent Neural Network) able to detect biased statements from the English Wikipedia with 91.7% precision, and "release the largest corpus of statements annotated for biased language". From the paper:[3]
"We extract all statements from the entire revision history of the English Wikipedia, for those revisions that contain the POV tag in the comments. This leaves us with 1,226,959 revisions. We compare each revision with the previous revision of the same article and filter revisions where only a single statement has been modified.[...] The final resulting dataset leaves us with 280,538 pov-tagged statements. [...] we [then] ask workers to identify statements containing phrasing bias in the Figure Eight platform. Since labeling the full pov-tagged dataset would be too expensive, we take a random sample of 5000 statement from the dataset. [...] we present our approach for classifying biased language in Wikipedia statements [using] Recurrent Neural Networks (RNNs) with gated recurrent units (GRU)."
From the abstract:[4]
"This thesis makes a threefold contribution: (i.) it evaluates two previously uncovered aspects of the quality of Wikidata, i.e. provenance and its ontology; (ii.) it is the first to investigate the effects of algorithmic contributions, i.e. bots, on Wikidata quality; (iii.) it looks at emerging editor activity patterns in Wikidata and their effects on outcome quality. Our findings show that bots are important for the quality of the knowledge graph, albeit their work needs to be continuously controlled since they are potentially able to introduce different sorts of errors at a large scale. Regarding human editors, a more diverse user pool—in terms of tenure and focus of activity—seems to be associated to higher quality. Finally, two roles emerge from the editing patterns of Wikidata users, leaders and contributors. Leaders [...] are also more involved in the maintenance of the Wikidata schema, their activity being positively related to the growth of its taxonomy."
See also earlier coverage of a related paper coauthored by the same author: "First literature survey of Wikidata quality research"
From the abstract:[5]
"The quantitative evaluation of quotations in the Russian Wiktionary was performed using the developed Wiktionary parser. It was found that the number of quotations in the dictionary is growing fast (51.5 thousands in 2011, 62 thousands in 2012). [...] A histogram of distribution of quotations of literary works written in different years was built. It was made an attempt to explain the characteristics of the histogram by associating it with the years of the most popular and cited (in the Russian Wiktionary) writers of the nineteenth century. It was found that more than one-third of all the quotations (the example sentences) contained in the Russian Wiktionary are taken by the editors of a Wiktionary entry from the Russian National Corpus."
The top authors quoted are:
1. Chekhov
2. Tolstoy
3. Pushkin
4. Dostoyevsky
5. Turgenev
From the abstract:[6]
"...we perform a literature review trying to answer three main questions: (i) What is disinformation? (ii) What are the most popular mechanisms to spread online disinformation? and (iii) Which are the mechanisms that are currently being used to fight against disinformation?. In all these three questions we take first a general approach, considering studies from different areas such as journalism and communications, sociology, philosophy, information and political sciences. And comparing those studies with the current situation on the Wikipedia ecosystem. We conclude that in order to keep Wikipedia as free as possible from disinformation, it is necessary to help patrollers to early detect disinformation and assess the credibility of external sources."
This paper by four Google Brain researchers describes automated methods for estimating the factual accuracy of automatic Wikipedia text summaries, using end-to-end fact extraction models trained on Wikipedia and Wikidata.[7]
From the abstract:[8]
"Wikipedia contains articles on many important news events, with page revisions providing near real-time coverage of the developments in the event. The set of revisions for a particular page is therefore useful to establish a timeline of the event itself and the availability of information about the event at a given moment. However, many revisions are not particularly relevant for such goals, for example spelling corrections or wikification edits. The current research aims [...] to identify which revisions are relevant for the description of an event. In a case study a set of revisions for a recent news event is manually annotated, and the annotations are used to train a Long Short Term Memory classifier for 11 revision categories. The classifier has a validation accuracy of around 0.69 which outperforms recent research on this task, although some overfitting is present in the case study data."
From the abstract and acknowledgements:[9]
"The concrete innovation of the DBpedia FlexiFusion workflow, leveraging the novel DBpedia PreFusion dataset, which we present in this paper, is to massively cut down the engineering workload to apply any of the [existing DBPedia quality improvement] methods available in shorter time and also make it easier to produce customized knowledge graphs or DBpedias.[...] our main use case in this paper is the generation of richer, language-specific DBpedias for the 20+ DBpedia chapters, which we demonstrate on the Catalan DBpedia. In this paper, we define a set of quality metrics and evaluate them for Wikidata and DBpedia datasets of several language chapters. Moreover, we show that an implementation of FlexiFusion, performed on the proposed PreFusion dataset, increases data size, richness as well as quality in comparison to the source datasets. [...] The work is in preparation to the start of the WMF-funded GlobalFactSync project (https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE)."
From the abstract and paper:[10]
"we propose a method for incorporating world knowledge (linked entities and fine-grained entity types) into a neural question generation model. This world knowledge helps to encode additional information related to the entities present in the passage required to generate human-like questions. [...] . In our experiments, we use Wikipedia as the knowledge base for which to link entities. This specific task (also known as Wikification (Cheng and Roth, 2013)) is the task of identifying concepts and entities in text and disambiguation them into the most specific corresponding Wikipedia pages."
From the (English version of the) abstract:[11]
"By analyzing the arguments in a corpus of discussion pages for articles on highly controversial subjects (genetically modified organisms, September 11, etc.), the authors show that [disagreements between Wikipedia editors] are partly fed by the existence on Wikipedia of concurrent 'epistemic regimes'. These epistemic regimes (encyclopedic, scientific, scientistic, wikipedist, critical, and doxic) correspond to divergent notions of validity and the accepted methods for producing valid information."
From the abstract:[12]
"... we describe ORES: an algorithmic scoring service that supports real-time scoring of wiki edits using multiple independent classifiers trained on different datasets. ORES decouples several activities that have typically all been performed by engineers: choosing or curating training data, building models to serve predictions, auditing predictions, and developing interfaces or automated agents that act on those predictions. This meta-algorithmic system was designed to open up socio-technical conversations about algorithmic systems in Wikipedia to a broader set of participants. In this paper, we discuss the theoretical mechanisms of social change ORES enables and detail case studies in participatory machine learning around ORES from the 4 years since its deployment."