A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
In the recent research paper "The Most Influential Medical Journals According to Wikipedia: Quantitative Analysis"[1], the authors sought to rank the medical journals most cited in English Wikipedia, also evaluating the number of days between an article's publication and its citation. They analyzed 11,325 medical articles in Wikipedia, which cited 137,889 articles from over 15,000 journals. They found that the top five journals cited were, in order, The Cochrane Database of Systematic Reviews, The New England Journal of Medicine, PLOS One, The BMJ, and JAMA: The Journal of the American Medical Association. This ranking, along with the next 25 journals identified in the study, broadly matched the highest-ranked journals in the Journal Citation Reports and the Scientific Journal Ranking, though Wikipedia's emphasis on reviews and meta-analyses produced some clear differences. While evidence of recentism was identified, all journals that appeared in the study are directly related to medicine. The researchers suggested that similar studies be conducted in other disciplines, especially as "Wikipedia editing increases information literacy" (p. 9) and Wikipedia is more widely used by academics.
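The core of such an analysis is a straightforward citation tally. The sketch below illustrates the general idea with Python's standard library; the article–journal pairings are invented for illustration and are not data from the study.

```python
from collections import Counter

# Hypothetical records of journal citations extracted from Wikipedia
# medical articles, as (wikipedia_article, cited_journal) pairs. The
# journal names are real; the pairings here are made up.
citations = [
    ("Aspirin", "The Cochrane Database of Systematic Reviews"),
    ("Aspirin", "The BMJ"),
    ("Migraine", "The Cochrane Database of Systematic Reviews"),
    ("Migraine", "JAMA"),
    ("Measles", "The New England Journal of Medicine"),
    ("Measles", "The Cochrane Database of Systematic Reviews"),
]

# Rank journals by how often they are cited across the article set.
ranking = Counter(journal for _, journal in citations).most_common()
for journal, count in ranking:
    print(f"{count:>2}  {journal}")
```

The actual study additionally tracked the time between a paper's publication and its first citation in Wikipedia, which this tally does not capture.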
The authors of "Detecting pages to protect in Wikipedia across multiple languages"[2] sought to understand page protection, which is applied over concerns such as vandalism, libel, and edit wars, and to determine whether tools could help automate the process. They studied two datasets: the 0.2% of pages that were protected in April 2016, and a similarly sized random selection of unprotected pages. Their system performed well in predicting candidates for protection and was designed to work across languages. The researchers hope to test the tool on live Wikipedias as the next step in automating page protection. (See also an earlier paper by some of the same authors: "DePP: A System for Detecting Pages to Protect in Wikipedia".)
In "XLORE2: Large-scale Cross-lingual Knowledge Graph Construction and Application,"[3] from the inaugural issue (Winter 2019) of Data Intelligence—a joint venture between MIT and the Chinese Academy of Sciences—the authors explore better methods of mapping concepts between Chinese and English in XLORE2, whose taxonomy "is derived from the Wikipedia category system." Fewer than 5% of the over 100,000 infobox attributes in English Wikipedia are matched in Chinese Wikipedia. The authors discuss methods for improving the quality of typological relationships derived from English Wikipedia. Besides English and Chinese Wikipedia, their knowledge base also uses data from Baidu Baike and Hudong Baike.
In "Automated Detection of Online Abuse"[4], researchers applied machine learning to analyze the behavior of blocked accounts on English Wikipedia. The analysis identified activity patterns of misconduct and modeled an automated moderation system that could monitor unblocked accounts to detect such patterns before human patrollers identify them. The research team's Wikimedia project page includes supplementary materials, including essays on ethical considerations for technological development in this direction.
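The general idea of pattern-based flagging can be sketched with a toy scoring model. The feature names, weights, threshold, and account data below are all invented for illustration; the actual study trained machine-learning models on the edit histories of blocked accounts rather than using hand-set weights.

```python
# Deliberately simplified sketch: score each account on a few behavioral
# features and flag high scorers for human review. All numbers here are
# hypothetical, not taken from the paper.
WEIGHTS = {
    "reverted_fraction": 3.0,   # share of the account's edits reverted
    "warnings_received": 1.5,   # user-talk warnings received
    "burst_editing": 2.0,       # 1.0 if many edits in a short window
}
THRESHOLD = 2.5

def misconduct_score(features: dict) -> float:
    """Weighted sum of behavioral features (higher = more suspicious)."""
    return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)

def flag_accounts(accounts: dict) -> list:
    """Return account names whose score exceeds the review threshold."""
    return [name for name, feats in accounts.items()
            if misconduct_score(feats) > THRESHOLD]

accounts = {
    "GoodFaithEditor": {"reverted_fraction": 0.05, "warnings_received": 0.0},
    "SuspectAccount": {"reverted_fraction": 0.6, "warnings_received": 1.0,
                       "burst_editing": 1.0},
}
print(flag_accounts(accounts))  # → ['SuspectAccount']
```

A learned model replaces the hand-set weights with parameters fitted to labeled (blocked vs. unblocked) accounts, but the monitor-score-flag pipeline is the same.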
See the research events page on Meta-wiki for upcoming conferences and events, including submission deadlines, and the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
From the abstract:[5]
"This paper will look at how cultural heritage organizations [GLAMs] can work with Wikidata, positioning themselves to become a more useful and accessible knowledge resource to the world."
This paper[6] provides a unified framework for extracting topic-specific subgraphs from Wikidata. Such datasets can help develop new methods of knowledge graph processing and relational learning.
From the abstract:[7]
"...we propose a novel method for estimating socioeconomic indicators using open-source, geolocated textual information from Wikipedia articles. We demonstrate that modern NLP techniques can be used to predict community-level asset wealth and education outcomes using nearby geolocated Wikipedia articles. When paired with nightlights satellite imagery, our method outperforms all previously published benchmarks for this prediction task, indicating the potential of Wikipedia to inform both research in the social sciences and future policy decisions."
This paper[8] offers an effective method to build a comprehensive knowledge base by extracting information from Wikipedia infoboxes.
From the abstract:[9]
"The relationship between editors, derived from their sequence of editing activity, results in a directed network structure called the revision network, that potentially holds valuable insights into editing activity. In this paper we create revision networks to assess differences between controversial and non-controversial articles, as labelled by Wikipedia. Originating from complex networks, we apply motif analysis, which determines the under or over-representation of induced sub-structures, in this case triads of editors. We analyse 21,631 Wikipedia articles in this way, and use principal component analysis to consider the relationship between their motif subgraph ratio profiles. Results show that a small number of induced triads play an important role in characterising relationships between editors, with controversial articles having a tendency to cluster. This provides useful insight into editing behaviour [... and also] a potentially useful feature for future prediction of controversial Wikipedia articles."
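The revision network underlying the motif analysis can be illustrated with a small sketch. The construction rule below (each editor points to the previous, different editor whose revision they changed) is one plausible reading of "derived from their sequence of editing activity"; the paper's exact rule may differ, and the edit history is invented.

```python
from collections import Counter

def revision_network(edit_sequence):
    """Build a directed, weighted edge list from a chronological list of
    editor names: each editor gets an edge to the previous (different)
    editor, i.e. the person whose revision they changed."""
    edges = Counter()
    for prev, curr in zip(edit_sequence, edit_sequence[1:]):
        if prev != curr:          # ignore consecutive edits by one editor
            edges[(curr, prev)] += 1
    return edges

# Hypothetical edit history: Alice and Bob repeatedly revising each
# other (a reciprocal dyad typical of disputes), then Carol editing once.
history = ["Alice", "Bob", "Alice", "Bob", "Carol"]
net = revision_network(history)
print(dict(net))
```

Reciprocal edge pairs (A→B together with B→A) already hint at back-and-forth editing; the paper's motif analysis generalises this by counting all induced three-editor sub-structures (triads) and comparing their frequencies against a baseline.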
From the abstract:[10]
"[Wikidata's] data quality is managed and monitored by its community using several quality control mechanisms, recently including formal schemas in the Shape Expressions language. However, larger schemas can be tedious to write, making automatic inference of schemas from a set of exemplary Items an attractive prospect. This thesis investigates this option by updating and adapting the RDF2Graph program to infer schemas from a set of Wikidata Items, and providing a web-based tool which makes this process available to the Wikidata community."
From the abstract:[11]
"This talk explores the role Wikidata [...] might play in the task of assembling biodiversity information into a single, richly annotated and cross linked structure known as the biodiversity knowledge graph [...] Much of the content of Wikispecies is being automatically added to Wikidata [...] Wikidata is a candidate for the 'bibliography of life' [...], a database of all taxonomic literature."
This paper[12] presents a system for detecting newsworthy events around the world, across a wide range of categories, using Twitter and Wikipedia.
Described as "the first comparative analysis of deep-learning models to assess Wikipedia article quality", this paper[13] finds that the most important quality indicators (model features) appear to be whether the article has been reviewed by "quality" editors, the edit counts of its contributors, and the number of its translations.
From the English abstract:[14]
"In this empirical study, we examine a facet of [Wikipedia's] governance: the modalities of construction of two rules related to the citation of sources. We show that these rules are discussed and written by a minority of contributors who are particularly involved. Thus, in Wikipedia, there is no “political class” cut off from the ground. The modalities for the elaboration of the two rules are studied and discussed using Ostrom’s theory of the Commons."
From the abstract:[15]
"We show on both Romanian and Danish wikis that using only the edits and their distribution over time to feed clustering techniques allows building [editors'] profiles with good accuracy and stability. This suggests that light monitoring of newcomers may be sufficient to adapt the interaction with them and to increase the retention rate."
Discuss this story
Those who are curious about influential journals on Wikipedia may want to take a look at WP:JCW, in particular WP:JCW/TAR. Headbomb {t · c · p · b} 20:06, 31 July 2019 (UTC)[reply]
Does anyone at all understand Ashford et al's "Understanding the Signature of Controversial Wikipedia Articles through Motifs in Editor Revision Networks"? I've never seen anything like it. I asked in more detail at WP:JIMBOTALK#Ashford et al. in "Recent research". EllenCT (talk) 11:12, 1 August 2019 (UTC)[reply]