A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
"Understanding Wikipedia as a Resource for Opportunistic Learning of Computing Concepts" by Martin P. Robillard and Christoph Treude of McGill University and University of Adelaide and published in SIGCSE 2020 examines the utility of Wikipedia articles about computing concepts for novice programmers.[1] The researchers recruit 18 students with varying computer science backgrounds to read Wikipedia articles about computing concepts that are new to them. The authors use a sample of four Wikipedia articles that appear frequently in Stack Overflow posts (Dependency injection; Endianness; Levenshtein distance; Regular expression). Side note: in a sample of 44 million posts on Stack Overflow that the authors process, 360 thousand (0.8%) have a Wikipedia link, pointing to 40 thousand different Wikipedia articles in aggregate. They indicate that this rate of linking to Wikipedia is similar on the Reddit subreddit r/programming as well. The participants are instructed to use a think-aloud method where they talk through what they are doing and thinking as they try to learn about the concept. The authors then analyzed the transcripts from these interviews to determine what themes were consistent across the students.
The researchers identified the following challenges to learning from Wikipedia:
While the authors conclude that linking to more structured learning resources from Stack Overflow and related forums might be beneficial, this research also prompts thought about how Wikipedia itself might become a more effective learning context. For instance, page previews are a clear improvement for readers who are unfamiliar with the concepts mentioned in an article. The other findings emphasize the value of surfacing examples in articles, not relying on mathematical notation alone to explain a concept, and having a clear lede paragraph. Two other thoughts about this research:
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
From the abstract:[2]
"Despite Wikidata’s potential and the notable rise in research activity, the field is still in the early stages of study. Most research is published in conferences, highlighting such immaturity, and provides little empirical evidence of real use cases. Only a few disciplines currently benefit from Wikidata’s applications and do so with a significant gap between research and practice. Studies are dominated by European researchers, mirroring Wikidata’s content distribution and limiting its Worldwide applications."
See also earlier coverage of a related paper by Piscopo et al.: "First literature survey of Wikidata quality research", and the following preprint:
From the abstract:[3]
"[We conduct] a systematic mapping study in order to identify the current topical coverage of existing research [on Wikidata] as well as the white spots which need further investigation. [...] 67 peer-reviewed research from journals and conference proceedings were selected, and classified into meaningful categories. We describe this data set descriptively by showing the publication frequency, the publication venue and the origin of the authors and reveal current research focuses. These especially include aspects concerning data quality, including questions related to language coverage and data integrity. These results indicate a number of future research directions, such as, multilingualism and overcoming language gaps, the impact of plurality on the quality of Wikidata's data, Wikidata's potential in various disciplines, and usability of user interface."
From the abstract:[4]
"[W]e present an approach for extracting numerical and date literal values from Wikipedia abstracts [apparently a reference to the lead section of Wikipedia articles or their first paragraph]. We show that our approach can add 643k additional literal values to DBpedia at a precision of about 95%."
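As a rough illustration of the task (not the authors' method, which the paper describes), extracting such literal values from an article's lead text could begin with simple pattern matching. Everything below, including the sample text and both patterns, is our own toy sketch:

```python
import re

# Naive patterns for the two kinds of literals the paper targets.
# Real extraction (as in the paper) is far more robust than this.
NUMBER = re.compile(r'\d+(?:,\d{3})*(?:\.\d+)?')
DATE = re.compile(r'\b\d{1,2} (?:January|February|March|April|May|June|July|'
                  r'August|September|October|November|December) \d{4}\b')

abstract = ("Mount Everest is Earth's highest mountain, with its summit at "
            "8,848.86 m. Its most recent official height was announced on "
            "8 December 2020.")

print(DATE.findall(abstract))    # ['8 December 2020']
print(NUMBER.findall(abstract))  # ['8,848.86', '8', '2020']
```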
From the abstract and accompanying note:[5]
"A major challenge for bringing Wikidata to its full potential was to provide reliable and powerful services for data sharing and query, and the WMF has chosen to rely on semantic technologies for this purpose. A live SPARQL endpoint, regular RDF dumps, and linked data APIs are now forming the backbone of many uses of Wikidata. We describe this influential use case and its underlying infrastructure [...] The data used in this publication is available in the form of anonymised Wikidata SPARQL query logs. [....] The paper has won the Best Paper Award in the In-Use Track of ISWC 2018."
From the abstract:[6]
" we study the relationship between different Wikidata user roles and the quality of the Wikidata ontology. [...] Our analysis shows that the Wikidata ontology has uneven breadth and depth. We identified two user roles: contributors and leaders. The second category is positively associated to ontology depth, with no significant effect on other features. Further work should investigate other dimensions to define user profiles and their influence on the knowledge graph."
See also comments about the paper by Daniel Mietchen, and earlier coverage of a related paper: "Participation patterns on Wikidata"
From the abstract:[7]
"We ... study the evolution that [Wikidata] editors with different levels of engagement exhibit in their editing behaviour over time. We measure an editor’s engagement in terms of (i) the volume of edits provided by the editor and (ii) their lifespan (i.e. the length of time for which an editor is present at Wikidata). The large-scale longitudinal data analysis that we perform covers Wikidata edits over almost 4 years. We monitor evolution in a session-by-session- and monthly-basis, observing the way the participation, the volume and the diversity of edits done by Wikidata editors change. Using the findings in our exploratory analysis, we define and implement prediction models that use the multiple evolution indicators."
This study[8] found that the migration destinations of historically notable people were limited to a few locations, varying by discipline and opportunities.
This preprint[9] presents a tool for extracting multilingual and gender-balanced parallel corpora at sentence level, with document and gender information.
This study[10] found that on Wikipedia, controversial and high-quality articles differ from others according to metrics quantifying editing and linking behavior.
This preprint[11] presents a new graph-based recurrent retrieval approach to answer multi-hop open-domain questions through the Wikipedia link graph.
From the abstract:[12]
"we compared articles about the same intergroup conflicts (e.g., the Falklands War) in the corresponding language versions of Wikipedia (e.g., the Spanish and English Wikipedia articles about the Falklands War). Study 1 featured a content coding of translated Wikipedia articles by trained raters, which showed that articles systematically presented the ingroup in a more favourable way (e.g., Argentina in the Spanish article and the United Kingdom in the English article) and, in reverse, the outgroup as more immoral and more responsible for the conflict. These findings were replicated and extended in Study 2, which was limited to the lead sections of articles but included considerably more conflicts and many participants instead of a few trained coders. This procedure [identified] a stronger ingroup bias for (1) more recent conflicts and (2) conflicts in which the proportion of ingroup members among the top editors was larger. Finally, a third study ruled out that these effects were driven by translations or the raters’ own nationality. Therefore, this paper is the first to demonstrate ingroup bias in Wikipedia – a finding that is of practical as well as theoretical relevance as we outline in the discussion."
From the abstract:[13]
"[We study] two planetary-scale collaborative systems: GitHub and Wikipedia. By analyzing the activity of over 2 million users on these platforms, we discover that the interplay between group size and productivity exhibits complex, previously-unobserved dynamics: the productivity of smaller groups scales super-linearly with group size, but saturates at larger sizes. This effect is not an artifact of the heterogeneity of productivity: the relation between group size and productivity holds at the individual level. People tend to do more when collaborating with more people. We propose a generative model of individual productivity that captures the non-linearity in collaboration effort. The proposed model is able to explain and predict group work dynamics in GitHub and Wikipedia by capturing their maximally informative behavioral features."
Discuss this story
- That's a lot to digest! Bearian (talk) 14:32, 27 January 2020 (UTC)[reply]
Many technical articles on Wikipedia have the problems identified in the Opportunistic Learning piece. Good general suggestions for improvements. I'm especially active in removing tangential information, fixing terminology issues and improving leads. ~Kvng (talk) 14:40, 2 February 2020 (UTC)[reply]