The Signpost

Recent research

Cross-language editors, election predictions, vandalism experiments

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

Cohort of cross-language Wikipedia editors analyzed

Network graph of the cross-language Wikipedia edits analyzed in the study.
The same network, with the node for the English Wikipedia removed.

Analyzing edits to the then 46 largest Wikipedias between July 9 and August 8, 2013, a study[1] identified a set of about 8,000 contributors (labeled multilingual) with a global user account who edited more than one of these language versions (excluding Simple English, which was treated separately) in that time frame. It tested five hypotheses about cross-language editing and editors and looked, for instance, at the proportion of contributions that each of these Wikipedias receives from multilingual editors versus contributions from those editing only one language version. The research found that Esperanto and Malay stand out with a high proportion of contributions from multilinguals and that, at the other end, Japanese has few contributions from multilinguals. Overall, in terms of edits per user, multilingual users made more than twice as many contributions to the study corpus as monolinguals did; they often work on the same topics across languages; and in any given language, they frequently edit articles not edited by monolinguals during the one-month period analyzed. They thus serve a bridging function between languages.
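The classification described above (a user counts as multilingual if they edited more than one language version, and each language version is then characterized by the share of its edits coming from such users) can be sketched in a few lines. This is only an illustration, not the study's code: the function name `multilingual_share`, the user names and the toy edit list are all invented here.

```python
from collections import defaultdict

def multilingual_share(edits):
    """edits: list of (user, language) pairs, one pair per edit.

    Returns (multilingual_users, share), where share maps each language
    to the fraction of its edits made by users active in 2+ languages.
    """
    # First pass: collect the set of languages each user edited.
    langs_by_user = defaultdict(set)
    for user, lang in edits:
        langs_by_user[user].add(lang)
    multilingual = {u for u, langs in langs_by_user.items() if len(langs) > 1}

    # Second pass: count edits per language, overall and from multilinguals.
    total = defaultdict(int)
    from_multi = defaultdict(int)
    for user, lang in edits:
        total[lang] += 1
        if user in multilingual:
            from_multi[lang] += 1
    share = {lang: from_multi[lang] / total[lang] for lang in total}
    return multilingual, share

# Toy input, invented for illustration only (not data from the study).
toy_edits = [
    ("alice", "eo"), ("alice", "en"), ("bob", "en"),
    ("bob", "en"), ("carol", "ja"), ("alice", "eo"),
]
users, share = multilingual_share(toy_edits)
```

In this toy input, only "alice" edits more than one language version, so every edit to "eo" and none of the edits to "ja" come from a multilingual user.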

Two existing write-ups are good starting points for putting the study in context.[supp 1][supp 2] In the long run, it would be interesting to extend the research to (a) cover a longer time span, (b) include contributions from non-registered users, despite technical difficulties, (c) include smaller Wikipedias, and (d) explore the effects of that bridging function in more detail, perhaps in search of ways to support its beneficial effects while minimizing the non-beneficial ones. It would also be interesting to focus on some aspects of those multilingual users (e.g. how the languages they edit in match the languages they display on their user pages) or their contributions (e.g. how their contributions to text, illustrations, references, links, templates, categories or talk page discussions differ across languages, or how contributions from multilinguals differ across topics or between pages with high and low traffic) – or to entertain ideas for a multilingual version of editing tools like User:SuggestBot. The paper is one of the first to make use of Wikidata; comparing such cross-lingual Wikipedia contributions with contributions to multilingual projects like Wikidata and Commons may also be a fruitful avenue for further research. (See also earlier coverage of a CSCW paper about a similar topic: "Activity of content translators on Wikipedia examined")

Attempt to use Wikipedia pageviews to predict election results in Iran, Germany and the UK

A new paper on arXiv asks the question "Can electoral popularity be predicted using socially generated big data?"[2] Operating on the assumption that "sentiment data is implied in information seeking behaviour," the authors, Yasseri and Bright, compare Wikipedia page views and Google search trends to election outcomes in Iran, Germany and the UK. In Iran and the UK, where the researchers were able to use the articles of individual politicians, the page view and search trend data correctly pick the winners of the elections. In the UK, the data even correctly pick the order of the runners-up, but the same is not true for Iran. In the German case, no correlation is found between search data and election results. Yasseri and Bright defer to the argument from previous studies on Twitter-based prediction, which concluded that the sample data is too self-selecting. Overall, it is shown that "people do not simply search in the same proportions that they vote." Still, the researchers note that these techniques react "quickly to the emergence of new 'insurgent' candidates."
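The kind of comparison reported here, checking whether attention data picks the winner and whether it also gets the order of the runners-up right, can be illustrated with a small sketch. The candidate labels and all numbers below are invented placeholders, not figures from the paper, and the helper names `shares` and `ranking` are hypothetical.

```python
def shares(counts):
    """Convert raw counts (page views or votes) into proportions."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

def ranking(counts):
    """Return candidate names ordered from highest to lowest count."""
    return sorted(counts, key=counts.get, reverse=True)

# Invented toy numbers for three hypothetical candidates.
page_views = {"A": 5000, "B": 3000, "C": 2000}
votes = {"A": 400, "B": 250, "C": 350}

# Does the most-viewed candidate match the election winner?
same_winner = ranking(page_views)[0] == ranking(votes)[0]
# Does the full order by page views match the full order by votes?
same_order = ranking(page_views) == ranking(votes)
# Gap between each candidate's share of views and share of votes.
share_gap = {c: shares(page_views)[c] - shares(votes)[c] for c in votes}
```

With these toy numbers the winner is picked correctly but the order of the runners-up is not, and the view shares differ from the vote shares, which is the pattern behind the quoted remark that people do not search in the same proportions that they vote.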

Integrity of Wikipedia and Wikipedia research

A book titled Confidentiality and Integrity in Crowdsourcing Systems contains a chapter on the integrity of the English Wikipedia as a case study of integrity management in crowdsourcing systems.[3] To test the integrity of Wikipedia, the authors first tried to start a new article with "invalid content" (it got deleted) and then turned to vandalizing pages systematically, both of which violate Wikipedia policies (cf. Wikipedia:Vandalism). They noted that simple cases were caught by automated counter-vandalism tools (ClueBot and XLinkBot, whose user pages – one of them with a typo – are the only references cited in the chapter), whereas more subtle cases ("incorrect information containing words related to the page’s topic" or adding external links present in related Wikipedia articles) were not. No indication was given as to whether these inappropriate edits had later been removed (by the authors themselves or by other users), nor what the affected pages were or what IP address(es) the authors had used to make those edits.

In a next step, the authors went through dumps of the English Wikipedia from 2001 to 2011 and analyzed revision histories for "100 good and featured articles" (which refers to Wikipedia:Good articles and Wikipedia:Featured articles – later, they call this set "high-quality articles") and "100 non-featured articles" (by which they mean neither good nor featured – later, they refer to this set as "low-quality articles"). In this sample (of which no further details are given), they observed that the number of contributions to high-quality articles is about one order of magnitude higher than that of low-quality articles and "that there is a highly active group of contributors involved from the creation of high quality articles until present", while most editors to low-quality articles never contributed to those pages again. They then looked at revert rates, at the overlap between sets of top contributors to a given article across years, and at the range of topics edited by top contributors to an article, observing that "the top contributors have become the owners of high quality articles and their engagement has increased" (which runs contrary to WP:OWN), "[T]his results in higher quality for a small portion of articles in Wikipedia" and "[T]op contributors of high quality articles are more like-minded than the top contributors of low quality articles", concluding "that the main difference between low quality and featured articles is the number of contributions."
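Since the chapter provides neither code nor data, the following is only a guess at what measuring "the overlap between sets of top contributors to a given article across years" might look like. The use of the Jaccard index, the cutoff `k`, the function names and the toy revision list are all assumptions made for this sketch, not details taken from the chapter.

```python
from collections import Counter

def top_contributors(revisions, year, k=2):
    """revisions: list of (year, user) pairs for one article.

    Returns the set of the k users with the most revisions in `year`.
    """
    counts = Counter(user for y, user in revisions if y == year)
    return {user for user, _ in counts.most_common(k)}

def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Toy revision history for one hypothetical article (invented data).
toy_revisions = [
    (2009, "u1"), (2009, "u1"), (2009, "u2"), (2009, "u2"), (2009, "u3"),
    (2010, "u1"), (2010, "u1"), (2010, "u4"), (2010, "u4"), (2010, "u5"),
]
top_2009 = top_contributors(toy_revisions, 2009)
top_2010 = top_contributors(toy_revisions, 2010)
overlap = jaccard(top_2009, top_2010)
```

A value close to 1 would indicate a stable core of top contributors from one year to the next, which is what the authors report for high-quality articles; in the toy data only "u1" stays in the top set.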

From that, they venture into extrapolating to crowdsourcing systems more generally: "[w]e observe that to have higher integrity in crowdsourcing systems, we need to have a permanent set of contributors who are dedicated for maintaining the quality of the contributions to the articles. For systems with open access such as Wikipedia, this can be a huge burden for the permanent editors. Therefore, we need new mechanisms for coordinating the activities in a crowdsourcing information system." No discussion of these new mechanisms is offered.

The chapter has a few simple tables and plots but links neither to the underlying data nor to the code used for the analysis, nor to relevant literature or Wikipedia policies, and it is paywalled behind a price tag of $29.95 / €24.95 / £19.95. Given that the experimental edits to Wikipedia actually damaged the project, it is hard to imagine that an ethical review panel involving Wikipedians would have approved the study in that form. In fact, such a panel does exist in the form of the Research Committee, which had not been contacted about the project. Considering further that the conclusions of the study are not new, that their possibly interesting implications for crowdsourcing more generally are not discussed, and that neither the paper nor its materials are available to those concerned about the integrity of Wikipedia, it is hard to see any benefit of this study that would outweigh the damage it caused (cf. earlier coverage: "Link spam research with controversial genesis but useful results", "Traffic analysis report and research ethics").

Briefly

References

  1. ^ Hale, Scott A. (2013). "Multilinguals and Wikipedia Editing". Proceedings of the 2014 ACM conference on Web science - Web Sci '14. pp. 99–108. arXiv:1312.0976. doi:10.1145/2615569.2615684. ISBN 9781450326223.
  2. ^ Taha Yasseri; Jonathan Bright (2013). "Can electoral popularity be predicted using socially generated big data?". arXiv:1312.2818 [physics.soc-ph].
  3. ^ Ranj Bar, A.; Maheswaran, M. (2014). "Case Study: Integrity of Wikipedia Articles". Confidentiality and Integrity in Crowdsourcing Systems. SpringerBriefs in Applied Sciences and Technology. p. 59. doi:10.1007/978-3-319-02717-3_6. ISBN 978-3-319-02716-6.
  4. ^ https://fosdem.org/2014/schedule/event/how_we_found_600000_grammar_errors/ (abstract only)
  5. ^ Azer, S. A. (2014). "Evaluation of gastroenterology and hepatology articles on Wikipedia". European Journal of Gastroenterology & Hepatology. 26 (2): 155–63. doi:10.1097/MEG.0000000000000003. PMID 24276492.
  6. ^ Benkler, Yochai, Aaron Shaw, and Benjamin Mako Hill. "Peer Production: A Modality of Collective Intelligence". (draft paper) http://mako.cc/academic/benkler_shaw_hill-peer_production_ci.pdf
  7. ^ Jason Stacy, Cory Blad, and Rob Velella. "Morbid Inferences: Whitman, Wikipedia, and the Debate over the Poet's Sexuality". Polymath: An Interdisciplinary Arts and Sciences Journal, Vol. 3, No. 4, Fall 2013 https://ojcs.siue.edu/ojs/index.php/polymath/article/view/2857
  8. ^ Bernard Fallery, Florence Rodhain. "Gouvernance d'Internet, gouvernance de Wikipedia : l'apport des analyses d'E. Ostrom". Management et Avenir, 65 (2013) 168–187, http://hal.archives-ouvertes.fr/docs/00/92/05/08/PDF/2013_RMA.pdf (in French, with English abstract)
  9. ^ Meyer, Christian M. "Wiktionary: The Metalexicographic and the Natural Language Processing Perspective". Technische Universität Darmstadt, Ph.D. thesis (2013) http://tuprints.ulb.tu-darmstadt.de/3654/
Supplementary references:

Wikipedia:Wikipedia Signpost/2013-12-25/Recent_research