The Signpost

Recent research

Disease outbreak uncertainties, AfD forecasting, auto-updating Wikipedia

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.


See also the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.

"Uncertainty During New Disease Outbreaks in Wikipedia"

From the abstract and the discussion section:[1]

"New disease outbreaks [e.g. Ebola, MERS, Swine influenza] are often characterized by emergent and changing information which, in turn, require Wikipedia editors to spend time and effort to retrieve and understand information that is sometimes ambiguous, complex, and contradictory. [...] the goals of this study are to identify types of uncertainty expressed by Wikipedia editors during new disease outbreaks, and examine different strategies deployed by Wikipedia editors to manage uncertainty. [...]

"Wikipedia editors depend on several strategies to cope with uncertainty during a disease outbreak. These strategies rely primarily on consulting authoritative sources, reporting the uncertainty to the public, ignoring the uncertainty in the interests of maintaining simplicity, and, to a far lesser extent, setting up a mailing list to gather information and science as they emerge over time."

"Analyzing Wikipedia Deletion Debates with a Group Decision-Making Forecast Model"

From the abstract:[2]

"we show that machine learning with natural language processing can accurately forecast the outcomes of group decision-making in online discussions. Specifically, we study Articles for Deletion, a Wikipedia forum for determining which content should be included on the site. Applying this model, we replicate several findings from prior work on the factors that predict debate outcomes; we then extend this prior work and present new avenues for study, particularly in the use of policy citation during discussion. Alongside these findings, we introduce a structured corpus and source code for analyzing over 400,000 deletion debates spanning Wikipedia's history."

"Science Is Shaped by Wikipedia: Evidence From a Randomized Control Trial"

From the abstract and discussion section:[3]

"Incorporating ideas into Wikipedia leads to those ideas being used more in the scientific literature. We provide correlational evidence of this across thousands of Wikipedia articles and causal evidence of it through a randomized control trial where we add new scientific content to Wikipedia. In the months after uploading it, an average new Wikipedia article in Chemistry is read tens of thousands of times and causes changes to hundreds of related scientific journal articles. Patterns in these changes suggest that Wikipedia articles are used as review articles, summarizing an area of science and highlighting the research contributions to it. Consistent with this reference article view, we find causal evidence that when scientific articles are added as references to Wikipedia, those articles accrue more academic citations. [...]

For each Wikipedia article that we created for this experiment we paid students $100. Assuming one Wikipedia article (or equivalent contribution) per research paper, the implicit tax on research would be ($100/$220,000 ) = 0.05%. [...] even with many conservative assumptions, dissemination through Wikipedia is ∼ 120× more cost-effective than traditional dissemination techniques."

This research prompted community discussions that ultimately led to the creation of a "Wikipedia is not a laboratory" policy on the English Wikipedia.

"'This is exactly how the Nazis ran it': (De)legitimising the EU on Wikipedia"

From the abstract:[4]

"The data examined consist of Wikipedia contributors' debates that took place on a Wikipedia discussion site ('talk page'). Taking a corpus-assisted approach combined with argumentation analysis and aspects of systemic functional linguistics, I found that Wikipedia editors repeatedly propose that Nazi Germany might have been a precursor of the EU today. However, the Wikipedia community ultimately rejects this notion and emphasises the voluntary nature guiding the EU's creation process. Thus, while the EU's legitimacy is indeed contested in the course of the debates, the Wikipedia community eventually rejects this challenge."

"The Dynamics of Peer-Produced Political Information During the 2016 U.S. Presidential Campaign"

From the abstract:[5]

"Drawing on systems justification theory and methods for measuring the enthusiasm gap among voters, this paper quantitatively analyzes the candidates’ biographical and related articles and their editors. Information production and consumption patterns match major events over the course of the campaign, but Trump-related articles show consistently higher levels of engagement than Clinton-related articles."


"Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia"

From the tool documentation and abstract:[6]

"Wikipedia2Vec is a tool for learning embeddings of words and entities from Wikipedia. The learned embeddings map similar words and entities close to one another in a continuous vector space.

This tool learns embeddings of words and entities by iterating over entire Wikipedia pages and jointly optimizing the following three submodels:

  • Wikipedia link graph model, which learns entity embeddings by predicting neighboring entities in Wikipedia's link graph [...]
  • Word-based skip-gram model, which learns word embeddings by predicting neighboring words given each word in a text contained on a Wikipedia page.
  • Anchor context model, which aims to place similar words and entities near one another in the vector space [...]

The embeddings of entities in a large knowledge base (e.g., Wikipedia) are highly beneficial for solving various natural language tasks that involve real world knowledge. In this paper, we present Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities from Wikipedia. [...] We also introduce a web-based demonstration of our tool that allows users to visualize and explore the learned embeddings."
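
For readers who want to try the toolkit, here is a minimal usage sketch based on its documented Python API (pip install wikipedia2vec); the model file name is a placeholder for one of the project's downloadable pretrained embedding files:

```python
# Minimal usage sketch of Wikipedia2Vec's Python API; the .pkl file name
# is a placeholder for one of the project's pretrained embedding files.
from wikipedia2vec import Wikipedia2Vec

model = Wikipedia2Vec.load("enwiki_20180420_100d.pkl")

# Words and entities share one vector space, so both kinds of lookups
# and cross-type similarity queries work.
word_vec = model.get_word_vector("wikipedia")
entity_vec = model.get_entity_vector("Wikipedia")
print(word_vec[:5], entity_vec[:5])

# Nearest neighbours of the entity "Wikipedia" (words and entities mixed).
for item, score in model.most_similar(model.get_entity("Wikipedia"), 5):
    print(item, score)
```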


"Introduction to Neural Network based Approaches for Question Answering over Knowledge Graphs"

From the abstract:[7]

" ...we provide an overview over [...] recent advancements [in question answering research], focusing on neural network based question answering systems over knowledge graphs [including "the most popular KGQA datasets": 8 based on Freebase, 2 on DBPedia, one on DBpedia and Wikidata]. We introduce readers to the challenges in the tasks, current paradigms of approaches, discuss notable advancements, and outline the emerging trends in the field."


"Automatic Fact-guided Sentence Modification"

From the abstract:[8]

"Online encyclopediae like Wikipedia contain large amounts of text that need frequent corrections and updates. The new information may contradict existing content [....] we focus on rewriting such dynamically changing articles. [...] To this end, we propose a two-step solution: (1) We identify and remove the contradicting components in a target text for a given claim, using a neutralizing stance model; (2) We expand the remaining text to be consistent with the given claim, using a novel two-encoder sequence-to-sequence model with copy attention. Applied to a Wikipedia fact update dataset, our method successfully generates updated sentences for new claims... "

See also the university press release: "Automated system can rewrite outdated sentences in Wikipedia articles" ("Text-generating tool pinpoints and replaces specific information in sentences while retaining humanlike grammar and style") and media coverage.


"Transforming Wikipedia into Augmented Data for Query-Focused Summarization"

This preprint[9] presents a query-focused summarization dataset using Wikipedia's citations to align queries and documents.

"Knowledge Graphs and Knowledge Networks: The Story in Brief"

This summary of the journey of knowledge graphs for Artificial Intelligence[10] also covers Wikidata:

"Wikidata (wikidata.org/) is wikipedia’s open-source machine-readable database with millions of entities where everyone can contribute and use (with reading and editing permissions) with a user-friendly query interface.

It covers a wide variety of domains and contains not only textual knowledge but also images, geocoordinates, and numerics. Wikidata uses unique identifiers for each entity/relation for accurate querying and provides provenance metadata, unlike DBpedia and schema.org. For instance, it includes information about a fact’s correctness in terms of its origin and temporal validity (reference point of time during of the fact). Wikidata is one of the latest projects acknowledging the dynamic nature of KG and is continuously updated by human contributors unlike DBpedia which is curated from wikipedia once in a while."
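
To illustrate the query interface mentioned above (our example, not one from the paper), here is a minimal request to Wikidata's public SPARQL endpoint, asking for the inception date (property P571) of the Wikipedia item (Q52):

```python
# Minimal example of querying Wikidata's public SPARQL endpoint
# (https://query.wikidata.org/sparql). The query asks for the
# "inception" date (P571) of the item for Wikipedia (Q52).
import requests

query = """
SELECT ?inception WHERE {
  wd:Q52 wdt:P571 ?inception .
}
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "research-newsletter-example/0.1"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["inception"]["value"])  # e.g. 2001-01-15T00:00:00Z
```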

"Strangers in a seemingly open-to-all website: the gender bias in Wikipedia"

From the abstract:[11]

"Based on action research with a mixed evaluation method and two rounds of interviews, the research followed the steps of 27 Israeli women activists who participated in editing workshops.

Findings: [...] having the will to edit and the knowledge of how to edit are necessary but insufficient conditions for women to participate in Wikipedia. The finding reveals two categories: pre-editing barriers of negative reputation, lack of recognition, anonymity and fear of being erased; and post-editing barriers of experiences of rejection, alienation, lack of time and profit and ownership of knowledge. The research suggests a “Vicious Circle” model, displaying how the five layers of negative reputation, anonymity, fear, alienation and rejection enhance each other, in a manner that deters women from contributing to the website."

References

  1. ^ Tamime, Reham Al; Hall, Wendy; Giordano, Richard (2019-07-06). "Uncertainty During New Disease Outbreaks in Wikipedia". Proceedings of the International AAAI Conference on Web and Social Media. 13 (1): 38–46. ISSN 2334-0770.
  2. ^ Mayfield, Elijah; Black, Alan W. (November 2019). "Analyzing Wikipedia Deletion Debates with a Group Decision-Making Forecast Model". Proc. ACM Hum.-Comput. Interact. 3 (CSCW): 206:1–206:26. doi:10.1145/3359308. ISSN 2573-0142. Author's copy
  3. ^ Thompson, Neil; Hanley, Douglas (2019-08-16). "Science Is Shaped by Wikipedia: Evidence From a Randomized Control Trial". MIT Sloan Research Paper. doi:10.2139/ssrn.3039505.
  4. ^ Kopf, Susanne (2020-02-07). "'This is exactly how the Nazis ran it': (De)legitimising the EU on Wikipedia". Discourse & Society: 0957926520903524. doi:10.1177/0957926520903524. ISSN 0957-9265. (closed access)
  5. ^ Keegan, Brian C. (November 2019). "The Dynamics of Peer-Produced Political Information During the 2016 U.S. Presidential Campaign". Proc. ACM Hum.-Comput. Interact. 3 (CSCW): 33:1–33:20. doi:10.1145/3359135. ISSN 2573-0142. (closed access) Author's copy
  6. ^ Yamada, Ikuya; Asai, Akari; Sakuma, Jin; Shindo, Hiroyuki; Takeda, Hideaki; Takefuji, Yoshiyasu; Matsumoto, Yuji (2020-01-30). "Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia". arXiv:1812.06280 [cs.CL].
  7. ^ Chakraborty, Nilesh; Lukovnikov, Denis; Maheshwari, Gaurav; Trivedi, Priyansh; Lehmann, Jens; Fischer, Asja (2019-07-22). "Introduction to Neural Network based Approaches for Question Answering over Knowledge Graphs". arXiv:1907.09361 [cs.CL].
  8. ^ Shah, Darsh J.; Schuster, Tal; Barzilay, Regina (2019-12-02). "Automatic Fact-guided Sentence Modification". arXiv:1909.13838 [cs.CL].
  9. ^ Zhu, Haichao; Dong, Li; Wei, Furu; Qin, Bing; Liu, Ting (2019-11-08). "Transforming Wikipedia into Augmented Data for Query-Focused Summarization". arXiv:1911.03324 [cs.CL].
  10. ^ Sheth, Amit; Padhee, Swati; Gyrard, Amelie (July 2019). "Knowledge Graphs and Knowledge Networks: The Story in Brief". IEEE Internet Computing. 23 (4): 67–75. doi:10.1109/MIC.2019.2928449. ISSN 1941-0131. (closed access)
  11. ^ Lir, Shlomit Aharoni (2019-01-01). "Strangers in a seemingly open-to-all website: the gender bias in Wikipedia". Equality, Diversity and Inclusion: An International Journal. ahead-of-print (ahead-of-print). doi:10.1108/EDI-10-2018-0198. ISSN 2040-7149. (closed access) Copy at Academia.edu



Discuss this story

These comments are automatically transcluded from this article's talk page. To follow comments, add the page to your watchlist. If your comment has not appeared here, you can try purging the cache.
  • Perhaps a more constructive quote from the last paper listed, on our editorial gender gap:

Practical implications: In order for more women to join Wikipedia, the research offers the implantation of a "Virtuous Circle" that consists of nonymity, connection to social media, inclusionist policy, soft deletion and red-flagging harassments.

I see a future RfC mentioned in this Signpost issue is relevant to the last point. The second one is something that Women in Red has worked on a lot. The third and fourth are something to think about at AfC, NPP, AfD and similar. I'm not quite sure what can be done on the matter of the first, though something I've been thinking for a while is that people should hand out barnstars more liberally. — Bilorv (talk) 22:51, 29 March 2020 (UTC)[reply]
The paper goes pretty specific with suggestions on each of those. For the first:

... Insisting on a website based on true names and pictures will also allow a much-needed concept of situated and contextualized aspects of knowledge. ... This step is especially important in regard to the overt identities of bureaucrats and system admins who have tremendous power over others on the website, including erasing entries and banning users. Position holders must be obliged to volunteer in exposed identities in order to contribute to an organizational climate of safety, based on familiarity and accountability.

Didn't finish the paper but would want to see it address the idea that exposing one's identity might make the editor more susceptible to targeted harassment. czar 01:17, 30 March 2020 (UTC)[reply]
CC User:שלומית_ליר. Regards, HaeB (talk) 04:13, 30 March 2020 (UTC)[reply]
Indeed, @Czar:. I'd also like to see how they'd handle concerns about just a reduced pool of admins (et al) being less effective at fulfilling their roles, if there was a significant number of resignations, which I believe there would be. I wouldn't have participated in handling the Delhi riots page if I had a public identity. Nosebagbear (talk) 16:37, 30 March 2020 (UTC)[reply]
I would also be concerned about this particular way of implementing nonymity. It's the central fact about the internet: anonymity leads to many brilliant communities, but also many terrible ones. Nonetheless, I think we can definitely do something about the other four findings. — Bilorv (talk) 18:03, 30 March 2020 (UTC)[reply]
I quite agree with Bilorv. Not reaching the goal of 25% participation by women is a huge waste of potential, and the reasons given in this research are really disheartening. I picked out the phrase "the fear of being erased" from the report for The Signpost's teaser blurb, because it was so striking.
At the same time, this research commits an error I've seen before, when it describes the deletion of an article which I located on Hebrew Wikipedia. The paper states it was impossible to request a proper debate and it was deleted without proper discussion. However, unless I'm very mistaken, the article was in fact a copyright violation (copyvio). Not only that, but it was restored after the appropriate people received permission via OTRS. So, a very misleading description. ☆ Bri (talk) 19:45, 30 March 2020 (UTC)[reply]
Thanks for your comments and for the overall interesting discussion. As to this specific remark, the speedy deletion was not related to copyright problems. In fact, the writer's draft was shown to David Shi, who was the head Wikipedian at the time, beforehand. Years later a very similar article under the same name was published in Wikipedia by someone else. שלומית ליר (talk) 07:27, 1 April 2020 (UTC)[reply]
Thanks for clarifying. I was very mistaken, after all. ☆ Bri (talk) 05:32, 2 April 2020 (UTC)[reply]
We as a community really have a tough job of keeping our core principles, such as not allowing copyvios or un-referenced additions, also while not kicking new good-faith editors to the curb, and not confronting them with a maze of jargon and legalisms that make them perpetual outsiders. What may seem to the initiated as impersonal and routine procedures is apparently coming across – at least to some – as distancing and unwelcoming, if not outright contemptuous. We need to take that to heart. ☆ Bri (talk) 19:45, 30 March 2020 (UTC)[reply]
The paper repeats most of the well-known issues that have been heavily discussed for many years, but talked only to women, and falls into the trap at many points of assuming these are women-specific, which we know they are not. They also use too many old references; papers from before, say, 2010 (and certainly 2007) are likely to be badly out of date and should not be used these days. Johnbod (talk) 15:27, 31 March 2020 (UTC)[reply]
My takeaway is that there is such a thing as norms of interaction that are informed by gender of each participant. The paper has told us several problems with the hard-nosed communication style common in male-dominated STEM fields where many of our early contributors come from. And how that cadre of early contributors has coalesced into difficult-to-influence norms and culture, including our processes, templates, and what have you. ☆ Bri (talk) 18:32, 31 March 2020 (UTC)[reply]
I'm less than impressed with the recommendations; the author's enthusiasm for social media networks despite their increasingly well documented flaws is rather striking. One suggestion in a footnote is that Wikipedia should have a biography of every single human being. --RaiderAspect (talk) 10:56, 1 April 2020 (UTC)[reply]
Thanks for your comment. Nevertheless, to be accurate, I recommended a future Wiki project documenting every individual who allows the publication of his or her life story online. As I mentioned in the footnote, I believe that the understanding of what constitutes knowledge will widen with time, and with it the perception of the importance of open-to-all online documentation of more human lives.
As to the flaws of social media that you mention, I am well aware of them. I recommended allowing Wikipedians' user-pages to be associated with their Social Media profiles (if they wish to do so) as a means of achieving a much-needed sense of safety. As research points out, bullying thrives in anonymous online environments. As strange as it might sound, and in view of the long history of anonymity on the website, the principle of nonymity is crucial to overcoming the gender bias in Wikipedia. שלומית ליר (talk) 09:09, 2 April 2020 (UTC)[reply]
Thank you for your reply! I apologise for my misreading, although I'm afraid that I still feel such a project would be more akin to LinkedIn or Facebook than Wikipedia. The fact that no individual 'owns' a page is the essence of Wikipedia, at least in my opinion.
Regarding nonymity, the issue I see is that while it would probably make Wikipedia itself a friendlier place, it would also facilitate more serious harassment of Wikipedia users outside of Wikipedia. Being told that your work is worthless and that your beliefs are nonsense is hurtful, but having someone send threats to your friends and family or call your employer to make accusations is an order of magnitude more so. And for users in certain countries, associating their user-page with their Social Media profiles could put them in physical danger.
I realize you may be perfectly aware of these issues and simply didn't have room to discuss them in your paper. Please don't feel obliged to get dragged into a debate here, I'm merely an interested observer. --RaiderAspect (talk) 13:42, 2 April 2020 (UTC)[reply]

A very interesting discussion with good points on all sides. For me, the salient point is that Wikipedia combines an increasingly strict insistence on quality (citations, neutrality, non-commerciality, copyright, and all the rest) with rather little in the way of training and apprenticeship for newcomers. Helping even an able, willing, and co-operative newbie into editing an area effectively is quite a lot of work, especially for the first week or two, and such coaching requires expertise, energy and teaching skill, all quantities in quite short supply. But throwing newcomers straight into the rigours of editing live articles with no training seems increasingly drastic; it was alarming being a newbie over a decade ago, and now it's certainly worse. Other measures than apprenticeships and coaching are imaginable: we could encourage people to take an online tutorial; we could have a pop-up box asking new editors to make their suggestion on an article's talk page until they get the hang of things; there could be an automated 20-questions test so newcomers could see what skills they needed to get started; and so on. And of course, a safe place for new female editors in particular would be very welcome. Chiswick Chap (talk) 09:01, 2 April 2020 (UTC)[reply]

Chiswick Chap: As a relative newcomer (although I guess I've now been here for over three years), I can attest that it is not very easy to learn the ins and outs of Wikipedia. I don't recall having much help figuring things out, and I think I just got lucky by meeting very patient editors like Display Name 99 who didn't get angry when I made mistakes. All I remember is that it felt to me like Wikipedia was empty and I was pretty alone. For me, I think that helped because I took my time, writing a draft article first and mainly just adding citations in the mainspace until things started to make sense. With that being said, a lot of my work was definitely sub-par, and I think I'm very lucky nobody scared me away from editing. As I remember, the first automated message I got just seemed very impersonal and didn't help at all because it just overwhelmed me with things I wasn't interested in at all. On the other hand, I remember The Wikipedia Adventure actually being a big help initially. So, in summary, I'd say that the biggest thing we can do is have patience and not bite the newcomers. Automated messages only overwhelmed me, but more interactive things like the test or box you suggest may have a better effect. Again, this is only based on my experience, and I don't think it's universally applicable. Eddie891 Talk Work 18:35, 3 April 2020 (UTC)[reply]
  • @RaiderAspect and שלומית ליר: - those are certainly the worst "outside wiki" risks that would come from reduced anonymity, but there are also other possibilities. Editors whose employers do social media checks on employees and/or applicants could now more easily be found. Those employers might be unhappy in some cases, for myriad reasons, but they could also be too happy - at my previous job (back when I'd barely started editing), a trainee colleague had to do some deft talking to avoid being basically forced into paid editor status (on the wiki side) and being held responsible for what got added onto the company page (on his employment side). Nosebagbear (talk) 10:38, 3 April 2020 (UTC)[reply]

Science Is Shaped by Wikipedia: Evidence From a Randomized Control Trial

This research is from 2017, as are the linked discussions. I don't understand why it is listed in A monthly overview of recent academic research about Wikipedia [...]. —⁠andrybak (talk) 08:40, 22 April 2020 (UTC)[reply]

@Andrybak: The cited revision of the paper is from 16 Aug 2019. That said, you are correct that we interpret "Recent" liberally here - our backlog of papers to cover (writeups are always welcome!) goes back several years, although ideally we want to cover new ones within a few months of their publication. It has never been the intention though to limit the scope to only things from the past month; the assumption is that the highlighted research results will stay relevant for much longer.
Regards, HaeB (talk) 21:12, 25 April 2020 (UTC)[reply]