The Signpost

News from the WMF

Can machine learning uncover Wikipedia’s missing “citation needed” tags?

This article originally appeared in the Wikimedia Foundation blog on April 3, 2019.

"Wikipedian protester," Randall Monroe/xkcd

We are using machine learning to predict whether—and why—any given sentence on Wikipedia may need a citation, in order to help editors identify content that violates the verifiability policy.

One of the key mechanisms that allows Wikipedia to maintain its high quality is the use of inline citations. Citations let readers and editors verify that the information in an article accurately reflects its sources. As Wikipedia’s verifiability policy mandates, “material challenged or likely to be challenged, and all quotations, must be attributed to a reliable, published source”, and unsourced material should be removed or challenged with a citation needed flag.

However, deciding which sentences need citations may not be a trivial task. On the one hand, editors are urged to avoid adding citations for information that is obvious or common knowledge—like the fact that the sky is blue. On the other hand, sometimes the sky doesn’t actually appear blue—so perhaps we need a citation for that after all?

Scale up this problem to the size of an entire encyclopedia, and it may become intractable. Wikipedia editors’ time is limited and their expertise is valuable—which kinds of facts, articles, and topics should they focus their citation efforts on? Also, recent estimates show that a substantial proportion of articles have only a few references, and that one out of four articles in English Wikipedia does not have any references at all. This suggests that while around 350,000 articles contain one or more citation needed flags, we are probably missing many more.

We recently designed a framework to help editors identify and prioritize which sentences need citations in Wikipedia. Through a large study conducted with editors from the English, Italian, and French Wikipedias, we first identified a set of common reasons why individual sentences in Wikipedia articles require citations. We then used the results of this study to train a machine learning classifier that can predict whether—and why—any given sentence on English Wikipedia needs a citation. We plan to extend it to other language editions in the next three months.

By improving the identification of where Wikipedia gets its information from, we can support the development of systems that help volunteer-driven verification and fact-checking, potentially increasing Wikipedia’s long-term reliability and making it more robust against bias, information quality gaps, and coordinated disinformation campaigns.

Why do we cite?

To teach machines how to recognize unverified statements, we first needed to systematically classify the reasons why sentences need citations.

We started by examining policies and guidelines related to verifiability in the English, French, and Italian Wikipedias and attempted to characterize the criteria for adding (or not adding) a citation described in those policies. To verify and enrich this set of best practices, we asked 36 Wikipedia editors from all three language communities to participate in a pilot experiment. Using WikiLabels, we collected editors’ feedback on sentences from Wikipedia articles: editors were asked to decide whether a sentence needed a citation and to specify a reason for their choice in a free-text field.

Our methods and our final set of reasons for adding or not adding a citation can be found on our project page.

Reasons for adding a citation
Reasons for not adding a citation


Teaching a machine to discover citation gaps

Next, we trained a machine learning model to discover sentences needing citations, and characterize them with a matching reason.

We first trained a model to learn, from the wisdom of the whole editor community, how to identify sentences that need to be cited. We created a dataset from English Wikipedia’s “featured” articles—the encyclopedia’s designation for articles of the highest quality, which are also the most thoroughly cited. Sentences from featured articles that contain an inline citation were treated as positive examples, and sentences without an inline citation as negative examples. With this data, we trained a recurrent neural network that predicts whether a sentence is positive (should have a citation) or negative (should not have a citation) based on the sequence of words in the sentence. The resulting model can correctly classify sentences in need of a citation with an accuracy of up to 90%.
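To give a concrete sense of this setup, here is a minimal sketch of such a binary "citation need" classifier in PyTorch. The GRU cell, the hyperparameters, and the class names are illustrative assumptions on our part, not the authors' actual architecture or code.

```python
import torch
import torch.nn as nn

class CitationNeedClassifier(nn.Module):
    """Recurrent sentence classifier: does this sentence need a citation?"""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 1)  # single logit: citation needed or not

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices for each sentence
        embedded = self.embedding(token_ids)
        _, last_hidden = self.rnn(embedded)           # final hidden state summarizes the sentence
        logits = self.classifier(last_hidden.squeeze(0))
        return torch.sigmoid(logits)                  # probability the sentence needs a citation

# Positives: sentences from featured articles with an inline citation (label 1);
# negatives: sentences without one (label 0).
model = CitationNeedClassifier(vocab_size=50_000)
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

Any sequence model over the sentence's words would fit this framing; the key point is that the training signal comes entirely from where featured-article editors have already placed citations.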

Explaining algorithmic predictions

But what makes the model up to 90% accurate? What does the algorithm look at when deciding whether a sentence needs a citation?

To help interpret these results, we took a sample of sentences needing citations for different reasons and highlighted the words the model weighted most heavily when classifying them. In the case of “opinion” statements, for example, the model assigned the highest weight to the word “claimed”. For the “statistics” citation reason, the words most important to the model are verbs often used to report numbers. For scientific citation reasons, the model pays more attention to domain-specific words like “quantum”.
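One common way to obtain this kind of word-level explanation is an attention layer over the recurrent states, whose normalized weights can be read off and plotted over the sentence. The sketch below is an illustrative reconstruction of that idea under the same assumed PyTorch setup as above, not the authors' implementation; class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Pools per-token RNN states into one sentence vector and exposes the weights."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, rnn_outputs):
        # rnn_outputs: (batch, seq_len, hidden_dim) hidden state for every word
        scores = self.scorer(rnn_outputs).squeeze(-1)    # one raw score per word
        weights = torch.softmax(scores, dim=-1)          # normalized word importance
        pooled = torch.bmm(weights.unsqueeze(1), rnn_outputs).squeeze(1)
        return pooled, weights

# `weights` has one value per word; highlighting the highest-weighted words in a
# sentence (e.g. "claimed" in an opinion statement) yields figures like the one below.
```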

Examples of sentences that need citations according to our model, with key words highlighted

Predicting why a sentence needs a citation

Similar to the “reason” field of the [citation needed] tag, we want our model to also explain why a sentence needs a citation. We therefore created a model that classifies statements needing citations by reason. We first designed a crowdsourcing experiment using Amazon Mechanical Turk to collect labels about citation reasons. We randomly sampled 4,000 sentences containing citations from featured articles and asked crowdworkers to label each with one of the eight citation reason categories identified in our earlier study. We found that sentences are most likely to need citations when they relate to scientific or historical facts, or when they contain direct or indirect quotations.

We modified the neural network described above so that it can classify an unsourced sentence into one of the eight citation reason categories. We retrained this network on the crowdsourced labels and found that it predicts citation reasons with reasonable accuracy (a precision of 0.62), especially for classes with a substantial amount of training data.
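In practice, turning the binary model into a reason classifier mainly means swapping the single-logit output for a softmax over the eight categories. The sketch below illustrates that change under the same assumed PyTorch setup as the earlier snippets; it is not the authors' code, and the category list in the comment is only indicative.

```python
import torch.nn as nn

NUM_REASONS = 8  # e.g. opinion, statistics, scientific fact, quotation, ...

class CitationReasonHead(nn.Module):
    """Classification head mapping a sentence representation to a citation reason."""
    def __init__(self, hidden_dim=128, num_reasons=NUM_REASONS):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_reasons)

    def forward(self, sentence_representation):
        # sentence_representation: (batch, hidden_dim) vector from the shared encoder
        return self.classifier(sentence_representation)  # logits over the reason categories

# Retraining on the crowdsourced labels would use a multi-class loss, e.g.:
# loss_fn = nn.CrossEntropyLoss()
```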

Next steps: predicting “citation need” across languages and topics

The next phase of this project will involve modifying our models so that they can be trained for any language available in Wikipedia. We will use these multilingual models to quantify the proportion of unverified content across Wikipedia editions, and map citation coverage across different article topics, in order to help editors identify areas where adding high quality citations is particularly important.

We plan to make the source code of these new models available soon. In the meantime, you can check out the research paper, recently accepted at The Web Conference 2019, its supplementary material with detailed analysis of the citation policies, and all the data we used to train the models.

We would love to hear your feedback and comments, so please reach out to us on our project page to help us improve it.

The authors would like to thank the community members of the English, French, and Italian Wikipedias, along with workers from Amazon Mechanical Turk, for helping with data labeling and for their valuable suggestions.

Miriam Redi is a Research Scientist at the Wikimedia Foundation
Jonathan Morgan is a Senior Design Researcher at the Wikimedia Foundation
Dario Taraborelli is a former Director of Research at the Wikimedia Foundation
Besnik Fetahu is a post-doctoral researcher at the L3S Research Center in Hannover


Discuss this story

These comments are automatically transcluded from this article's talk page.
  • How will this change affect articles on philosophy, interpretive social sciences, and the humanities? Many articles in those areas could be improved with better citations, but some articles have fairly solid coverage without many citations beyond the few books that are being discussed. This article says that "opinions" need citations while "book plots" do not. Where do interpretive synopses of complex works fall into this? Often, insufficient book reviews exist for coverage of a humanities/social science text for the article to focus on quoting reviews (though this is usually recognized as best practice), so interpretation is necessary for much of these works, which may look a lot like opinions.

    I guess I'm just worried that half of humanities wikipedia will become "unverified," or worse, "unverifiable," overnight, if the ML algorithms' sensitivity is set just a little too high, or it never gets trained on H/SS articles.- - mathmitch7 (talk/contribs) 19:04, 30 April 2019 (UTC)[reply]
  • Mathmitch7 This is an interesting point. Although there's no cause to be worried because there has been no "change"—no new products have been developed, no community policies have been changed. If a wiki decided to adopt this technology (for example, in a new kind of CitationBot), then they would have control over implementation. If WMF decided to incorporate citation need predictions into a MediaWiki feature or something (and there are currently no plans to do that), then they would be suggestions, not mandates. Individual wikis still determine what notability and verifiability mean. Cheers, Jmorgan (WMF) (talk) 15:29, 2 May 2019 (UTC)[reply]
  • Other comments have mentioned the humanities, but I work extensively in the biology projects and you might be surprised at just how many unsourced or poorly-sourced claims there are. I don't generally work with our best articles, to be fair, but this is a problem for all of enwiki--NOT just the humanities. Honestly, while I do think that this kind of tool would be extremely helpful for us in figuring out what needs citations, I also think it is the cart leading the horse. The problem with lacking citations has NEVER been that people don't understand when we need them. Sure, there are plenty of inexperienced editors who just cobble an idea together without sources because they don't know better. But I would argue strongly that the bottleneck has always been that finding and citing refs is a pain in the rear. There, I said it. What I really think that WMF should focus on to solve the WP:V problem is making research and citation easier. Look at what Microsoft and Google have been doing with their word-processing software. I think we should be BOLD and consider ideas like having research tools built into the editing interface, including (for starters) links to other WMF resources. Many journals and other sources actually create citations for you automatically, now, but they are in different formats. What about machine-learning tools to reformat citations semi-automatically (reviewed by a person) and strip information out of them? What about redesigning the code editor to improve syntax highlighting to make it easier to see refs or easier to see article text, depending on need? Those are all ways which I think would directly support the goals of WP:V very well. Don't get me wrong. I like the citation checking concept. But it isn't any good without a way to dramatically increase the amount of sourcing by editors, especially the less-dedicated. Prometheus720 (talk) 02:45, 3 May 2019 (UTC)[reply]
  • This! I think that's a great idea. Sometimes articles already point in a direction of a citation but don't connect all the dots ... I think ML could definitely help us out with that!
    My only potential concern (which could be mitigated!) is definitely about citational politics: it seems that an ML system would likely point us toward the already over-cited resources, instead of new resources that could substantially contribute to an article. I don't think that's a problem per se, just a new technical/political challenge to consider. How do we point people toward quality resources that aren't widely used? How do we know they're quality if they're not widely used? Maybe there's a cultural reason they're not used (i.e., pseudoscience that has all the packaging of legit science but supports totally bogus claims that most people already know are to be avoided). Just a thought! - - mathmitch7 (talk/contribs) 03:11, 3 May 2019 (UTC)[reply]
  • @Mathmitch7: I actually was referring to just reformatting citations. Throw at it some citations with all the info needed of various types of sources, and then say, "Here, clean these up and make them look the same." Even better, it could actually follow the doi or other link and collect any additional information which might be needed, or perhaps even go out and find a doi link. I would not want to use machine learning to find new citations. That would be dicey as you pointed out. Prometheus720 (talk) 16:09, 4 May 2019 (UTC)[reply]
  • Pages about citations:
--Guy Macon (talk) 15:49, 4 May 2019 (UTC)[reply]
Why are the images in this article... images? For someone using a screen reader, or with images turned off, they provide no information. The image at the top of the article is decorative, fine. The distribution of reason labels might be difficult to turn into a text explanation. Understandable.

But 'Reasons for adding a citation', 'Reasons for not adding a citation', and 'Examples of sentences that need citations according to our model, with key words highlighted': Why are these images and not text? All three could be communicated as effectively in text, without the accessibility failure. The first two are especially bad. There's no good reason for these to be images and not text. If you (the people who wrote the article, the people who created or added the images, the Signpost editors) thought about this and made the decision to use images rather than text, why did you not add alt text?

I would fix it myself were I more expert in the use of Commons and editing of image files here. That wouldn't, however, change the copies of Signpost that are on talk pages or in other locations.

Please read the section on images on the Accessibility page of the Manual of Style, and please, don't do this again. BlackcurrantTea (talk) 05:28, 13 May 2019 (UTC)[reply]