The Signpost

Figure: How wikiannotate.org works (File:How_wikiannotate.org_works.png). Image by Sage Ross (no AI, believe it or not!), CC-BY-SA.
Forum

WikiAnnotate: help us build a dataset of article quality evaluations

Editor's note – if you want to know more about how annotated datasets can be used to build tools for the community, as referenced in this article, you can start at Machine learning#Supervised learning.

TL;DR

I'm working with a team of researchers to collect a high-quality dataset of fine-grained Wikipedia article assessments. Experienced editors (with at least 1,000 edits) can contribute — and get paid for it — at wikiannotate.org. We'll use this dataset to build better automated article assessment tools.

Background

I've been working at Wiki Education since 2014, building software — like the Wiki Education Dashboard — to support programs that bridge the gap between Wikipedia and academia. Our flagship program — the Wikipedia Student Program — supports hundreds of higher education courses and thousands of students every term, as professors guide their students to improve Wikipedia in their areas of expertise and interest.

The widespread adoption of AI tools has been highly disruptive — as with many online domains — to Wiki Education and our work training student editors how to contribute effectively to the sum of all human knowledge. Teaching students how Wikipedia works — and how to reliably know things and share knowledge in ways that go beyond "just trust the AI" — is more important than ever (both for Wikipedia and for the students who are learning to learn in this AI-centric information environment). You can read a recap of much of our recent work in this area, but I think the impacts AI will have on Wikipedia are just beginning.

We can and will continue adapting to the changing landscape of AI usage, but one of the things holding us back is that we don't have good tools for measuring article quality systematically and automatically. The best software tool we currently have for automatically measuring aspects of article quality — Wikimedia Foundation's ‘articlequality’ model (formerly ORES) — can't differentiate between great content written by an experienced Wikipedian and an AI-slop imitation of what a great Wikipedian would write. It uses some basic metrics, like the amount of text, number of citations, headers, images, and so on, to predict the quality of an article, but can't address anything involving the quality or accuracy of the writing itself.
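To make that limitation concrete, here is a minimal Python sketch of surface-level feature counting. It is an illustration only, not the actual articlequality model: the function name `structural_features`, the regular expressions, and the two sample snippets are all hypothetical. The point is that two passages with very different prose (one factual, one fluffy) can yield identical feature counts.

```python
import re


def structural_features(wikitext: str) -> dict:
    """Count crude surface features of a wikitext snippet (hypothetical helper)."""
    return {
        # Every run of word characters counts, including markup words like "ref".
        "words": len(re.findall(r"\w+", wikitext)),
        # Opening <ref> tags only; closing </ref> tags are not matched.
        "citations": len(re.findall(r"<ref[\s>]", wikitext)),
        # Lines of the form "== Heading ==".
        "headers": len(re.findall(r"^=+[^=\n]+=+[ \t]*$", wikitext, flags=re.MULTILINE)),
        # [[File:...]] or [[Image:...]] links.
        "images": len(re.findall(r"\[\[(?:File|Image):", wikitext)),
    }


# Two made-up snippets: same structure, very different prose quality.
factual = "== History ==\nThe sponge was described in 1759.<ref>Smith 2001</ref>\n"
fluffy = "== History ==\nThe sponge is regarded as iconic.<ref>Smith 2001</ref>\n"

features_a = structural_features(factual)
features_b = structural_features(fluffy)
print(features_a)
print(features_a == features_b)
```

Any predictor that sees only counts like these has no way to tell the two snippets apart, which is the gap the article describes.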

For Wiki Education's programs, we have one powerful tool for catching slop: the Wiki Education Dashboard integrates with the AI detection service Pangram, automatically scanning larger edits for signs of LLM-generated text. For samples of at least a few hundred words, Pangram is very good at sorting human-written prose from text that came straight out of an LLM. However, real-world AI usage patterns are much more complicated, ranging from minor copyedits to LLM-generated text that gets extensively rewritten by hand (and everything in between). In many cases — like the increasingly AI-centric Grammarly service — it's not even obvious to a student just how much of their text came out of an LLM, because AI tools get integrated into conventional text editors. We can warn a student when we detect a high likelihood of LLM text, but that kind of strategy creates an antagonistic relationship. Students perceive that they've been accused of cheating with AI, and become defensive — and still don't get a clear indication of what the AI did badly or why we have rules against AI-written article content.

Hallucination is fundamental to the way LLMs work, but they can do a pretty good job in some respects: recent models can write understandable prose about encyclopedic topics, and they can generally follow our style guidelines when prompted to do so. Some of the things they do very badly — like accurately representing the content of individual sources — are also harder for a human to notice. (I've come to think of it like this: LLMs think they've read every book, but haven't actually read any. Everything they've trained on is a muddled mix, so they can't accurately represent any single source without accessing it directly.) But it's now possible to do much better.

wikiannotate.org

We can build tools that use LLMs to explicitly evaluate an article against many aspects of our policies, guidelines and quality standards (like the detailed quality rubric of WP:ASSESS), and we can check against some of the ways we know AI usually fails catastrophically (like confabulating citations to sources that the AI didn't actually access).

That's what the "Wiki Education in the Age of Generative AI" research team is working on with wikiannotate.org. We want to collect a good dataset of fine-grained article quality assessments from experienced Wikipedians — covering general aspects of quality as well as some of the specific things that AI usually does wrong — so that we can build a tool for quantifying the ways that AI usage impacts article quality. We're looking for editors to help build this dataset, with compensation available for each completed batch of evaluations. Currently we’re offering $21 USD for each batch of 5 articles.

With help from the Wikipedia editing community, we can build on the things that LLMs do well to mitigate some of the problems they are causing. Some of the possible applications include:

  • Automatic detection of and feedback about article quality problems in the work of new editors (including the typical problems we've seen from inexperienced editors since well before the advent of ChatGPT)
  • When paired with AI detectors, a system for catching AI-induced problems even when the AI detector alone doesn't provide certainty (which is typically the case for mixed writing where both human and AI were involved)
  • Systematically measuring article quality changes over time (for Wiki Education's programs and beyond)
  • On-demand tools for editors looking for ways to improve their articles and spot common problems

If you want to help, visit wikiannotate.org to sign up and do some article assessments. Each batch is expected to take 30 to 60 minutes on average, and you can complete multiple batches.
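For a rough sense of what the compensation works out to, the short Python sketch below does the arithmetic using only the figures stated in this article ($21 USD per batch of 5 articles, 30 to 60 minutes per batch on average). The function names are made up for illustration, and the time range is an estimate, so actual results will vary.

```python
BATCH_PAY_USD = 21.0     # stated in the article
ARTICLES_PER_BATCH = 5   # stated in the article


def pay_per_article() -> float:
    """Payment divided evenly across the articles in one batch."""
    return BATCH_PAY_USD / ARTICLES_PER_BATCH


def hourly_rate(minutes_per_batch: float) -> float:
    """Effective hourly rate if one batch takes the given number of minutes."""
    return BATCH_PAY_USD * 60 / minutes_per_batch


print(f"${pay_per_article():.2f} per article")
for minutes in (30, 60):  # the article's estimated range
    print(f"{minutes} min per batch: ${hourly_rate(minutes):.2f} per hour")
```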

(All these em-dashes are my own. I've been overusing em-dashes my entire adult life, and I'm not about to stop.)



Discuss this story


The project seems promising. I've tried to register, but it seems that only contributors with more than 1,000 contributions to the English Wikipedia are eligible. This raises the question of multilingualism. Very often NLP projects are in English and other languages are not even considered. This is known as the Bender Rule in the literature. Is your project English only? Is there any plan to include more languages? PAC2 (talk) 18:46, 22 May 2026 (UTC)

  • I'm not exactly sure what you're trying to do here, but WP:GAN and WP:FAC both track how many reviews/assessments an editor has done. Might be worth asking them to participate. --Sturmvogel 66 (talk) 23:11, 22 May 2026 (UTC)
  • This is an interesting study, and I always appreciate when a research project draws on the nuanced expertise of Wikipedians when evaluating outputs. Having done a batch of annotations, however, I found it challenging to convey my actual assessment of the articles through the questions being posed. I do quite a lot of article evaluation (at Articles for Creation, New Page Patrol, and the Good Article process) and the most important qualities I am checking in student work and in possibly-AI-generated work are notability, encyclopedic tone, and verifiability. The questions here allowed me to express some concerns about source-text integrity (though I badly wanted to be able to note when a source supported only part of a claim, a particularly common AI problem) but not these other concerns. I hope the data you collect will allow you to answer these valuable questions, and I hope you will consider a multi-phase data collection process to refine the information you collect. ~ le 🌸 valyn (talk) 00:57, 23 May 2026 (UTC)
  • I've run into a bug—while completing a batch, my browser crashed (unrelated), and when I returned to the site, the button "Annotate another batch" would only return "Current batch is not complete". However, there seems to be no way to return to the current batch. Also, your contact information only seems to be available on the consent page, not the main page once preliminary information has been given. Crow Basket (talk) 20:56, 24 May 2026 (UTC)
  • Does not allow you to flag plagiarism/copyvio of a source—is this intentional? (t · c) buIdhe 20:32, 25 May 2026 (UTC)
    Good point, that's another thing that is top of my mind when evaluating articles, and a particular problem with both newcomer and LLM-generated text. ~ le 🌸 valyn (talk) 20:45, 25 May 2026 (UTC)


Hi all! I am part of the research team behind this study. Thanks to everyone who engaged in the annotation exercise and for all your feedback on it. I want to try to respond here to the main points that were raised.

---

Is your project English only? Is there any plan to include more languages?

@PAC2 for now, we do not have plans to expand to more languages. You are right in pointing out that the vast majority of NLP projects target only English, and I agree that we need more multilingual studies, but as of now, we do not have enough resources or community interest to support that scale of data collection. Given the scope of our project, in any case, it makes sense to start with English: we are analyzing edits in the context of WikiEdu, which only engages students in the United States and Canada.

---

the most important qualities I am checking in student work and in possibly-AI-generated work are notability, encyclopedic tone, and verifiability. The questions here allowed me to express some concerns about source-text integrity (though I badly wanted to be able to note when a source supported only part of a claim, a particularly common AI problem) but not these other concerns.

and

Does not allow you to flag plagiarism/copyvio of a source—is this intentional?

Thanks for pointing those out, @le 🌸 valyn and @buIdhe (no, it is not intentional). This is all extremely useful. Our rubric was largely inspired by this old community-driven metric used by WP:USPP. Clearly, that assessment is quite old and omits several important dimensions. We tried our best to augment it with things we thought would be relevant, especially considering common pitfalls of AI around fluffy verbosity and hallucinated/unrelated sources. But we would love Wikipedians' input to come up with a better version of this rubric: ideally, this would also be strongly community-driven and the result of many different perspectives. In this sense, I would definitely support an open multi-phase data collection process, taking into consideration your and others' insights. Realistically, whether we'll be able to do that (and think about other potential expansions, e.g., the multilingual element raised by @PAC2) largely depends on the community response to this initial collection, and how much interest there is in the kind of tools that we will build.

---

I've run into a bug—while completing a batch, my browser crashed (unrelated), and when I returned to the site, the button "Annotate another batch" would only return "Current batch is not complete". However, there seems to be no way to return to the current batch. Also, your contact information only seems to be available on the consent page, not the main page once preliminary information has been given.

@Crow Basket, I am sorry you encountered a bug: I will look into this and will try to manually fix the database entry for your submission. As for the contact information, I agree that it should be available at all times: I will add it to the rest of the website. In the meantime, you can refer to our contacts on our research page.

---

Once again, thanks to everyone who has donated their time to engage, annotate articles, and provide feedback. If you wish to participate, you can still do so at https://wikiannotate.org/ (including if you have already annotated one or more batches). Feel free to also share this with others who could be interested. --TriggerOne (talk) 00:54, 26 May 2026 (UTC)

  • I've also tested out a batch, and I think, similarly to LEvalyn, I have encountered difficulties given the structure of the questions. The questions seem to presume that a non-existent source is a fabrication, but it can be hard to differentiate between, e.g., a webpage that is completely made up and one that once existed but no longer does. The question about images seems to assume that there are usable images, and that is often not the case. I am also unsure how LLMs are going to greatly affect image existence. I'm also curious about the MOS question: there are a couple of ways LLM text can obviously break the MOS, but human editing can do that too, and without diving into MOS minutiae, it's hard to get things like headers wrong. The options for source->text support lack an answer for something like "Source exists but doesn't seem relevant to the statement at all". (I also left comments regarding a couple of the introductory survey questions in the answers there.)
    On ease of use for reviewers, it would be nice if the whole reference code was copied over into the question window, to save clicking the source number in the question window and then clicking the source in the text it jumps to in order to finally get to the actual source. It would also be nice if, when I accidentally close the tab (likely to happen when I'm opening and closing sometimes multiple tabs to access a source), reopening the window didn't send me right back to the start of everything (everything including the introductory survey, let alone the batch review). CMD (talk) 12:46, 26 May 2026 (UTC)
  • I annotated a batch and found it an interesting exercise, but answering the questions took me an average of 25-30 minutes per article instead of the estimated 10 minutes. This is because I have to learn a fair bit about a topic to properly evaluate how comprehensive an article about it is, what the best quality sources would be, and whether the article is balanced. It also takes some work to track down whether a claim is properly sourced, and if not, whether it's a verifiable claim that just needs a better citation. I also fixed the problems I found as I went along, but that didn't take much additional time after I dug up the right information. Here are the diffs, in case this is interesting: Spongilla, Tate–LaBianca murders, Cambridge Central Mosque, Linguistic relativity, Attentional control. Dreamyshade (talk) 23:32, 28 May 2026 (UTC)
    Just in case it's useful feedback to the researchers, I'll comment that I took the survey and clicked through to see how labor-intensive the annotations were, and I agree that these look extremely labor-intensive. 10 mins would be enough for a quick spot/gut check, but I think it would take me a lot longer to look up all of the cited sources, do quick searches of my own to determine if important sources are missing, etc. Suriname0 (talk) 21:27, 2 June 2026 (UTC)
    It depends a bit on the length of the article; one or two it gave me were little above stubs and had 2 or 3 sources. That said, I think that for some of the longer articles it showed me, the reason they were selected was that a certain section was edited by an eduWiki student rather than the entire article. Being asked only to check that section, and indeed having the selected source spotchecks derive from that section, would make reviewing more effective and likely provide better data. CMD (talk) 01:44, 3 June 2026 (UTC)

















Wikipedia:Wikipedia Signpost/2026-05-22/Forum