This is a draft of a potential Signpost article, and should not be interpreted as a finished piece. Its content is subject to review by the editorial team and ultimately by JPxG, the editor in chief. Please do not link to this draft as it is unfinished and the URL will change upon publication. If you would like to contribute and are familiar with the requirements of a Signpost article, feel free to be bold in making improvements! · next-next issue draft
Reuters reports that the Wikimedia Foundation is "working with Big Tech on deals similar to its arrangement with Google", in order to monetise AI companies' heavy reliance on Wikipedia content.
Speaking in an interview at the Reuters NEXT summit in New York, Wikipedia co-founder Jimmy Wales said that tech companies' use of freely available Wikipedia content to train their large language models causes cost surges that Wikipedia's nonprofit operator must bear.
While not explicitly mentioned in the Reuters article, Wales was referring to the paid API access offered via the Foundation's for-profit subsidiary Wikimedia LLC as "Wikimedia Enterprise" (see prior Signpost coverage), per the separately published recording of the full interview. (Google was one of its first customers back in 2022, and Wikimedia Enterprise recently reported its "first complete year of profitability" and "that earned revenue has now fully repaid the initial investment in the project" since its inception in 2021.)
Wales said the small donations from the public that form the Foundation's primary source of income were not intended to underwrite the development of multibillion-dollar commercial AI products:
"Wikipedia is supported by volunteers. Those people are donating money to support Wikipedia, and not to subsidize OpenAI costing us a ton of money. That doesn't feel fair."
Asked whether any legal action was being contemplated, Wales replied:
"I don't know. I feel like our ability of soft power to just shame them is probably pretty powerful. Particularly because they are chock full of engineers who love Wikipedia and I don't think their staff are going to like it if ... it just doesn't feel good stealing from babies or Wikipedia."
Bernadette Meehan, due to take over as the new Wikimedia CEO next month, expressed similar sentiments in another Reuters piece:
"Wikipedia content is core to generative AI, but it's often lacking a clear attribution back to Wikimedia sites ... so, core to the conversations that I look forward to having is how to rectify that issue [...] Reuse (of Wikipedia content) is a real challenge. The idea is to help large-scale re-users get the content they need, but in a way that allows all of these Big Tech firms to contribute back to the ecosystem they depend on," Meehan said.
"Because what we don't want to do is change the free and open nature of Wikipedia for everyone else."
The Reuters article quoting Wales was titled "Wikipedia seeks more AI licensing deals similar to Google tie-up, co-founder Wales says". Here "licensing deals" is a rather peculiar wording choice, given that the Wikimedia Foundation does not own the copyright in Wikipedia's content (individual editors do) and thus cannot grant permissions (i.e. licenses) to AI companies for uses that are not already covered by Wikipedia's existing free license. The homepage of Wikimedia Enterprise also clarifies this:
Zero Licensing Fees
Over 99.9% of data available through Wikimedia Enterprise services is under a Creative Commons license, allowing you to put that data to work in the best way for your business.
One can't help noticing a certain conflict of interest on the Reuters journalists' part, given that their own employer was one of the first to strike actual AI licensing deals, more recently followed by other news publishers such as CNN and Fox News. (Several US courts, meanwhile, have so far found that training LLMs on copyrighted content is largely covered by fair use, a possibility already anticipated by WMF earlier; see our 2024 coverage: "AI policy positions of the Wikimedia Foundation".)
Indeed, in his full response to the interview question that gave rise to the Reuters headline, Wales stressed this difference between the Foundation and such publishers, after briefly explaining Wikimedia Enterprise's offerings:
Interviewer: I know you've signed a deal with Google, which now pays you. Are you working on other deals with Big Tech?
Wales: Yeah, we are. So we've got our Enterprise product [where] you can buy a feed from Wikipedia. We actually have the ability to clean it up a bit [by filtering out recent vandalism edits]. [...]
Everything in Wikipedia is open source, freely licensed. [... So] we're in a different position from publishers who also have a lot of copyright concerns. For us, it's really okay, it's freely licensed, but one of the requirements is attribution. Also, just from an ethical point of view, attribution is really important. Like, we teach teenagers this. [...] And the large language models are pretty bad at attribution.
In other parts of the interview Wales discussed the negative impact of AI crawlers in more detail, largely echoing earlier statements by WMF about "How crawlers impact the operations of the Wikimedia projects". Both in the interview and in those earlier statements it remained unclear how much of this burden is due to "Big Tech" companies such as OpenAI (which professes to respect robots.txt with its "GPTBot" training data crawler), as opposed to smaller AI startups or rogue players. In a technical presentation earlier this year "on how the Wikimedia Foundation has been dealing with the problem of incoming traffic from AI companies' web crawlers", Giuseppe Lavagetto (Principal Site Reliability Engineer at WMF) had noted an "impersonators problem":
The majority of the traffic we see identified as "ChatGPTBot" or "ClaudeBot" or "Meta external agents" is coming from a myriad of small and large cloud providers and is NOT coming from OpenAI or Anthropic. This is true across all major crawlers.
In the presentation, Lavagetto described the strain AI crawlers put on WMF's caching infrastructure and its engineers ("Feels a lot like a never-ending game of whack-a-mole"), but also reported that he and his fellow WMF engineers had been quite successful in addressing it, advising other MediaWiki site owners that "7 magic lines of code will solve most of your crawler problems."
That said, the presentation's reassuring claim that WMF's method of blocking crawlers is "without any disruption to our real users" appears to have been a bit overconfident, considering a subsequent Wikitech-l thread where a longtime Wikipedia editor reported encountering broken images and map links. After some investigation, Lavagetto acknowledged that
"you were most likely caught in one of the filtering rules we've created to respond to [a disruptive AI crawler incident]. I am sorry you got caught up in that traffic, but at the time that was the only option we had to keep serving images to a good portion of asia and the americas."
In his interview this month, Wales mentioned another possible way of dealing with such issues on a technical level:
Cloudflare's just come out with this product for publishers to block AI crawlers, because there's so many of them and there's so many ways they can do it. We aren't using that yet, [...and] whether we would do it ourselves or with Cloudflare, I don't know, but you know it's the sort of thing that we might consider.
Cloudflare is a company known among website operators for providing services such as protection from DDoS attacks (occasionally also to Wikimedia sites, see Signpost coverage) – and, among web users that Cloudflare suspects of being bots, for disrupting their internet experience with many captcha-like "I am human" checkboxes. Wales was referring to an announcement by Cloudflare CEO Matthew Prince from earlier this year that his company would repurpose this infrastructure to block not just DDoS attack botnets but also AI crawlers by default, unless their owners pay up in a "marketplace" to be created by Cloudflare – a not entirely unproblematic endeavour, as pointed out by e.g. veteran tech journalist Mathew Ingram ("Does Cloudflare want to protect the web or control it?"). A statement published by Creative Commons last week, while not naming Cloudflare explicitly, likewise raised numerous concerns about such "Pay-to-Crawl" systems.
Perhaps mindful of such concerns, Wales immediately qualified his remark about potentially considering Cloudflare's solution:
"We're a little bit funny because we're quite ideological about certain things. Free access to knowledge is really important to us. And so when we started the Enterprise product, we started charging for the API, but we have very generous limits for researchers and nonprofits and open source projects and things like that, and we'll continue that. But we do think there are these cases where it's like, 'well you're really slamming our servers. You should be using the API and you aren't.'"
Coverage of the most-read articles of 2025 was provided by The Guardian (UK). More audience-specific coverage came from Catholic News Agency, titled "Pope Leo XIV among the most viewed and searched on Wikipedia and Google in 2025".
The 2025 Wikipedia Year in Review is a personalized "wrapped"-style list, accessible only via the Wikipedia mobile app. The feature was covered by Gizmodo, The Verge, techbuzz.ai, and Vice, which called it "Fun, Fascinating, and a Little Bit Terrifying". – B