The Signpost

File:Cloudflare office entrance area.jpg
HaeB, 2021
CC BY-SA 4.0
In the media

The "bigg" bosses: Robertsky and the Pope

Wikimedia Foundation wants more AI companies to pay for API access

Reuters reports that the Wikimedia Foundation is "working with Big Tech on deals similar to its arrangement with Google", in order to monetise AI companies' heavy reliance on Wikipedia content.

Speaking in an interview at the Reuters NEXT summit in New York, Wikipedia co-founder Jimmy Wales said that tech companies' use of freely available Wikipedia knowledge to train their large language models results in cost surges that Wikipedia's nonprofit operator must bear.

While not explicitly mentioned in the Reuters article, Wales was referring to the paid API access offered via the Foundation's for-profit subsidiary Wikimedia LLC as "Wikimedia Enterprise" (see prior Signpost coverage), per the separately published recording of the full interview. (Google was one of its first customers back in 2022, and Wikimedia Enterprise recently reported its "first complete year of profitability" and "that earned revenue has now fully repaid the initial investment in the project" since its inception in 2021.)

Wales said the small donations from the public that form the Foundation's primary source of income were not intended to underwrite the development of multibillion-dollar commercial AI products:

"Wikipedia is supported by volunteers. Those people are donating money to support Wikipedia, and not to subsidize OpenAI costing us a ton of money. That doesn't feel fair."

Asked whether any legal action was being contemplated, Wales replied:

"I don't know. I feel like our ability of soft power to just shame them is probably pretty powerful. Particularly because they are chock full of engineers who love Wikipedia and I don't think their staff are going to like it if ... it just doesn't feel good stealing from babies or Wikipedia."

Bernadette Meehan, due to take over as the new Wikimedia CEO next month, expressed similar sentiments in another Reuters piece:

"Wikipedia content is core to generative AI, but it's often lacking a clear attribution back to Wikimedia sites ... so, core to the conversations that I look forward to having is how to rectify that issue [...] Reuse (of Wikipedia content) is a real challenge. The idea is to help large-scale re-users get the content they need, but in a way that allows all of these Big Tech firms to contribute back to the ecosystem they depend on," Meehan said.

"Because what we don't want to do is change the free and open nature of Wikipedia for everyone else."

The Reuters article quoting Wales was titled "Wikipedia seeks more AI licensing deals similar to Google tie-up, co-founder Wales says". Here "licensing deals" is a rather peculiar wording choice, given that the Wikimedia Foundation does not own the copyright in Wikipedia's content (individual editors do) and thus cannot grant permissions (i.e. licenses) to AI companies for uses that are not already covered by Wikipedia's existing free license. The homepage of Wikimedia Enterprise also clarifies this:

Zero Licensing Fees
Over 99.9% of data available through Wikimedia Enterprise services is under a Creative Commons license, allowing you to put that data to work in the best way for your business.

One can't help noticing a certain conflict of interest on the Reuters journalists' part, given that their own employer was one of the first to strike actual AI licensing deals, more recently followed by other news publishers like CNN and Fox News. (By contrast, several US courts have so far found that training LLMs on copyrighted content is largely covered by fair use, a possibility already anticipated by WMF earlier; see our 2024 coverage: "AI policy positions of the Wikimedia Foundation".)

Indeed, in his full response to the interview question that gave rise to the Reuters headline, Wales stressed this difference between the Foundation and such publishers, after briefly explaining Wikimedia Enterprise's offerings:

Interviewer: I know you've signed a deal with Google, which now pays you. Are you working on other deals with Big Tech?

Wales: Yeah, we are. So we've got our Enterprise product [where] you can buy a feed from Wikipedia. We actually have the ability to clean it up a bit [by filtering out recent vandalism edits]. [...]
Everything in Wikipedia is open source, freely licensed. [... So] we're in a different position from publishers who also have a lot of copyright concerns. For us, it's really okay, it's freely licensed, but one of the requirements is attribution. Also, just from an ethical point of view, attribution is really important. Like, we teach teenagers this. [...] And the large language models are pretty bad at attribution.

In other parts of the interview Wales discussed the negative impact of AI crawlers in more detail, largely echoing earlier statements by WMF about "How crawlers impact the operations of the Wikimedia projects". Both in the interview and in those earlier statements it remained unclear how much of this burden is due to "Big Tech" companies such as OpenAI (which professes to respect robots.txt with its "GPTBot" training data crawler), as opposed to smaller AI startups or rogue players. In a technical presentation earlier this year "on how the Wikimedia Foundation has been dealing with the problem of incoming traffic from AI companies' web crawlers", Giuseppe Lavagetto (Principal Site Reliability Engineer at WMF) had noted an "impersonators problem":

The majority of the traffic we see identified as "ChatGPTBot" or "ClaudeBot" or "Meta external agents" is coming from a myriad of small and large cloud providers and is NOT coming from OpenAI or Anthropic. This is true across all major crawlers.
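To illustrate the "impersonators problem" in concrete terms: a site operator cannot trust the User-Agent header alone, but could compare the source IP address of a request against the address ranges that the named crawler's operator publishes. The following is only a minimal sketch under stated assumptions, not WMF's actual method. The names `PUBLISHED_RANGES`, `claimed_crawler` and `is_impersonator` are hypothetical, and the IP ranges are reserved documentation addresses rather than real crawler ranges.

```python
import ipaddress

# Hypothetical table of published crawler IP ranges (an assumption for this sketch).
# The CIDR blocks below are reserved documentation ranges, NOT real crawler addresses.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def claimed_crawler(user_agent):
    """Return the crawler name that the User-Agent string claims to be, or None."""
    ua = user_agent.lower()
    for name in PUBLISHED_RANGES:
        if name.lower() in ua:
            return name
    return None


def is_impersonator(user_agent, remote_ip):
    """True if the request claims a known crawler identity but its source IP
    lies outside every range listed for that crawler."""
    name = claimed_crawler(user_agent)
    if name is None:
        # No known crawler identity is claimed, so there is nothing to impersonate.
        return False
    addr = ipaddress.ip_address(remote_ip)
    return not any(
        addr in ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES[name]
    )
```

In this sketch, a request whose User-Agent contains "GPTBot" but which arrives from an arbitrary cloud provider address would be flagged, while the same User-Agent arriving from a listed range would not.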

In the presentation, Lavagetto described the strain AI crawlers put on WMF's caching infrastructure and its engineers ("Feels a lot like a never-ending game of whack-a-mole"), but also reported that he and his fellow WMF engineers had been quite successful in addressing it, advising other MediaWiki site owners that "7 magic lines of code will solve most of your crawler problems."

That said, the presentation's reassuring claim that WMF's method of blocking crawlers is "without any disruption to our real users" appears to have been a bit overconfident, considering a subsequent Wikitech-l thread where a longtime Wikipedia editor reported encountering broken images and map links. After some investigation, Lavagetto acknowledged that

"you were most likely caught in one of the filtering rules we've created to respond to [a disruptive AI crawler incident]. I am sorry you got caught up in that traffic, but at the time that was the only option we had to keep serving images to a good portion of asia and the americas."

In Wales' interview this month, he mentioned another possibility of dealing with such issues on a technical level:

Cloudflare's just come out with this product for publishers to block AI crawlers, because there's so many of them and there's so many ways they can do it. We aren't using that yet, [...and] whether we would do it ourselves or with Cloudflare, I don't know, but you know it's the sort of thing that we might consider.

Will the receptionist let you pass? (Entrance to the Cloudflare offices in San Francisco, 2021)

Cloudflare is a company known among website operators for providing services such as protection from DDoS attacks (occasionally also to Wikimedia sites, see Signpost coverage) – and, among web users that Cloudflare suspects of being bots, for disrupting their internet experience with many captcha-like "I am human" checkboxes. Wales was referring to an announcement by Cloudflare CEO Matthew Prince from earlier this year that his company would repurpose this infrastructure to not just block DDoS attack botnets but also AI crawlers by default, unless their owners pay up in a "marketplace" to be created by Cloudflare – a not entirely unproblematic endeavour, as pointed out by e.g. veteran tech journalist Matthew Ingram ("Does Cloudflare want to protect the web or control it?"). A statement published by Creative Commons last week, while not naming Cloudflare explicitly, likewise raised numerous concerns about such "Pay-to-Crawl" systems.

Perhaps mindful of such concerns, Wales immediately qualified his remark about potentially considering Cloudflare's solution by stating that:

"We're a little bit funny because we're quite ideological about certain things. Free access to knowledge is really important to us. And so when we started the Enterprise product, we started charging for the API, but we have very generous limits for researchers and nonprofits and open source projects and things like that, and we'll continue that. But we do think there are these cases where it's like, 'well you're really slamming our servers. You should be using the API and you aren't.'"

AK, H

Year in review


Coverage of the most-read articles of 2025 was provided by The Guardian (UK). More audience-specific coverage came from Catholic News Agency, titled "Pope Leo XIV among the most viewed and searched on Wikipedia and Google in 2025".

The 2025 Wikipedia Year in Review is a personalized "wrapped"-style list, accessible only via the Wikipedia mobile app. The feature was covered by Gizmodo, The Verge, techbuzz.ai, and Vice, which called it "Fun, Fascinating, and a Little Bit Terrifying". – B

In brief

Robert Sim at Wikimania 2025, by Chlod, CC BY-SA 4.0






Discuss this story


Thank you for the really excellent summary of the AI crawler issue. Not just repeating statements, but adding context and critical commentary all around, while keeping the piece concise and comprehensible. Well done. Toadspike [Talk] 22:41, 17 December 2025 (UTC)

Also, big congratulations to Robertsky! Toadspike [Talk] 22:43, 17 December 2025 (UTC)
Likewise to Robertsky!! :D Icepinner (Come to Hakurei Shrine!) 23:26, 17 December 2025 (UTC)
Thanks. And thanks to the Signpost team! Never for a moment had I dreamed that I would be in a Signpost headline. – robertsky (talk) 01:25, 18 December 2025 (UTC)

I don't understand. Jimmy Wales says that AI costs Wikipedia a lot of money. Why? How? Smallchief (talk) 23:19, 17 December 2025 (UTC)

Recommend reading the WMF Diff post "How crawlers impact the operations of the Wikimedia projects", especially the section under the heading "65% of our most expensive traffic comes from bots". ☆ Bri (talk) 23:51, 17 December 2025 (UTC)
Thanks, Bri. I'd also like to see a clear expression of how much of WMF's budget this AI poaching costs. Smallchief (talk)
...which was already linked (in this version) in this Signpost story, alongside the more technical presentation by one of the authors of that WMF Diff post.
As detailed here at the time, I found that Diff post rather disingenuous in several aspects. To add another observation, that "65% of our most expensive [i.e. non-cached] traffic comes from bots" datapoint should be seen in context:
  • Bots have always (long before GenAI) made up a substantial share of our overall pageviews. See [5] (click "Last Two Year" --> "All"), e.g. about 27% in December 2020, five years ago today, well before the AI boom. For comparison, that April 2025 Diff post noted that "overall pageviews from bots are about 35% of the total" around that time. 27% to 35% over 4+ years is an increase for sure, but not exactly dramatic. (Another wrinkle here is that shortly after that Diff post came out, it was found that WMF's pageview data was off by large amounts, see Signpost coverage - but let's assume for the sake of argument that the numbers in the Diff post were accurate.)
  • And as for bots being substantially overrepresented among non-cached pageviews (65% vs. 35%): That would likely be the case even if every crawler bot were scrupulously well-behaved (e.g. by throttling its requests). The Diff post pretty much admits this: "While human readers tend to focus on specific – often similar – topics, crawler bots tend to “bulk read” larger numbers of pages and visit also the less popular pages." Or to put it differently: If you check the traffic of one of our very long tail of low-traffic articles - say this one that I just got via Special:Random - you will likely find that its pageviews have consisted largely of bots ("spider"+"automated") for many years already. Yet WMF chose to make this overrepresentation one of the centerpieces of its J'accuse Diff post and insinuated that it is a recent phenomenon (while stopping short of claiming this explicitly), without disclosing how much of an increase these 65% represented compared to earlier years. (And that Diff post has since been endlessly quoted in the media and elsewhere as "AI crawlers are killing Wikipedia" or such. To be sure, such exaggerations are not always WMF's fault. E.g. as we reported in the Signpost at the time, just a few weeks after that Diff post the Foundation saw it necessary to post a statement to correct some major misunderstandings by The Verge and Gizmodo.)
Back to Smallchief's questions: While these additional costs (and the frustrations of SREs who have put in extra work to combat misbehaving crawlers) are surely real, I haven't seen any attempt to actually quantify them. If anyone here has, let us know. It wouldn't surprise me if they represented only a very tiny percentage of WMF's overall budget.
Regards, HaeB (talk) 06:06, 18 December 2025 (UTC)

Re AI crawlers, I'm curious if Wikimedia Enterprise has looked into Really Simple Licensing (RSL), a (relatively) new open content licensing standard. I started an article about RSL on September 10, the same day it was launched, which unfortunately coincided with the Assassination of Charlie Kirk later that day so it might not have gotten as much attention as it would have otherwise. (I recognize that RSL would be of no help regarding crawlers who ignore terms in robots.txt files.) Funcrunch (talk) 01:47, 18 December 2025 (UTC)

RSL seems incompatible with the CC BY-SA license though? – robertsky (talk) 02:50, 18 December 2025 (UTC)
Unsure to be honest; this is not my primary area of expertise. Funcrunch (talk) 03:56, 18 December 2025 (UTC)

Re: John Stossel at RealClearPolitics. I would like to see more investigative analysis of Stossel and the way he is used as a source on Wikipedia. Many of his ideas have been discredited by experts. In the past, I have recommended that the use of Stossel as a source be closely curtailed, and in some instances removed. According to analysts, Stossel has spent decades engaging in climate denial, as only one example of many. Viriditas (talk) 02:20, 18 December 2025 (UTC)

The relevant paragraph from The Spectator interview is:

I [Angus Colwell] get the sense that stasis offends [David] Deutsch: at the end of The Beginning of Infinity, he says that 'stagnation does not make sense'. He speaks with genuine sorrow about a realisation he had about Wikipedia recently. On his computer, he keeps a list of examples of progress happening that he hadn't thought was achievable. The world wide web is one example: 'I couldn't imagine that the authors of all the trillions of documents in the world would go to the trouble of converting them all to HTML.' Yet earlier this year, he had to delete Wikipedia from the list. 'I originally thought it couldn't possibly work, and then I added it to the list when it appeared to work well for years. Now... it's succumbed to the woke plague. I've avoided using it for several years now, and I crossed it off the list.' He sighs. 'I was right the first time.'[1]

So perhaps not as interesting as the authors might have expected. 5225C (talk • contributions) 02:30, 18 December 2025 (UTC)

References

  1. ^ Colwell, Angus (13 December 2025). Gove, Michael (ed.). "Infinite wisdom: An interview with the physicist David Deutsch". The Spectator Australia. Vol. 359, no. 10, 294–6. pp. 32–33. ISSN 0038-6952. Retrieved 18 December 2025.


















Wikipedia:Wikipedia Signpost/2025-12-17/In_the_media