The Signpost


In the media

The "bigg" bosses: Robertsky and the Pope

Wikimedia Foundation wants more AI companies to pay for API access


Reuters reports that the Wikimedia Foundation is "working with Big Tech on deals similar to its arrangement with Google", in order to monetise AI companies' heavy reliance on Wikipedia content.

Speaking in an interview at the Reuters NEXT summit in New York, Wikipedia co-founder Jimmy Wales said that tech companies' use of freely available Wikipedia knowledge to train their large language models results in cost surges that Wikipedia's nonprofit operator must bear.

While not explicitly mentioned in the Reuters article, Wales was referring to the paid API access offered via the Foundation's for-profit subsidiary Wikimedia LLC as "Wikimedia Enterprise" (see prior Signpost coverage), per the separately published recording of the full interview. (Google was one of its first customers back in 2022, and Wikimedia Enterprise recently reported its "first complete year of profitability" and "that earned revenue has now fully repaid the initial investment in the project" since its inception in 2021.)

Wales said the small donations from the public that form the Foundation's primary source of income were not intended to underwrite the development of multibillion-dollar commercial AI products:

"Wikipedia is supported by volunteers. Those people are donating money to support Wikipedia, and not to subsidize OpenAI costing us a ton of money. That doesn't feel fair."

Asked whether any legal action was being contemplated, Wales replied:

"I don't know. I feel like our ability of soft power to just shame them is probably pretty powerful. Particularly because they are chock full of engineers who love Wikipedia and I don't think their staff are going to like it if ... it just doesn't feel good stealing from babies or Wikipedia."

Bernadette Meehan, due to take over as the Wikimedia Foundation's new CEO next month, expressed similar sentiments in another Reuters piece:

"Wikipedia content is core to generative AI, but it's often lacking a clear attribution back to Wikimedia sites ... so, core to the conversations that I look forward to having is how to rectify that issue [...] Reuse (of Wikipedia content) is a real challenge. The idea is to help large-scale re-users get the content they need, but in a way that allows all of these Big Tech firms to contribute back to the ecosystem they depend on," Meehan said.

"Because what we don't want to do is change the free and open nature of Wikipedia for everyone else."

The Reuters article quoting Wales was titled "Wikipedia seeks more AI licensing deals similar to Google tie-up, co-founder Wales says". Here "licensing deals" is a rather peculiar wording choice, given that the Wikimedia Foundation does not own the copyright in Wikipedia's content (individual editors do) and thus cannot grant permissions (i.e. licenses) to AI companies for uses that are not already covered by Wikipedia's existing free license. The homepage of Wikimedia Enterprise also clarifies this:

Zero Licensing Fees
Over 99.9% of data available through Wikimedia Enterprise services is under a Creative Commons license, allowing you to put that data to work in the best way for your business.

One can't help noticing a certain conflict of interest on the Reuters journalists' part, given that their own employer was one of the first to strike actual AI licensing deals, more recently followed by other news publishers like CNN and Fox News (whereas several US courts have so far found that training LLMs on copyrighted content is largely covered by fair use, a possibility already anticipated by the WMF earlier; see our 2024 coverage: "AI policy positions of the Wikimedia Foundation").

Indeed, in his full response to the interview question that gave rise to the Reuters headline, Wales stressed this difference between the Foundation and such publishers, after briefly explaining Wikimedia Enterprise's offerings:

Interviewer: I know you've signed a deal with Google, which now pays you. Are you working on other deals with Big Tech?

Wales: Yeah, we are. So we've got our Enterprise product [where] you can buy a feed from Wikipedia. We actually have the ability to clean it up a bit [by filtering out recent vandalism edits]. [...]

Everything in Wikipedia is open source, freely licensed. [... So] we're in a different position from publishers who also have a lot of copyright concerns. For us, it's really okay, it's freely licensed, but one of the requirements is attribution. Also, just from an ethical point of view, attribution is really important. Like, we teach teenagers this. [...] And the large language models are pretty bad at attribution.

In other parts of the interview Wales discussed the negative impact of AI crawlers in more detail, largely echoing earlier statements by WMF about "How crawlers impact the operations of the Wikimedia projects". Both in the interview and in those earlier statements it remained unclear how much of this burden is due to "Big Tech" companies such as OpenAI (which professes to respect robots.txt with its "GPTBot" training data crawler), as opposed to smaller AI startups or rogue players. In a technical presentation earlier this year "on how the Wikimedia Foundation has been dealing with the problem of incoming traffic from AI companies' web crawlers", Giuseppe Lavagetto (Principal Site Reliability Engineer at WMF) had noted an "impersonators problem":

The majority of the traffic we see identified as "ChatGPTBot" or "ClaudeBot" or "Meta external agents" is coming from a myriad of small and large cloud providers and is NOT coming from OpenAI or Anthropic. This is true across all major crawlers.
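The standard countermeasure for this kind of impersonation is to trust a crawler's self-declared User-Agent only when the request also arrives from IP ranges that the crawler's operator has published. The following is a minimal illustrative sketch only – it is not the Foundation's actual filtering logic, and the address ranges are placeholder documentation blocks rather than any company's real ones:

    import ipaddress

    # Placeholder ranges for illustration only (RFC 5737 documentation blocks),
    # NOT the real published ranges of OpenAI or Anthropic.
    PUBLISHED_CRAWLER_RANGES = {
        "gptbot": ["192.0.2.0/24"],
        "claudebot": ["198.51.100.0/24"],
    }

    def is_genuine_crawler(user_agent: str, client_ip: str) -> bool:
        """Accept a self-declared crawler only if its source IP is in the operator's ranges."""
        ip = ipaddress.ip_address(client_ip)
        for bot, ranges in PUBLISHED_CRAWLER_RANGES.items():
            if bot in user_agent.lower():
                return any(ip in ipaddress.ip_network(r) for r in ranges)
        return False  # does not claim to be a known crawler

    # A request announcing itself as GPTBot from an unrelated cloud IP fails the check:
    print(is_genuine_crawler("Mozilla/5.0 (compatible; GPTBot/1.0)", "203.0.113.7"))  # False

Whatever form the Foundation's own rules take, the point of Lavagetto's observation stands: the User-Agent header on its own proves nothing about who is actually sending the traffic.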

In the presentation, Lavagetto described the strain AI crawlers put on WMF's caching infrastructure and its engineers ("Feels a lot like a never-ending game of whack-a-mole"), but also reported that he and his fellow WMF engineers had been quite successful in addressing it, advising other MediaWiki site owners that "7 magic lines of code will solve most of your crawler problems."

That said, the presentation's reassuring claim that WMF's method of blocking crawlers is "without any disruption to our real users" appears to have been a bit overconfident, considering a subsequent Wikitech-l thread where a longtime Wikipedia editor reported encountering broken images and map links. After some investigation, Lavagetto acknowledged that

"you were most likely caught in one of the filtering rules we've created to respond to [a disruptive AI crawler incident]. I am sorry you got caught up in that traffic, but at the time that was the only option we had to keep serving images to a good portion of asia and the americas."

In his interview this month, Wales mentioned another possibility for dealing with such issues on a technical level:

Cloudflare's just come out with this product for publishers to block AI crawlers, because there's so many of them and there's so many ways they can do it. We aren't using that yet, [...and] whether we would do it ourselves or with Cloudflare, I don't know, but you know it's the sort of thing that we might consider.

TKTK
Will the receptionist let you pass? (Entrance to the Cloudflare offices in San Francisco, 2021)

Cloudflare is a company known among website operators for providing services such as protection from DDoS attacks (occasionally also to Wikimedia sites, see prior Signpost coverage) – and, among web users whom Cloudflare suspects of being bots, for disrupting their internet experience with captcha-like "I am human" checkboxes. Wales was referring to an announcement earlier this year by Cloudflare CEO Matthew Prince that his company would repurpose this infrastructure to block not just DDoS botnets but also AI crawlers by default, unless their owners pay up in a "marketplace" to be created by Cloudflare – a not entirely unproblematic endeavour, as pointed out for example by veteran tech journalist Mathew Ingram ("Does Cloudflare want to protect the web or control it?"). A statement published by Creative Commons last week, while not naming Cloudflare explicitly, likewise raised numerous concerns about such "Pay-to-Crawl" systems.

Perhaps mindful of such concerns, Wales immediately qualified his remark about potentially considering Cloudflare's solution:

"We're a little bit funny because we're quite ideological about certain things. Free access to knowledge is really important to us. And so when we started the Enterprise product, we started charging for the API, but we have very generous limits for researchers and nonprofits an open source projects and things like that, and we'll continue that. But we do think there are these cases where it's like, "well you're really slamming our servers. You should be using the API and you aren't."

AK, H

Year in review

TKTK

Coverage of the most-read articles of 2025 was provided by The Guardian (UK). More audience-specific coverage came from the Catholic News Agency, in an article titled "Pope Leo XIV among the most viewed and searched on Wikipedia and Google in 2025".

The 2025 Wikipedia Year in Review is a personalized, "wrapped"-style list, accessible only via the Wikipedia mobile app. The feature was covered by Gizmodo, The Verge, techbuzz.ai, and Vice, which called it "Fun, Fascinating, and a Little Bit Terrifying". – B

In brief

Robert Sim at Wikimania 2025, by Chlod, CC BY-SA 4.0



Do you want to contribute to "In the media" by writing a story or even just an "in brief" item? Edit next week's edition in the Newsroom or leave a tip on the suggestions page.























