The Signpost

Opinion

Google isn't responsible for Wikipedia's mistakes

Zarasophos is currently working on everything related to Jadidism. The views expressed in this article are his alone and do not reflect any official opinions of this publication.
My work, Google's traffic

If you type "Rizaeddin bin Fakhreddin" into Google, Google will give you a list of links and a small box to the right. The first link will probably be to the English Wikipedia article on bin Fakhreddin, created and written by me; this can easily be checked by going into the page history of the article. But most likely you'll never bother to actually click on the article because of that small box to the right. "Rizaeddin bin Fakhreddin was a Tatar scholar and publicist that lived in the Russian Empire and the Soviet Union", it reads.

I typed that sentence. I also put the birth and death dates onto Wikipedia. I uploaded the picture to Wikimedia Commons and put it into the article – or articles, actually, because I also created the article on the German Wikipedia. But now I find this information directly on Google. There is a link to the Wikipedia article, but that may as well be a result of Father Google's omniscient mercy. Nowhere does the box state that it presents the work of an unpaid volunteer next to Google advertisements. The effect is obvious: In a 2017 study, half of the participants attributed what they found in the Knowledge Graph, which is the name of that small box, not to Wikipedia, but to Google.

Only good enough to blame

The Knowledge Graph has recently been in the news for saying that California Republicans are Nazis. The scandal was reported, discussed, closed, opened again and finally forgotten. Conservatives still think Google is biased against them; Google says the whole thing wasn't its fault.

We regret that vandalism on Wikipedia briefly appeared on our search results. This was not the result of a manual change by Google.
— Google press release

No, obviously it wasn't. None of the content you presented there was. That was all Wikipedia's.

But the interesting thing is that in the public eye, this was still Google's fault. Read through the Twitter thread; none of the enraged commenters there seem to believe that this wasn't an action by a Google employee. "Google: Republicans are Nazis", read the headline on the Drudge Report article exposing the issue, and Wired magazine made a whole story out of making clear that the vandalism itself happened on Wikipedia. And all of that while more Wikipedia editors quickly did the dirty work; they hunted down the specific edit that caused the problem, corrected the vandalism and placed the page under semi-protection to prevent copycats. Meanwhile, the Knowledge Graph is still humming along, the ideology section removed, the rest still filled with Wikipedia data, and Google can be happy until the next scandal.

And we are left with a question: Why do we let this happen? Why do we let a multi-billion dollar company exploit us as uncredited mules – at least until it needs someone to shift the blame to? Where is the organization that should be responsible for protecting the rights of its volunteer editors – where is the WMF? Traditionally, Google is one of the biggest sponsors of the Foundation; for example, they chucked Jimmy Wales a $2m grant in 2010, more than they donated in the whole of last year. A few months later, they acquired the knowledge base Freebase, which was to form the basis for the Knowledge Graph, for an undisclosed sum.

Exploiters of free content should give back

After the recent scandal surfaced, the Foundation took an apologetic stance. "We're sorry", its statement seems to say, "and no, online encyclopedias still aren't a bad thing." But on 15 June, WMF executive director Katherine Maher, writing an opinion piece in Wired, saw the other side: "If Wikipedia is being asked to help hold back the ugliest parts of the internet, from conspiracy theories to propaganda, then the commons needs sustained, long-term support", she says, "The companies which rely on the standards we develop, the libraries we maintain, and the knowledge we curate should invest back. And they should do so with significant, long-term commitments that are commensurate with our value we create."

This is a step in the right direction. At the very least, the platform economies of the world should give something back to the largest source of the information they feed their algorithms with. As Maher concludes, "we shouldn’t be afraid to stand up for our value", but maybe it is time we see Google – and Facebook, and Amazon – not only as partners, but also as the ones making huge profits sustained by our unpaid labor.

Discuss this story

My comments which are three months late

I haven't had time to read the Signpost. I just found this. I was expecting to read about the problem of the Knowledge Graph having inaccurate information, which is a frequent complaint on the Help Desk and the Teahouse. This information is not on Wikipedia. I'm not sure where they got it. The person who complained is advised to give feedback to Google. I have done that about one particular mistake many times and gotten no results. Maybe it works for some people. — Vchimpanzee • talk • contributions • 20:36, 17 September 2018 (UTC)

Discussion that was already here

  • Isn't the Wikipedia link to the article minimal attribution? Also, the Google spiders that "catch and cage" the information update pretty quickly, it is these methods Google uses that appear to be the issue, not the vandalism on Wikipedia. Just my thought.--Mark Miller (talk) 01:43, 30 June 2018 (UTC)
  • It's certainly minimal, but I don't think it counts as a formal attribution. When it's in the form of some basic information followed by a Wikipedia link, the section reads more like "Here's some basic information, and a place you can read more about it," rather than "here's some information from this place." A better way for Google to frame the information would be "Rizaeddin bin Fakhreddin was a Tatar scholar and publicist who lived in the Russian Empire and the Soviet Union. (from Wikipedia)," perhaps also including the Creative Commons license display that is conspicuously absent. Aside, though this may be Fair Use, whatever happened to the ShareAlike part of the Wikipedia text's license? lethargilistic (talk) 07:12, 3 July 2018 (UTC)[reply]
  • It's alarming that Wikipedia's automated processes failed to catch and correct this glaringly obvious libel smear. But that's the thing: our automated processes aren't perfect. One minute we complain that ClueBot NG has too many false positives, the next we complain that it's not catching all the vandalism on Wikipedia. Every website in existence, everything that is written—whether it be by humans or by bots—is prone to error. We should never be viewing any source as perfect and infallible. And Wikipedia is no different in this regard. As for me, I've caught vandalism that has lingered on pages for months before. Vandalism is going to slip through, even with the best bots and the best patrollers active 24/7. Perhaps the public should be educated on how they can help fix this vandalism rather than gossiping and commenting unhelpfully about it. —k6ka 🍁 (Talk · Contributions) 04:05, 30 June 2018 (UTC)
  • Thanks for the article, Zarasophos. I agree with the sentiments, but I am not at all surprised. Google benefits from such unclear attribution most of the time, as that 2017 study demonstrated, so these occasional scandals are minor in comparison—especially when the blame can be momentarily shifted when they occur to mitigate even that minor damage. Those who dislike Google and suspect it of whatever agenda will continue to do so and welcome the scandals; the Google loyalists will remain unfazed. Both will largely continue to use Google services anyway. In the end, Google wins at the expense of Wikipedia and we remain invisible all the same.
    Being invisible to the rest of the world as a Wikipedia editor is expected. In fact, at least when it comes to article tone, it's intended. I suspect that what is so frustrating about how Google has implemented the Knowledge Graph is not even the reinforcement of that invisibility, which is unsurprising even if disappointing; it is the fact that the relationship is so one-sided in addition to us being left uncredited. Were Google providing generous financial and programming support to Wikipedia and the Wikimedia Foundation, I doubt Wikipedians like us would be as bitter about the lack of credit; the benefits of the relationship would become pervasive and apparent. When that reciprocation amounts to less than 0.001% of Google's 2017 profits, though, the chagrin begins to show.
    Google is doing exactly as anyone familiar with how the economy works would expect Google to do. If anything, Google is already being gratuitous in its charity. After all, Google is being heralded as such a benevolent god for deigning to invest the equivalent of under 0.19% of its five-year profits into job training for tech companies and nonprofits (assuming zero revenue growth since 2017 and $25/hour for employee labor), which is a percentage that ancient elites would find laughable (and enviable!). The fact that it gave the Wikimedia Foundation alone the equivalent of roughly 0.1% of that investment last year might as well be a miracle.
    I have no faith that this trend will change without first radically changing the economy in which this all occurs, especially not in a way that is beneficial to paid and unpaid laborers alike. Google (or should I say Alphabet?) could be doing more to protect its hegemony, though. Avoiding vulnerability to criticism for errors it did not cause is one; better funding the database upon which it relies for a major feature is another. Both are easily achievable. The Wikimedia Foundation is committed to open knowledge and education, however, and copyright law does not favor Wikipedia in this battle, especially when the opponent is a titan like Google. Moreover, Google can survive the collapse of the Wikimedia Foundation, so its reliance is one of convenience; and even if that were not the case, when have corporations ever planned for long-term sustainability? Unless and until the relationships that characterize our present economy change, Google's—and Amazon's and Facebook's and everyone else's—relationship to Wikipedia probably will not either. —Nøkkenbuer (talkcontribs) 04:27, 30 June 2018 (UTC)[reply]
  • The increasing re-use of Wikipedia content by commercial companies, including Google and Facebook, is a major reason I avoid editing articles most likely to be used by these companies (eg, those on places and people). I don't want to donate my labour to any company if I can avoid it. Nick-D (talk) 11:11, 30 June 2018 (UTC)
  • @Nick-D: As a contrast to your experience, I have been spending quite a lot of time working on improving the coverage, quality, and accuracy of content in power station article infoboxes for the past year or so, and a significant part of my motivation for doing so was because of my desire to make the content more useful to third-party users such as Google (and although Google was certainly not the main type of third-party user I had in mind when I first started, I've realized since then that the benefits from my effort are quite clearly practically realized by Google far more than by any other type of third-party users).
I personally agree that the attribution to Wikipedia on Google search result pages containing content from Wikipedia in sidebar boxes on the search results page is unacceptably poorly done in its current form (an issue that has been bothering me for quite a while), but unfortunately since all Wikipedia content is dual-licensed under CC BY-SA 3.0 and the GFDL with the minimal explicitly specified attribution requirements being nothing more than a simple hyperlink, Google is technically already meeting the minimum obligations for attribution (although the fashion in which they do so is incredibly poorly done and doesn't even make it clear that they're attributing Wikipedia for content, let alone conveying the full scope of what content the attribution applies to — which honestly feels like an extremely insulting move on Google's part), and so sadly there is no real incentive for Google to bother with giving Wikipedia a more appropriate level of clearly defined and scoped attribution. While technically speaking Google does seem to actually currently be in violation of both GFDL and CC BY-SA 3.0 licensing terms due to their complete failure to comply with the requirements regarding copyright/licensing notices and potentially also those regarding redistribution licensing (as well as a few other related issues), I kinda doubt that they would take a complaint about these issues very seriously, and I'm not 100% sure that there isn't a loophole somewhere they could exploit to avoid these requirements for their particular use cases (I also personally don't really care too much about non-major violations of the relicensing terms as long as the rest of the requirements were complied with, although in this case, the rest of the requirements were seemingly not complied with, and so I am still annoyed about this because of that).
With regards to the issue of commercial use in general though, I have no problems with that as long as the attribution is clear, copyright/licensing notices were correctly included, and the redistribution of content does not grossly violate the relicensing terms. So if a company wants to benefit off of reusing content that I created or modified, they are more than welcome to go right ahead — if you aren't accepting of the fact that this type of reuse is allowed, then you shouldn't be editing Wikipedia at all. You don't have to be comfortable with it, but honestly, if you're volunteering your time to edit in order to improve the knowledge on here, shouldn't you be happy when said knowledge gets more exposure & usage? Or are you truly only happy as long as the exposure and usage exclusively happens on Wikimedia Foundation sites? Because that seems rather absurd to me. Actively avoiding making any edits that could potentially result in Google gaining more scrapable data is an utterly terrible idea if for no other reason than the fact that this adversely affects the quality of Wikipedia as a whole. Garzfoth (talk) 17:38, 30 June 2018 (UTC)[reply]
  • We should leave this be, unless we want Google and others to remove the data entirely. Yes, the link is attribution, and Wikipedia gets the initial credit, which should be fine. If the information finds its way verbatim into books and textbooks, hopefully the editors and publishers of those works will know enough to attribute, but search engines using short blurbs to help their readers are best allowed to play amongst our wording. Randy Kryn (talk) 15:21, 30 June 2018 (UTC)
I would just add that if we want to be very technical about what CC BY-SA allows, a licensing notice which explicitly states that the material being reused is available under CC BY-SA is also required in addition to the attribution with a hyperlink. The relevant policy is Wikipedia:Reusing Wikipedia content. With that being said, the Google Knowledge Graph data is generally only a short blurb, and the Wikipedia link makes it fairly clear where it comes from, so I don't think this is a big deal. Mz7 (talk) 20:13, 30 June 2018 (UTC)[reply]
  • "The companies which rely on the standards we develop, the libraries we maintain, and the knowledge we curate should invest back. And they should do so with significant, long-term commitments that are commensurate with our value." I can't imagine this being implemented without an implicit commitment from the Wikipedia community to provide content that is valuable to these companies, a burden that would still fall on the backs of unpaid editors. Donations are an appropriate way to reciprocate but we shouldn't be selling our content.
We also shouldn't take responsibility for how others choose to use our content. Wikipedia is remarkably accurate for an encyclopedia that anyone can edit, but it is not a reliable source and nobody should be republishing anything (from Wikipedia or elsewhere) without doing some basic fact-checking. Blaming Wikipedia for providing bad information wouldn't fly in a high school writing class and it sure as hell shouldn't fly at Google. Using an algorithm to do your heavy lifting does not change this.
We should continue to produce quality content while addressing vandalism, to meet Wikipedia's goals and nobody else's. Republishing with proper attribution doesn't create extra work for us, but it should be understood that what we write is provided "as is" with no guarantee of accuracy. –dlthewave 19:22, 1 July 2018 (UTC)[reply]
  • Oh and those who say that Google doesn't give back, doesn't 50% or so of Wikipedia's traffic come from Google? I think in the mission of spreading knowledge, Google is mostly an ally. --Felipe (talk) 03:04, 3 July 2018 (UTC)
  • Overall, the article makes some good points. However, I feel that several of the things written are completely wrong-headed.

    The article complains that if Google relies on Wikipedia, it should give something back. What do you call the millions of dollars in donation to the WMF? Of course, none of the money reaches the volunteers directly, but that's the entire model of Wikipedia: people are supposed to build the encyclopedia for free.

    Wikipedia and Google have a deeply symbiotic (some would say incestuous) relationship. Around 2005, Wikipedia pages started ranking quite high in Google (and other engines') search results. This brought a big influx of people to the site and arguably made what Wikipedia is today. Even today, much of the activity (especially in the "news" and "popular culture" categories) is driven by Google and other search engine traffic. In return Google gets a passably accurate, not-too-spammy site for people to direct to.

    I understand that people might feel ripped off by Google, but the state of affairs is inherent in the whole model of Wikipedia. Kingsindian   08:33, 3 July 2018 (UTC)[reply]

  • User:Mz7 despite what our policy says, a hyperlink alone seems sufficient. Indeed our own policy on reusing content within Wikipedia requires simply "copied content from [[page name]]; see that page's history for attribution" added to the edit summary, which is information that only appears to our readers if they view the history of the page and can figure out that it was that edit that provided that text. While the author of this article would like more personal credit, the reality on Wikipedia is that our pages have numerous authors, none of whom get more credit than another. It is a collaborative editing project and the purpose is not to create a free-to-read encyclopaedia but to create free content that can be re-used by others for any purpose.
If you consider the photographs I take, where I am sole author, Wikipedia re-uses them and there is no indication on the page whatsoever that I am the author or what licence it is used by (it is CC BY-SA 4.0, which is different to Wikipedia text). You have to know that clicking on the image will deliver the file-description page, and it is there that you will read the relevant attribution and licence details. If you Google for "Ravens of the Tower of London" you'll get a snippet from Wikipedia. The format is a bit different to the above example. It is more clear the text comes from Wikipedia. However the image is curious. If you click on that you get a Google Image page with text "Ravens of the Tower of London - Wikipedia". If you follow the "Visit" button it takes you to the Ravens of the Tower of London page. This is wrong. Firstly they are displaying a full-size image that did not actually come from that Wikipedia page (which only shows a thumbnail). But more importantly their page is the place where they should have the attribution and licence details. So to get to that information, you need to click on the thumb in the Google results, click on the "visit page" to get to Wikipedia, find my photo and know already that you can click on the photo, and then you get the attribution and licence terms. Google should fix that and properly link to the file-description page, which is where they got their image from. The problem is that there is minimal and there is best-practice, and Wikipedia already does minimal internally, so how can it persuade others that they should follow best practice? -- Colin°Talk 07:42, 5 July 2018 (UTC)[reply]
  • @Colin: When reusing that kind of content within Wikimedia Foundation projects such as the English Wikipedia, attribution requirements are minimal because all Wikimedia Foundation project contributors have already released their content under compatible licenses within the project (you agree to this with every edit you make), all project pages already include the full & correct copyright/licensing notices necessary for this type of reuse, and for the specific case of images, the author attribution info is always available by simply clicking on the image (both Mediaviewer and the local Wikipedia copy of/proxy to Wikimedia Commons images show the file's author & copyright information).
Google, in contrast, does not have the appropriate full copyright/licensing notices required under either of the licenses for Wikimedia Foundation project content. For example, if they were to choose to comply with the CC BY-SA 3.0 terms (as complying with GFDL terms would likely be impractical for their use cases), they would need to add a link to http://creativecommons.org/licenses/by-sa/3.0/, and I think they also need to specify the content's license (CC BY-SA 3.0) as well. Garzfoth (talk) 09:23, 5 July 2018 (UTC)[reply]
  • Garzfoth. The difference is not because Wikipedia has a special contract with contributors. The agreement with text contributors is CC BY-SA 3.0 & GFDL. This does not allow Wikipedia to reuse content any differently to how Google or any other reuser does (such as someone who clones Wikipedia). For the issue of images, the image creator often has no special agreement with Wikimedia at all: many of the images on Commons came from third parties (Flickr, etc) and were uploaded without the creator's knowledge.
So both Google and Wikipedia have identical requirements to attribute and display the licence details. Neither chooses to do so on the page where the material is displayed. But at least Wikipedia does so on a page it hosts itself (e.g., a copy of the Commons file description page). Google instead relies on several jumps of third-party hyperlinks to satisfy the terms. I think that is dangerous practice because if someone removes my ravens photo from the article, and Google continue to display it in their snippet, then their use of my image is unlicensed and so a copyright violation. If they linked directly to the Commons file description page, then that would be a bit safer. However even then, my image could be deleted from Commons (unlikely, but technically valid), or renamed. This is one reason Commons is reluctant to rename files, but it just comes from supporting bad practice. -- Colin°Talk 11:42, 5 July 2018 (UTC)
  • @Colin: The agreement between the contributor and the Wikimedia Foundation specifies that attribution via article hyperlink under the terms of the CC BY-SA 3.0 license is acceptable to the contributor (as well as use under GFDL, and alternatively attribution via two other alternative methods). Wikipedia pages already contain the required CC BY-SA 3.0 copyright/licensing notices that allow any content that is compatible with that license to be used within them (the "Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply." bit in the page footer). For the issue of rehosted images, those are rehosted under different terms from normal user-licensed images as you do not hold their copyright and there are already specific exceptions for these cases laid out in the Terms of Use as well as on each individual project.
The requirements differ depending on what is being reused. For Wikipedia text, this can only be reused under either GFDL or CC BY-SA 3.0, and Google is irrefutably not in compliance with either of those licensing terms. For images specifically, in the US and any other areas with equivalent copyright laws in this area, I believe that Google is technically protected by the fact that their use of said images can be considered "fair use" (which is an exemption that Creative Commons licenses respect). Also for images on their image search they are MUCH clearer about the fact that the image is potentially copyrighted content, and they don't seem to be rehosting full resolution content from Wikipedia/Commons both in image search results and in website search results. I am not quite sure if their reuse of text beyond the minima required for any basic generic short website summary in search results could possibly be considered "fair use" though...probably not, especially for use in knowledge graph... Garzfoth (talk) 13:24, 5 July 2018 (UTC)
Garzfoth, my argument is mostly about the images, where Wikipedia is not following best practice, but is a whole lot better than Google. For images on Wikipedia there is no contributor agreement that a hidden hyperlink is acceptable, but at least their hyperlink provides the goods. I think a fair use claim could be used by Google when the image appears as a snippet in a search result that clearly links to Wikipedia. Their fair use argument does not hold when they format the search results as an information box like the example in this article, where Google is effectively acting as an encyclopaedia rather than a web search engine. They don't mention that the image is "potentially copyright content" in the search results at all, only when you click on the image and get the dark Google Images format page, and then that text is generic for all images they display. Their CC BY-SA requirement is for them to display attribution and licence details, which they don't. Expecting the user to hunt through Wikipedia to find such attribution and licence details is not acceptable imo, and liable to break when the article changes. For the Google Images page, they are hosting an enlarged image that does not come from the Wikipedia article thumb, so the CC BY-SA licence best practice is to state where they got the image from, which isn't the Wikipedia article, but the file description page which includes attribution and licence details. -- Colin°Talk 11:23, 7 July 2018 (UTC)
So, who wants to contact the WMF legal department and see if they wish to send a lawyer letter to Google about violating the CC BY-SA license? --Guy Macon (talk) 13:59, 7 July 2018 (UTC)
WMF do not own the Wikipedia content or images. WMF legal represent WMF. The CC BY-SA violation, should it exist, is a legal issue between photographers, writers and Google, not WMF. So I don't think they would be involved. The most I've seen WMF legal do is give hints about certain interpretations of law wrt copyright. Perhaps you could get WMF legal to advise Jimbo about what he might want to say or write. But if Google are a big donor to Wikimedia, then don't hold your breath. -- Colin°Talk 22:07, 7 July 2018 (UTC)[reply]
  • The box itself is something that I like for two reasons: 1) I do use the Wikipedia link on the box often, especially when the Wikipedia result is not the first one and 2) I wrote some articles where Google did not yield meaningful results in the first positions, so now I am glad that it is easier for people to find more about the topic; on the other hand, on the vandalism thing, Google should take more responsibility, and if they think that vandalism is a problem, they should contribute code (they do have antivandalism code themselves) back to our antivandalism bots. I would prefer them to open source all their antivandalism stuff and help us in integrating it with our current systems, rather than taking yet another couple of millions of cash. MarioGom (talk) 07:01, 8 July 2018 (UTC)
  • I am honestly baffled. Isn't the whole point of free content that it can be used and reused without "giving back"? If we wanted people to "give back", we would have a stricter license, easy as that. I find it utterly weird that we write a free encyclopedia, and then complain when people use such free encyclopedia to do stuff. --cyclopiaspeak! 06:17, 10 July 2018 (UTC)

Wikipedia:Wikipedia Signpost/2018-06-29/Opinion