The Signpost


Op-ed

Whither Wikidata?

Wikidata, a Wikimedia project spearheaded by Wikimedia Deutschland, recently celebrated its third anniversary. The project has a dual purpose: 1. Streamline data housekeeping within Wikipedia. 2. Serve as a data source for re-users on the web; in particular, Wikidata is the designated successor to Google's Freebase, designed to deliver data for the Google Knowledge Graph.

We need to talk about Wikidata.

Wikidata, covered in last week's Signpost issue in a celebratory op-ed that highlighted the project's potential (see Wikidata: the new Rosetta Stone), has some remarkable properties for a Wikimedia wiki:

  • A little more than half its statements are unreferenced.
  • Of those statements that do have a reference, significantly more than half are referenced only to a language version of Wikipedia (projects like the English, Latvian or Burmese Wikipedia).
  • Wikidata statements referenced to Wikipedia do not cite a specific article version, but only name the Wikipedia in question.
  • Wikidata has a no-attribution CC0 licence; this means that third parties can use the data on their sites without indicating their provenance, obscuring the fact that the data came from a crowdsourced project subject to the customary disclaimers.
  • Hoaxes long extinguished on Wikipedia live on, zombie-like, in Wikidata.

This op-ed examines the situation and its implications, and suggests corrective action.

But first ...

A little bit of history

Wikidata is one of the younger Wikimedia projects. Launched in 2012, the project's development has not been led by the Wikimedia Foundation itself, but by Wikimedia Deutschland, the German Wikimedia chapter.

The initial development work was funded by a donation of 1.3 million Euros, made up of three components:

The original team of developers was led by Denny Vrandečić (User:Denny), who came to Wikimedia Deutschland from the Karlsruhe Institute of Technology (KIT). Together with Markus Krötzsch (formerly of KIT and the University of Oxford, presently at Dresden University of Technology), Vrandečić co-founded the Semantic MediaWiki project. Since 2013, Vrandečić has been a Google employee; in addition, since summer 2015 he has been one of the three community-elected Wikimedia Foundation board members.

Microsoft co-founder Paul Allen's Institute for Artificial Intelligence provided half the funding for the initial development of Wikidata

Wikimedia Deutschland's original press release, dated 30 March 2012, said,

Wikidata thus has a dual purpose: it is designed to make housekeeping across the various Wikipedia language versions easier, and to serve as a one-stop data shop for sundry third parties.

To maximise third-party re-use, Wikidata—unlike Wikipedia—is published under the CC0 1.0 Universal licence, a complete public domain dedication that waives all author's rights, to the extent allowed by law. This means that re-users of Wikidata content are not obliged to indicate the source of the data to their readers.

In this respect Wikidata differs sharply from Wikipedia, which is published under the Creative Commons Attribution-ShareAlike 3.0 Unported Licence, requiring re-users of Wikipedia content to credit Wikipedia (attribution) and to distribute copies and adaptations of Wikipedia content only under the same licence (Share-alike).

Search engines take on a new role as knowledge providers

Google contributed a quarter of the initial funding for the development of Wikidata, which is now replacing Freebase as one of the sources for the Google Knowledge Graph

The March 30, 2012 announcement of the development of Wikidata was followed six weeks later, on May 16, 2012, by the arrival of a new Google feature destined to have far-reaching implications: the Google Knowledge Graph. Similar developments also happened at Microsoft's Bing. These two major search engines, no longer content to simply provide users with a list of links to information providers, declared that they wanted to become information providers in their own right.

The Google Knowledge Graph, Google said, would enable Internet users

The move makes sense from a business perspective: by trying to guess the information in which people are interested and making that information available on their own pages, search engines can entice users to stay on their sites for longer, increasing the likelihood that they will click on an ad—a click that will add to the search engine's revenue (in Google's case running at around $200 million a day).

Moreover, search engine results pages that do not include a Knowledge Graph infobox often feature ads in the same place where the Knowledge Graph is usually displayed: the right-hand side of the page. The Knowledge Graph thus trains users to direct their gaze to the precise part of a search engine results page that generates the operator's revenue. Alternatively, ads may also be (and have been) inserted directly into the Knowledge Graph itself.

Microsoft's Bing search engine has followed much the same path as Google with its "Snapshot" feature drawing on Wikimedia content

Microsoft's Bing followed a very similar development from 2012 onwards, with Bing's Satori-powered "Snapshot" feature closely mimicking the appearance and content of Google's Knowledge Graph. Bing has used some of the same sources as Google, in particular Wikipedia and Freebase. The latter is a crowdsourced database, published under a Creative Commons Attribution Licence, that Google acquired in 2010.

Neither Freebase nor Wikipedia really profited from this development. Wikipedia noted a significant downturn in pageviews that was widely attributed to the introduction of the Google Knowledge Graph, causing worries among Wikimedia fundraisers and those keen to increase editor numbers. After all, Internet users not clicking through to Wikipedia would miss both the Wikimedia Foundation's fundraising banners and a chance to become involved in Wikipedia themselves.

As for Freebase, Google announced in December 2014, a little over four years after acquiring the project, that it would shut it down in favour of the more permissively licensed Wikidata and migrate its content to Wikidata—Freebase's different Creative Commons licence, which required attribution, notwithstanding.

"The PR Pros and SEOs are Coming"

Freebase was widely considered a weak link in the information supply chain ending at the Knowledge Graph. Observers noted that search engine optimization (SEO) specialists were able to manipulate the Knowledge Graph by manipulating Freebase.

In a Wikidata Office Chat conducted on March 31, 2015, future Wikimedia Foundation board member Denny Vrandečić—juggling his two hats as a Google employee and the key thought leader of Wikimedia's Wikidata project—spoke about Google's transition from Freebase to Wikidata, explaining that Wikidata's role would be slightly different from the role played by Freebase:

Denny Vrandečić, the co-founder of the Semantic MediaWiki project, has to juggle three hats: he is a Google employee as well as a community-elected Wikimedia Foundation board member and the primary Wikidata thought leader

Noam Shapiro, writing in Search Engine Journal, drew the following conclusions from his review of this chat, focusing on the statements highlighted in yellow above:

Shapiro's point concerning spam and bias mentioned "the need for recognized references". This is a topic that we will shortly return to, because Wikidata seems to have adopted a very lax approach to this requirement.

The relationship between Wikidata and Wikipedia: Sources? What sources?

Citations to Wikipedia (blue) outnumber all other sources (red) together (yellow = unreferenced)

The fact that Wikidata and Wikipedia have what seems on the face of it incompatible licences has been a significant topic of discussion within the Wikimedia community. It is worth noting that in 2012, Denny Vrandečić wrote on Meta,

More recently, the approach seems to have been that because facts cannot be copyrighted, mass imports from Wikipedia are justified. The legal situation concerning database rights in the US and EU is admittedly fairly complex. At any rate, whatever licensing qualms Denny may have had about this issue at the time seem to have evaporated. If the original plan was indeed "not [...] to extract content out of Wikipedia at all", then the plan changed.

Bot imports from Wikipedia have long been the order of the day. In fact, in recent months contributors on Wikidata have repeatedly raised alarms about mass imports of content from various Wikipedias, believing that these imports compromise quality (the following quote, written by a non-native speaker, has been lightly edited for spelling and grammar):

The circular reference loop connecting Wikidata and Wikipedia

The result of these automated imports is that Wikipedia is today by far the most commonly cited source in Wikidata.

According to current Wikimedia statistics:

  • Half of all statements in Wikidata are completely unreferenced.
  • Close to a third of all statements in Wikidata are only referenced to a Wikipedia.

References to a Wikipedia do not identify a specific article version; they simply name the language version of Wikipedia. This includes many minor language versions whose referencing standards are far less mature than those of the English Wikipedia. Moreover, some Wikipedia language versions, like the Croatian and Kazakh Wikipedias, are not just less mature, but are known to have very significant problems with political manipulation of content.
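To make the three categories concrete (unreferenced, referenced only to a Wikipedia, referenced to something else), here is a minimal Python sketch. It assumes the entity layout of Wikidata's JSON export, in which each statement carries an optional "references" list and each reference has a "snaks" mapping keyed by property ID, and it assumes that a Wikipedia import is marked by property P143 ("imported from Wikimedia project"). The function names and the sample entity are hypothetical, made up for illustration; this is not the method behind the Wikimedia statistics quoted above.

```python
from collections import Counter

# Assumption: P143 ("imported from Wikimedia project") marks a reference
# that merely names a Wikipedia language version.
IMPORTED_FROM_WIKIMEDIA = "P143"


def classify_statement(statement):
    """Return 'unreferenced', 'wikipedia_only' or 'external' for one statement."""
    # Ignore empty reference stubs that carry no snaks at all.
    references = [r for r in statement.get("references", []) if r.get("snaks")]
    if not references:
        return "unreferenced"
    for ref in references:
        # A reference without P143 points somewhere other than a Wikipedia.
        if IMPORTED_FROM_WIKIMEDIA not in ref["snaks"]:
            return "external"
    return "wikipedia_only"


def tally_entity(entity):
    """Count the statements of one entity by reference category."""
    counts = Counter()
    for statements in entity.get("claims", {}).values():
        for statement in statements:
            counts[classify_statement(statement)] += 1
    return counts


# Hypothetical, hand-made entity; the snak values are placeholders.
SAMPLE_ENTITY = {
    "claims": {
        "P735": [{}],  # no references at all
        "P569": [{"references": [{"snaks": {"P143": [{}]}}]}],
        "P19": [{"references": [{"snaks": {"P248": [{}]}}]}],
        "P27": [{"references": [{"snaks": {"P143": [{}]}},
                                {"snaks": {"P854": [{}]}}]}],
    }
}

print(dict(tally_entity(SAMPLE_ENTITY)))
```

Note what the sketch cannot recover: a reference in the "wikipedia_only" category names a language version but no article revision, so there is nothing further to check it against.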

Recall Shapiro's expectation above that spam and bias would be held at bay by the "need for recognized references". Wikidata's current referencing record seems unlikely to live up to that expectation.

Of course, allowances have to be made for the fact that some statements in Wikidata may genuinely not need a reference. For example, in a Wikidata entry like George Bernard Shaw, the statement "Given name: George" is arguably self-evident and does not need a reference. Wikidata, some may argue, will never need to have 100 per cent of its statements referenced.

However, it does not seem healthy for Wikipedia to be cited more often in Wikidata than all other types of sources together. This is all the more important as Wikidata may not just propagate errors to Wikipedia, but may also spread them to the Google Knowledge Graph, Bing's Snapshot, myriad other re-users of Wikidata content, and thence to "reliable sources" cited in Wikipedia, completing the "citogenesis" loop.

Data are not truth: sometimes they are phantoms

Citogenesis

As the popularity of Wikipedia has soared, citogenesis has been a real problem in the interaction between "reliable sources" and Wikipedia. A case covered in May 2014 in The New Yorker provides an illustration:

It seems inevitable that falsehoods of this kind will be imported into Wikidata, eventually infecting both other Wikipedias and third-party sources. That this not only can, but does happen is quickly demonstrated. Among the top fifteen longest-lived hoaxes currently listed at Wikipedia:List of hoaxes, six (nos. 1, 2, 6, 7, 11 and 13) still have active Wikidata entries at the time of writing. The following table reproduces the corresponding entries in Wikipedia:List of hoaxes, with a column identifying the relevant Wikidata item and supplementary notes added:

Jack Robichaux: Fictional 19th-century serial rapist in New Orleans
  • Length: 10 years, 1 month
  • Start date: July 31, 2005
  • End date: September 3, 2015
  • Links: Wikipedia:Articles for deletion/Jack Robichaux
  • Wikidata item: https://archive.is/Z6Gne (Note: The English Wikipedia link has been updated.)

Guillermo Garcia: "Highly influential" but imaginary oil and forestry magnate in 18th-century South America
  • Length: 9 years, 10 months
  • Start date: November 17, 2005
  • End date: September 19, 2015
  • Links: Wikipedia:Articles for deletion/Guillermo Garcia (businessman)
  • Wikidata item: https://archive.is/0pprA

Gregory Namoff: An "internationally known" but nonexistent investment banker, minor Watergate figure, and U.S. Senate candidate.
  • Length: 9 years, 6½ months
  • Start date: June 17, 2005
  • End date: January 13, 2015
  • Links: Wikipedia:Articles for deletion/Gregory Namoff; Archive
  • Wikidata item: https://archive.is/urElB (Note: 10 months after the hoax article was deleted on Wikipedia, a user added "natural causes" as the manner of death on Wikidata.)

Double Hour: Supposed German and American television show, covering historic events over a two-hour span.
  • Length: 9 years, 6 months
  • Start date: September 23, 2005
  • End date: April 4, 2015
  • Links: Double Hour (TV series) deletion log
  • Wikidata item: https://archive.is/rjFjw (Note: This item has only ever been edited by bots.)

Nicholas Burkhart: Fictitious 17th-century legislator in the House of Keys on the Isle of Man.
  • Length: 9 years, 2 months
  • Start date: July 19, 2006
  • End date: September 26, 2015
  • Links: Wikipedia:Articles for deletion/Nicholas Burkhart
  • Wikidata item: https://archive.is/A0lt7

Emilia Dering: Long-lived article about a non-existent 19th-century German poet, started with the rather basic text "Emilia Dering is a famous poet who was Berlin,Germany on April 16, 1885" by a single-purpose account
  • Length: 8 years, 10 months
  • Start date: December 6, 2006
  • End date: October 6, 2015
  • Links: Emilia Dering deletion log; deleted via A7. On the day of the article's creation, a person claiming to be the granddaughter of Emilia Dering published a blog post with a poem supposedly written by her.
  • Wikidata item: https://archive.is/eNJbc

Using the last entry from the above list as an example, a Google search quickly demonstrates that there are dozens of other sites listing Emilia Dering as a German writer born in 1885. The linkage between Wikidata and the Knowledge Graph as well as Bing's Snapshot can only make this effect more powerful: if falsehoods in Wikidata enter the infoboxes displayed by the world's major search engines, as well as the pages of countless re-users, the result could rightly be described as citogenesis on steroids.

The only way for Wikidata to avoid this is to establish stringent quality controls, much like those called for by Kmhkmh above. Such controls would appear absent at Wikidata today, given that the site managed to tell the world, for five months in 2014, that Franklin D. Roosevelt was also known as "Adolf Hitler". If even the grossest vandalism can survive for almost half a year on Wikidata, what chance is there that more subtle falsehoods and manipulations will be detected before they spread to other sites?

Yet this is the project that Wikimedians like Max Klein, who has been at Wikidata from the beginning, imagine could become the "one authority control system to rule them all". The following questions and answers are from a 2014 interview with Klein:

Given present quality levels, this seems like a nightmare scenario: the Internet's equivalent of the Tower of Babel.

What is a reliable source?

A crowdsourced project like Wikidata becoming "the one authority control system to rule them all" is a very different vision from the philosophy guiding Wikipedia. Wikipedians, keenly aware of their project's vulnerabilities and limitations, have never viewed Wikipedia as a "reliable source" in its own right. For example, community-written policies expressly forbid citing one Wikipedia article as a source in another (WP:CIRCULAR):

Wikidata abandons this principle—doubly so. First, it imports data referenced only to Wikipedia, treating Wikipedia as a reliable source in a way Wikipedia itself would never allow. Secondly, it aspires to become itself the ultimate reliable source—reliable enough to inform all other authorities.

For example, Wikidata is now used as a source by the Virtual International Authority File (VIAF), while VIAF in turn is used as a source by Wikidata. In the opinion of one Wikimedia veteran and librarian I spoke to at the recent Wikiconference USA 2015, the inherent circularity in this arrangement is destined to lead to muddles which, unlike the Brazilian aardvark hoax, will become impossible to disentangle later on.

The implications of a non-attribution licence

The lack of references within Wikidata makes verification of content difficult. This flaw is only compounded by the fact that its CC0 licence encourages third parties to use Wikidata content without attribution.

Max Klein provided an insightful thought on this in the interview he gave last year, following Wikimania 2014:

Klein seems torn between his lucid rational assessment and his appeal to himself to "really believe in the Open Source, Open Data credo". Faith may have its rightful place in love and the depths of the human soul, but our forebears learned centuries ago that when you are dealing with the world of facts, belief is not the way to knowledge: knowledge comes through doubt and verification.

What this lack of attribution means in practice is that the reader will have no indication that the data presented to them comes from a project with strong and explicit disclaimers. Here are some key passages from Wikidata's own disclaimer:

Internet users are likely to take whatever Google and Bing tell them on faith. As a form of enlightenment, it looks curiously like a return to the dark ages.

When a single answer is wrong

Jerusalem—one of the most contested places on earth

This obscuring of data provenance has other undesirable consequences. An article published in Slate this week (Nov. 30, see this week's In the Media) introduces a paper by Mark Graham of the Oxford Internet Institute and Heather Ford of the School of Media and Communication at the University of Leeds. The paper examines the problems that can result when Wikidata and/or the Knowledge Graph provide the Internet public with a single, unattributed answer.

Ford and Graham say they found numerous instances of Google Knowledge Graph content taking sides in the presentation of politically disputed facts. Jerusalem, for example, is described in the Knowledge Graph as the "capital of Israel". Most Israelis would agree, but even Israel's allies (not to mention the Palestinians, who claim Jerusalem as their own capital) take a different view. The controversy is well explained in the lead of the English Wikipedia article on Jerusalem, which tells its readers, "The international community does not recognize Jerusalem as Israel's capital, and the city hosts no foreign embassies." Graham provides further examples in Slate:

Ford and Graham reviewed Wikidata talk page discussions to understand the consensus forming process there, and found users warring and accusing each other of POV pushing—context that almost none of the Knowledge Graph readers will ever be aware of.

In Ford and Graham's opinion, the envisaged movement of facts from Wikipedia to Wikidata and thence to the Google Knowledge Graph has "four core effects":

This is a remarkable reversal, given that Wikimedia projects have traditionally been hailed as bringing about the democratisation of knowledge.

Conclusions

Errors can always be fixed

From my observation, many Wikimedians feel problems such as those described here are not all that serious. They feel safe in the knowledge that they can fix anything instantly if it's wrong, which provides a subjective sense of control. It's a wiki! And they take comfort in the certainty that someone surely will come along one day, eventually, to fix any other error that might be present today.

This is a fallacy. Wikimedians are privileged by their understanding of the wiki way; the vast majority of end users would not know how to change or even find an entry in Wikidata. As soon as one stops thinking selfishly, and starts thinking about others, the fact that any error in Wikidata or Wikipedia can potentially be fixed becomes secondary to the question, "How much content in our projects is false at any given point in time, and how many people are misled by spurious or manipulated content every day?" Falsehoods have consequences.

Faced with quality issues like those in Wikidata, some Wikimedians will argue that cleverer bots will, eventually, help to correct the errors introduced by dumber bots. They view dirty data as a welcome programming challenge, rather than a case of letting the end user down. But it seems to me there needs to be more emphasis on controlling incoming quality, on problem prevention rather than problem correction. Statements in Wikidata should be referenced to reliable sources published outside the Wikimedia universe, just as they are in Wikipedia, in line with the WP:Verifiability policy.

Wikidata development was funded by money from Google and Microsoft, who have their own business interests in the project. These ties mean that Wikidata content may reach an audience of billions. It may make Wikidata an even greater honey pot to SEO specialists and PR people than Wikipedia itself. Wikis' vulnerabilities in this area are well documented. Depending on the extent to which search engines will come to rely on Wikidata, and given the observed loss of nuance in Knowledge Graph displays, an edit war won in an obscure corner of Wikidata might literally re-define truth for the English-speaking Internet.

If information is power, this is the sort of power many will desire. They will surely flock to Wikidata, swelling the ranks of its volunteers. It's a propagandist's ideal scenario for action. Anonymous accounts. Guaranteed identity protection. Plausible deniability. No legal liability. Automated import and dissemination without human oversight. Authoritative presentation without the reader being any the wiser as to who placed the information and which sources it is based on. Massive impact on public opinion.

... to rule them all

As a volunteer project, Wikidata should be done well. Improvements are necessary. But, looking beyond the Wikimedia horizon, we should pause to consider whether it is really desirable for the world to have one authority—be it Google or Wikidata—"to rule them all". Such aspirations, even when flying the beautiful banner of "free content", may have unforeseen downsides when they are realised, much like the ring of romance that was made "to rule them all" in the end proved remarkably destructive. The right to enjoy a pluralist media landscape, populated by players who are accountable to the public, was hard won in centuries past. Some countries still do not enjoy that luxury today. We should not give it away carelessly, in the name of progress, for the greater glory of technocrats.

One last point has to be raised: Denny Vrandečić combines in one person the roles of Google employee, community-elected Wikimedia Foundation board member and Wikidata thought leader. Given the Knowledge Graph's importance to Google's bottom line, there is an obvious potential for conflicts of interest in decisions affecting the Wikidata project's licensing and growth rate. While Google and Wikimedia are both key parts of the world's information infrastructure today, the motivations and priorities of a multi-billion-dollar company that depends on ad revenue for its profits and a volunteer community working for free, for the love of knowledge, will always be very different.




Andreas Kolbe has been a Wikipedia contributor since 2006. He is a member of the Signpost's editorial board. The views expressed in this editorial are his alone and do not reflect any official opinions of this publication. Responses and critical commentary are invited in the comments section.


















Wikipedia:Wikipedia Signpost/2015-12-02/Op-ed