Wikidata: the new Rosetta Stone

Op-ed

Wikidata: the new Rosetta Stone

Data is beautiful. Data is information.

For this reason among many others, in 2012 the Wikimedia Foundation created Wikidata: a collaborative, multilingual database that aims to provide a common source for certain types of data such as dates of birth, coordinates, names, and authority records, managed collaboratively by volunteers around the world. This means that when a change of government occurs, for example, simply updating the corresponding element on Wikidata will automatically update all the applications that are linked to it, be it Wikipedia or any other third-party application. It means that we do not have to constantly reinvent the wheel. This collaborative model helps to reduce the effects of the existing cultural diglossia, given that small communities can have a greater global impact in a more efficient manner. In the medium term, all Wikidata queries will include data from all over the world, not just from the cultures or historical communities with greater power to influence. A search for “doctors who graduated before they turned 20”, for example, will not only display French and English doctors, but also doctors from Taiwan and Andorra.

This project opens up a whole new world of possibilities, for collaboration and for using the data: the Wikidata game allows users to make thousands of small contributions while playing, even from a mobile phone while waiting for a bus. Inventaire allows people to share their favourite books, and Histropedia offers a new way of visualising history through timelines. Meanwhile, scientists from around the world are uploading their research databases, and the cultural sector is building a database of paintings from all over the world . All of these projects run on the Wikidata engine, which is becoming a new international standard.

And why Wikidata and not some other project? Internet standards do not necessarily become accepted because of their ability to generate authority, but because of their capacity to generate traffic, or their capacity to be updated. The winner is not the best, but the one that can assemble the greatest number of users and be updated more quickly. This is one of the strengths of the Wikidata project, given that thousands of volunteers are constantly updating the information. As a result, any application or project based on big data can take advantage of all of this structured knowledge, and do so free of charge. All of this means that we have to reconsider the role that traditional agents of knowledge (universities, research centres, cultural institutions) want to play, and the role or the possible role of the repositories of authorities around the world, now that new tools are mixing and matching and creating a new centrality.

Cultural institutions, for example, have to deal with the challenge of the lack of standard matching criteria used to document artworks in their catalogues, such as for example: dimensions with frame, without frame, with or without passe-partout, descriptions in text format, number fields… institutions have to bring order to their own data at home before opening up to the world. Being open means interoperability. Many institutions are already adapting: authority file managers such as VIAF are openly collaborating with Wikidata, and the Museum of Modern Art has also started using it in its catalogue. In Catalonia, Barcelona University, in collaboration with Amical Wikimedia, is behind one of groundbreaking projects in this field, which aims to create an open database of all works of Catalan Modernism.

Data is not knowledge. Data is not objective.

The Rosetta Stone

Data in itself is not knowledge. It is information. With the emergence of a new, very dense ecology of data that is accessible to everybody, we run the risk of trying to over-simplify the world: a description, no matter how detailed, will not necessarily make us understand something. Knowing that Dostoyevsky was born in 1821 and died in 1881 and that he was an existentialist is not the same as understanding Dostoyevsky or existentialism. Now more than ever, we need tools that will help us to contextualise information, to develop our own point of view, and to generate knowledge based on this information, in order to promote a society with a strong critical spirit. And we shouldn’t forget that data in itself is not objective either, even though it supposedly purports to be neutral. Data selection is a bias in itself. The decision of whether or not to analyse the gender, origin, religion, height, eye colour, political position, or nationality of a human group can condition the subsequent analysis. Codifying or failing to codify a particular item of information within a data set can both inform and disguise a particular reality. Data is useless without interpretation.

The impact of the emergence of Wikipedia on traditional print encyclopaedias is common knowledge. What will be the impact of Wikidata? In line with the wiki philosophy, the work is done collaboratively in an asymmetric but ongoing process. We can all collaborate in the creation and maintenance of the content, but also of the vocabulary, of the properties of different items, and of the taxonomies used to classify the information. We are deciding how to organise existing information about the world, and we are doing it in an open, participatory manner, as an example of the potential of technology. We know that human knowledge evolves cumulatively, and that Western culture is essentially inherited. Our reality is determined, in a sense, through the technological, social, political, and philosophical advances of those who came before us. This means that today’s generations don’t have to discover electricity all over again, for example. We enjoy the fruits of the efforts of our ancestors. But the Internet, for the first time, allows us to be involved in a phenomenon that will mark human history: we are defining and generating a new information ecosystem that will become the foundation for a possible cognitive revolution. And we are lucky to be able to participate, question, and improve it as it evolves. Together, we can participate in a historic project on a par with humanity’s greatest advances. We can create a new Rosetta Stone that can serve as an open, transparent key to unlock the secrets of today’s world, and perhaps as a documentary source for future generations or civilisations. Let us take responsibility for it.

This article originally appeared on the CCCB Lab blog of the Centre de Cultura Contemporània de Barcelona and is reprinted here with the permission of the author.

Next "Op-ed" →

In this issue

25 November 2015 (all comments)

Discuss this story

These comments are automatically transcluded from this article's talk page. To follow comments, add the page to your watchlist. If your comment has not appeared here, you can try purging the cache.

How old was someone, knowing that he was born in 1821 and died in 1881? Maybe 1881-1821=60 years old. But born 1821-01-01, died 1881-12-31 gives 61 years old, while born 1821-12-31, died 1881-01-01 gives 59 years old. But there are countries where the birth of a child is her first anniversary. But there are lunar years. And what remains is something between 58 and 63 years old. When someone is reported as 1821--1881, this is even worse. And therefore, the question is not about what is written in the database, but about the confidence we can give to the way the data were collected to build the database. E.g. what says Wikidata about the death of Kim Hong-do ? Pldx1 (talk) 08:25, 30 November 2015 (UTC)[reply]

“doctors who graduated before they turned 20” – How would this query look like?--Kopiersperre (talk) 15:41, 30 November 2015 (UTC)[reply]

What do you mean by "and that Western culture is essentially inherited"? 4nn1l2 (talk) 16:45, 30 November 2015 (UTC)[reply]
- My guess is, it's an assertion that other cultures are new or have a new essence. This would be appropriately, pretentiously, silly. Jim.henderson (talk) 22:40, 30 November 2015 (UTC)[reply]

"For this reason among many others, in 2012 the Wikimedia Foundation created Wikidata": I don't really want to be that guy, but this is false. We either write Wikimedia Deutschland or "the Wikimedia movement". Aubrey (talk) 17:47, 30 November 2015 (UTC)[reply]
- Probably is not the best sentence. Post was originally written for a non-wiki audience and was an intend of storifying the message. I do agree with you.--Kippelboy (talk) 09:56, 7 December 2015 (UTC)[reply]

Wikimedia-l discussion, Slate article

There is an ongoing discussion about Wikidata's quality issues and their wider implications on the Wikimedia-l mailing list: http://www.gossamer-threads.com/lists/wiki/foundation/654001

A key fact here is that at present, only about 20% of Wikidata content is referenced to a reliable source. About half is unreferenced, and about a third is only referenced to a Wikipedia. [1]

For wider context, see yesterday's article in Slate exploring the links between Wikidata and Google's Knowledge Graph: "Why Does Google Say Jerusalem Is the Capital of Israel?" Andreas JN 466 15:54, 1 December 2015 (UTC)[reply]

To be fair, regarding the 20% number: let's take a random Featured Article in Wikipedia. Such as Emma Goldman. Looking at the four paragraph intro, it contains tons of information, but only 3 of the claims made in the intro have references. Her founding an anarchist journal? No reference. Her being sentenced to 22 years in prison? No reference. Her date of birth? No reference. There are much more than 15 claims in the intro, but only 3 references. So the 20% of facts in Wikidata having a reference could also be interpreted as a much higher number than what Wikipedia offers. Much more than half of all claims in Wikipedia are without reference, probably much more than 90%. Now, obviously, this is no reason to say all is rosy for Wikidata, because Wikipedia is even worse - but I am questioning whether the metric, as presented here, is very valuable. --denny vrandečić (talk) 22:25, 2 December 2015 (UTC)[reply]

Denny, that's based on a lack of familiarity with citing conventions for article leads. See WP:CITELEAD; it is longstanding practice to use citations sparingly in the lead paragraphs. The lead is intended to summarise the article content; it should not contain anything that isn't covered, and sourced, in the article body. That is where the sources for those statements are found. Andreas JN 466 08:53, 3 December 2015 (UTC)[reply]

You are right, I was unfamiliar with that citing convention (and I like the convention a lot). Of the three claims that I mentioned two have indeed references later (the founding of the magazine and the prison sentence) and one does not (the date of birth). But many claims in the body of the article remain without reference - her list of publications, for example. Or if you take the first paragraph of the article body, it has two references but many more claims (although it is admittedly hard to discern what exactly a reference contains).

I do not say that each of these have to have references. That would make it so much harder to read, and some claims are just obvious. In Wikidata there are claims like "the first name of Emma Goldberg is Emma", which, I mean, does it really need a reference? Or "Living my Life was written by Emma Goldman". Again, does this really need a reference?

What I want to say is - the percentages you mention are hard to interpret. What would be a good number? Is it really captured in a simple number? What is the comparison coming from Wikipedia? A lot of the referencing and citation rules on Wikidata still need to mature. What is a good reference? What needs to be referenced, and what not? Etc. Wikidata is still a young project, and it needs to find its rules. Wikipedia's citation rules were not as developed in 2004 as they are today, and Wikidata needs the time and the opportunity to find the correct set of rules as well. And every Wikipedian is invited to help at Wikidata.

Does this make any more sense? --denny vrandečić (talk) 17:46, 3 December 2015 (UTC)[reply]

The Emma Goldman article became a featured article in 2007, nearly 8 years ago. Quite possibly, it needs some work to make it conform to present-day standards. The birth date certainly should be referenced. Arguably, it is verifiable from the reference present at the end of these three sentences: Emma Goldman was born on June 27, 1869. Her father used violence to punish his children, beating them when they disobeyed him. He used a whip only on Emma, the most rebellious of them.<ref>Chalberg, p. 13.</ref> Chalberg gives the birth date in the same passage (though it is on page 12, not page 13). Would I think that a birth date like that should be referenced in Wikidata? Absolutely. Similarly, most of the bibliography is verifiable, given that each of her works bar one has its own article, complete with bibliographical data. If the biography were at WP:FAC today, I would argue for holding promotion back until at least the ISBN numbers for Goldman's works are included, making verification that these works actually exist a matter of a single click on the ISBN number. Again, if we were in Wikidata, I would consider the addition of a reference like that (i.e. the ISBN number of the book's first edition) essential.

As was recently pointed out by another contributor in the mailing list discussion, Wikidata's role makes it all the more vital that its statements be referenced, because their content is likely to be copied. Given wikis' open structure, it is not uncommon for people to add false information. See for example Wikipedia, the 25–year–old student and the prank that fooled Leveson: An American man wrongly named in the Leveson Report as a founder of The Independent newspaper has expressed surprise that a judge would accept without question information on Wikipedia. Or see the case of Hannibal Fogg, which involved the invention of an author and of books that had never existed. Or see the invention of a film director who had never lived, except on the pages of Wikipedia: The greatest movie that never was. (That is a really, really good article, worth reading for its writing as well as the story it's telling.) Or see the Amelia Bedelia hoax, whose content could conceivably have been included as a statement in Wikidata. See the Brazilian aardvaark story, told in the New Yorker; again this concerns a snippet of information that could easily have been accommodated in Wikidata's statement structure. (As I pointed out on Wikimedia-l, Wikidata said for five months last year that Franklin D. Roosevelt was also known as "Adolf Hitler" – too obvious to be copied by anyone, unlike the Brazilian aardvaark moniker that entered multiple "reliable" sources.) Just today, there is this story on dozens of major news sites: This 'legend' changed a Wikipedia page to sneak backstage at gig.

Wikidata need not and should not fall into the same ditches that plagued Wikipedia during its early years, and still continue to plague it to some extent today. Instead, Wikidata would do well to take the lessons learned in Wikipedia's early years on board, because the danger is that anything present in Wikidata may come to be copied not just across several Wikipedias, but also by Google and multiple third-party sources taking either Google's or Wikidata's or Wikipedia's statement on faith. This could lead to widespread contamination of sources everywhere ("citogenesis on steroids"). Insisting on strict sourcing standards is, in my opinion, absolutely vital, given the role envisaged for Wikidata. Otherwise you are not just creating intractable problems for yourselves, some months or years down the line, but also for all reusers.

One thing I will now go and do, Denny, is insert the reference for Goldman's birth date at the end of that sentence naming it. ;) Andreas JN 466 19:25, 3 December 2015 (UTC)[reply]

Also, Andreas, another difference between Wikipedia and Wikidata is that the latter is growing much faster than the former ever did. Otherwise +1 to your points, especially "Insisting on strict sourcing standards is, in my opinion, absolutely vital, given the role envisaged for Wikidata." (emphasis mine). Ed ^{[talk] [majestic titan]} 02:31, 4 December 2015 (UTC)[reply]

Mass updates

Wikidata has some way to go but has the potential to be a massive help to building and maintaining Wikipedia. For me, the biggest advantage is the ability to store information in once place that's referenced in many Wikipedia articles, and updated suddenly. The example was given of election results; I'm still finding many articles that list incorrect members of parliament or local councillors because they haven't been updated and there's no central reference of which articles contain such information. Another prime example is census data; many UK geography articles still list the population as at the 2001 census, not the (more recent) 2011 census or any of the subsequent population estimates from the Office for National Statistics.

Working through articles that find such information to update them is time consuming and mindnumbingly dull. Because we prefer to write information in prose, writing a bot to do it isn't really an option; using templates could work but would be much harder to update than Wikidata's slick user interface is. Out of date governance and demographic information is a big problem in geographical articles and Wikidata solves that problem for us; that alone is reason enough to embrace it and welcome it with open arms. Yes, it has flaws, but let's remember it's in its infancy. When someone views an article and sees a population figure that's 14 years out of date, it doesn't make us look good. So I say let's put the effort in to make WikiData work for us. Waggers TALK 11:26, 4 December 2015 (UTC)[reply]

What do you think of The Signpost? Share your feedback.

Home

About