The Signpost

In focus

Multilingual Wikipedia

Denny Vrandečić was the Wikidata director until September 2013 and was a member of the Wikimedia Foundation board of trustees from July 2015 to April 2016. He earned a PhD at the Karlsruhe Institute of Technology. He now works at Google. -S

Wikipedia’s mission is to allow everyone to share in the sum of all knowledge. Wikipedia is in its twentieth year, and it has been a success in many ways. And yet, it still has large knowledge gaps, particularly in language editions with smaller active communities. But not only there – did you know that only a third of all topics that have Wikipedia articles have an article on the English Wikipedia? Did you know that only about half of articles in the German Wikipedia have a counterpart on the English Wikipedia? There are huge amounts of knowledge out there that are not accessible to readers who can read only one or two languages.

And even if there is an article, content is often very unevenly distributed, and where one Wikipedia has long articles with several sections, another Wikipedia might just have a stub. And sometimes, articles contain very outdated knowledge. When London Breed became mayor of San Francisco, nine months later only twenty-four language editions had listed her as such. Sixty-two editions listed out-of-date mayors – and not only Ed Lee, who was mayor from 2011, but also Gavin Newsom, who was mayor from 2004 to 2011, and Willie Brown, who was mayor from 1996 to 2004. The Cebuano Wikipedia even lists Dianne Feinstein, who was mayor from 1978 to 1988, more than a decade before Wikipedia was even created.

This is no surprise, as half of the Wikipedia language editions have fewer than ten active contributors. It is challenging to write and maintain a comprehensive and current encyclopedia with ten people in their spare time. It cannot be expected that those ten contributors keep track of all the cities in the world and update their mayors in Wikipedia. In many cases those contributors would prefer to work on other articles.

Wikidata to the rescue?

This is where Wikidata can help. And in fact, it does: of the twenty-four Wikipedia language editions that listed London Breed as mayor, eight got that information from Wikidata, and were up-to-date because of that. But Wikidata cannot really tell the full story.

Ed Lee, then mayor of San Francisco, died of cardiac arrest in December 2017. London Breed, as the president of the board of supervisors, became acting mayor, but in order to deny her the advantage of the incumbent, the board voted in January 2018 to replace her with Mark Farrell as interim mayor until the special elections to finish the term of Ed Lee were held in June. London Breed won the election and became mayor in July, serving until the next regular election a year later, which she also won.

Now there are many facts in there that can be represented in Wikidata: that there was a special election for the position of the mayor of San Francisco, that it was held in June, that London Breed won that election. That there was an election in 2019. That Mark Farrell held the office from January to July. That Ed Lee died of cardiac arrest in December 2017.

But all of these facts don’t tell a story. Whereas Wikidata records these facts, they are spread throughout the wiki, and it is very hard to string them together in a way that allows a reader to make sense of them. Even worse, these facts are just a very small subset of the billions of such facts in Wikidata, and for a reader it is hard to figure out which are relevant and which are not. Wikidata is great for answering questions, creating graphs, allowing data exploration, or making infobox-like overviews of a topic, but it is really bad at telling even the rather simple story presented above.

We have a solution for this problem, and it’s quite marvelous: language. Language is expressive, it can tell stories, and it is ideally suited to knowledge transfer. But there are many languages in the world, and most of us speak only a few of them. This is a barrier to the transfer of knowledge. Here I suggest an architecture to lower this barrier, deeply inspired by the way language works.

Imagine for a moment that we start abstracting the content of a text. Instead of saying "in order to deny her the advantage of the incumbent, the board voted in January 2018 to replace her with Mark Farrell as interim mayor until the special elections", imagine we say something more abstract such as elect(elector: Board of Supervisors, electee: Mark Farrell, position: Mayor of San Francisco, reason: deny(advantage of incumbency, London Breed)) – and even more, all of these would be language-independent identifiers, so the content would actually look more like Q40231(Q3658756, Q6767574, Q1343202(Q6015536, Q6669880)). At first glance, this looks much like a statement in Wikidata, but merely by putting it in a series of other such abstract statements, and adding some connective tissue between these bare-bones statements, we inch much closer to what a full-bodied text needs.
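As an illustration only – none of these classes exist in any Wikimedia codebase, and the names are mine – the nested constructor calls above can be sketched as a small tree of language-independent nodes:

```python
from dataclasses import dataclass, field
from typing import List, Union

# Hypothetical encoding of "abstract content": each node is a
# language-independent constructor (a Q-identifier) applied to arguments,
# which are either plain Q-identifiers or nested nodes.
@dataclass
class Node:
    constructor: str
    args: List["Value"] = field(default_factory=list)

Value = Union[str, Node]

# The example from the text: elect(Board of Supervisors, Mark Farrell,
# deny(advantage of incumbency, London Breed))
content = Node("Q40231", [
    "Q3658756",
    "Q6767574",
    Node("Q1343202", ["Q6015536", "Q6669880"]),
])

def serialize(value: Value) -> str:
    """Render a node tree in the compact notation used in the text."""
    if isinstance(value, Node):
        inner = ", ".join(serialize(a) for a in value.args)
        return f"{value.constructor}({inner})"
    return value

# serialize(content) reproduces the notation from the article:
# "Q40231(Q3658756, Q6767574, Q1343202(Q6015536, Q6669880))"
```

The point of the tree shape is that it carries no language-specific words at all; everything a reader eventually sees has to come from a later rendering step.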

A new project: a wiki for functions

But obviously, we wouldn’t show this abstract content to the readers. We still need to translate the abstract content into natural language. So we would need to know that the elect constructor mentioned above takes the three parameters in the example, and that we need a template such as {elector} elected {electee} to {position} in order to {reason} (something that looks much easier in this example than it is for most other cases). And since such translators have to be created for every supported language, we need a place where a community can create and maintain them.
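To make the idea concrete, here is a toy sketch of such a translator. The templates, label table, and slot names are all hypothetical, loosely following the example in the text; a real renderer would need grammar-aware functions for case, agreement, and word order rather than plain string templates.

```python
# Hypothetical per-language templates for an "elect" constructor.
TEMPLATES = {
    ("elect", "en"): "{elector} elected {electee} to {position}",
    ("elect", "de"): "{elector} wählte {electee} zum {position}",
}

# Illustrative label table; in a real system these would be Wikidata labels.
LABELS = {
    "board":   {"en": "the Board of Supervisors", "de": "der Stadtrat"},
    "farrell": {"en": "Mark Farrell",             "de": "Mark Farrell"},
    "mayor":   {"en": "Mayor of San Francisco",
                "de": "Bürgermeister von San Francisco"},
}

def render(constructor: str, args: dict, lang: str) -> str:
    """Fill the constructor's template for `lang` with localized labels."""
    slots = {name: LABELS[value][lang] for name, value in args.items()}
    return TEMPLATES[(constructor, lang)].format(**slots)

args = {"elector": "board", "electee": "farrell", "position": "mayor"}
# render("elect", args, "en")
#   → "the Board of Supervisors elected Mark Farrell to Mayor of San Francisco"
```

Note that only the template and label tables differ per language; the abstract content and the rendering machinery are shared, which is exactly where the hoped-for savings come from.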

For this I propose a new Wikimedia project, preliminarily called Wikilambda (and I am terrible with names, so I do not expect the project to be actually called this). Wikilambda would be a new project to create, maintain, manage, catalog, and evaluate a new form of knowledge assets: functions. Functions are algorithms, pieces of code, that translate input into output in a determined and repeatable way. A simple function, such as the square function, could take the number 5 and return 25. The length function could take a string such as "Wikilambda" and return the number 10. Another function could translate a date in the Gregorian calendar to a date in the Julian calendar. And yet another could translate inches to centimeters. Finally, one other function, more complex than any of those examples, could take an abstract content such as Q40231(Q3658756, Q6767574, Q1343202(Q6015536, Q6669880)) and a language code, and give back the text "In order to deny London Breed the incumbency advantage, the Board of Supervisors elected Mark Farrell Mayor of San Francisco." Or, for German, "Um London Breed den Vorteil des Amtsträgers zu verweigern, wählte der Stadtrat Mark Farrell zum Bürgermeister von San Francisco."
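The simpler of these examples can be written down directly. As ordinary pure functions they are deterministic and repeatable – the same input always yields the same output – which is exactly the property Wikilambda would catalog and test:

```python
# The simple example functions from the text, as pure functions.
def square(n: float) -> float:
    """square(5) returns 25."""
    return n * n

def length(s: str) -> int:
    """length("Wikilambda") returns 10."""
    return len(s)

def inches_to_centimeters(inches: float) -> float:
    """Convert inches to centimeters (1 inch is defined as exactly 2.54 cm)."""
    return inches * 2.54
```

The calendar conversion and the text renderer are functions of exactly the same shape, just with much more involved bodies.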

Wikilambda will allow contributors to create and maintain functions, their implementations and tests, in a collaborative way. These include the available constructors used to create the abstract content. The functions can be used in a variety of ways: users can call them from the Web, but also from local machines or from an app. By allowing the functions in Wikilambda to be called from wikitext, we also create a global space for maintaining global templates and modules, a long-standing wish of the Wikimedia communities. This will allow more communities to share expertise and make life easier for other projects such as the Content Translation tool.

This will allow the individual language communities to use text generated from the abstract content, and fill some of their knowledge gaps. The hope is that writing the functions that translate abstract content, while complex, is still much less work than writing and maintaining a full-fledged encyclopedia. This will also allow smaller communities to focus on the topics they care about – local places, culture, food – and yet have up-to-date coverage of globally relevant topics.

What do you think?

To make it absolutely clear: this proposal does not call for the replacement of the current Wikipedias. It is meant as an offer to the communities to fill in the gaps that currently exist. It would be presumptuous to assume that a text generated by Wikilambda would ever achieve the brilliance and subtlety that let many of our current Wikipedia articles shine. And although there are several advantages for many parts of the English Wikipedia as well (say for global templates or content that is actually richer in a local language), I would be surprised if the English Wikipedia community were to widely adopt what Wikilambda offers early on. But it seems hard to overestimate the effect this proposal could have on smaller communities, and eventually on our whole movement, bringing us a bit closer to our vision of a world in which everyone can share in the sum of all knowledge.

I invite you to read my recently published paper detailing the technical aspects and an upcoming chapter discussing the social aspects of this proposal. I have discussed this proposal with several researchers in many related research areas, with members of different Wikimedia communities, and with folks at the Wikimedia Foundation, to figure out the next steps. I also invite you to discuss this proposal, in the Comments section below, or on Meta, on Wikimedia-l, or with me directly. I am very excited to work toward it and I hope to hear your reservations and your ideas.

Update (May 8, 2020): An official proposal for Wikilambda is now up on Meta. Discussion and support can be expressed there.


Discuss this story

  • I was surprised to learn that almost half of the articles in German don't have English language concurrent articles. This is not very healthy as we have so many notable Germans with no mentions in English Wikipedia and we are being deprived of vital information. Now I am not suggesting that we create English language articles for all of those, it would be too time-consuming. But at least for a certain number of such articles, it is well worth our while to do what's necessary. We could set a target that within the next 2 years, necessary steps would be taken to cover let's say 75% of the German articles in English instead of the present 50%. That would add a huge number of articles of notable German individuals and subjects, wouldn't it. If this cannot be presently achieved for some reason, we could at least create redirect pages in English Wikipedia with the appropriate short description, defaultsort, language links and appropriate categories with the resulting redirect page pointing to the German page for now. Or, instead of a redirect page, a template would be created that will inform the reader that there is actually a German language article for the said individual or subject. This could also similarly apply for say Italian, Spanish and French language articles as a start, possibly also Arabic, Portuguese and Chinese at a later stage. The "translate" feature would do the rest once we are led to the language page of the non-existent English language article. werldwayd (talk) 20:08, 26 April 2020 (UTC)[reply]
  • I think a first step would be to assess why we don't have an English article matching one on another wiki. Is it solely because of a language barrier? Is the concept covered by a different article? Does the explanation lie in differing notability guidelines between projects? A translate-a-thon is not a bad idea in theory but further analysis would be useful in better understanding the underlying issues. Nikkimaria (talk) 20:29, 26 April 2020 (UTC)[reply]
  • Here's a quick query that shows you a few hundred German articles without one in English (should be easy to change the language on that): https://w.wiki/P7d
Probably even more interesting is this, the list of articles that exist both in the German and Spanish Wikipedia, but not in English: https://w.wiki/P7f
This is just for starting this investigation, obviously, we should have a much deeper analysis, I agree with that. --denny vrandečić (talk) 20:55, 26 April 2020 (UTC)[reply]
  • Just because the article is in German and there's no English equivalent doesn't mean that the article is about a German subject. I've encountered many articles in German about Australian athletes. Hawkeye7 (discuss) 01:23, 27 April 2020 (UTC)[reply]
  • I have translated a few articles from German and other languages to English. In some other cases, the article I wanted to "port" had insufficient references per enwp requirements. This is not necessarily a surmountable problem with just more translators. Notability and BLP documentation standards differ between different Wikipedia language editions, and enwp seems to have a relatively high bar. Example, I just created a stub for Paragraf Lex. It has zero references on Serbian Wikipedia. It appears to be somewhat notable; at least, it is cited quite a few times by English Wikipedia. ☆ Bri (talk) 21:00, 26 April 2020 (UTC)[reply]
  • Indeed, I'd expect us to have the highest bar as the largest Wikipedia (by some metrics). Lower bars encourage expansion of content and accumulating an editor base, which is good for small Wikipedias, whilst higher bars encourage improving the quality of already-existing content. Nonetheless, we do have systemic biases and no doubt there is a lot of useful translation that can be done, even if the aim is not to make an article for every subject on the German Wikipedia. — Bilorv (talk) 21:18, 26 April 2020 (UTC)[reply]
  • When we talk about articles existing in one language Wikipedia not being present in the English language Wikipedia, it doesn't always prove that there is a deficit needing to be addressed. For example, a few years back I was surprised to find the Bulgarian language Wikipedia has an article on every Consul of the Roman Empire known -- who number about 1,400 between 30 BC & AD 235. (en.wikipedia has somewhat more than 1,000.) I was impressed by that, & took a close look at a few ... only to find they were the most basic of stubs, consisting of little more information than "X was a politician of ancient Rome. X was a consul in year A with Y as his colleague", & some fancy templates. (Google translate works wonders in cases like this.) No sense translating stubs like these to the English Wikipedia; we create enough stubs on our own. -- llywrch (talk) 03:40, 27 April 2020 (UTC)[reply]
  • "Almost half" means there are more than one million articles in the German language-version of Wikipedia that are not in the English version. So, regarding "We could set a target that within the next 2 years, necessary steps would be taken to cover let's say 75% of the German articles in English instead of the present 50%", there is no feasible path - today - to translate 500,000 articles in two years. There probably isn't a feasible path to translate even 5,000 articles, if by "feasible" we mean "finding volunteers who speak both languages fluently, and aren't busy doing other things". If we're going to get massive amounts of content from one language version into other language versions, the only way to do that is with computer-based processes, lightly reviewed by humans. Or by a donation of several billion dollars from a very well-endowed foundation or philanthropist. -- John Broughton (♫♫) 23:18, 26 April 2020 (UTC)[reply]
    What we really need is an automated translation tool that translates everything but the text. By which I mean the templates, links, tables, categories etc. This would greatly reduce the effort required. Hawkeye7 (discuss) 01:23, 27 April 2020 (UTC)[reply]
    But what we normally don't need is machine-translated content...have a look at pages for translation; there, many articles which have been machine-translated are listed, and are waiting to be evaluated/translated/copyedited, some of them for years. What we regulars over there have learned: it's much more efficient and less time-consuming to write articles from scratch, using the foreign-language sources used in the other language article. Of course only if no English sources can be found. Lectonar (talk) 10:22, 27 April 2020 (UTC)[reply]
    The Content translation tool can offer that really well! @Amire80: for cc --denny vrandečić (talk) 16:07, 27 April 2020 (UTC)[reply]
    Thanks, denny vrandečić :)
    Hawkeye7, yes! Computers should translate what they can translate reliably: code, data, etc., and humans should translate prose. I'd go even further and say that ideally humans really should not translate things that can be reliably translated by computers. A good multilingual system should strive for this: automate everything that can be reliably automated. I'm also fine with Denny's proposal in principle, because to the best of my understanding, what it suggests is auto-generating boilerplate prose from data reliably, while allowing people to write their own prose.
    As Denny says, Content Translation kind of does it, although not perfectly. It's pretty good at transferring links, images, and tables. Links are mostly a breeze, if the article exists in the language into which you are translating and they are connected using a Wikidata sitelink. Images are a breeze if they are on Commons. It's not perfect with complex tables because they are, well, complex, especially those that have a lot of columns and are too wide to fit in a narrow column, but yeah, it kind of works. (The real solution for complex tables is to try thinking of storing what they show in a proper database, and then get the data to display in articles using queries that auto-adapt to different media. It would be a difficult project that will require a lot of infrastructure, but it's worth thinking about. But I digress.)
    Categories are a breeze, as long as directly corresponding categories have been created in the language into which you are translating. What often happens in practice is that the English Wikipedia's category tree is more complex because it has more articles and more need for deeper categories, so categories have to be manually added after the article is created.
    Another thing you didn't mention is language-independent content, most notably math formulas.
    And this brings us to templates, which are the biggest pain. Translating templates works nicely in Content Translation if the corresponding template exists in the language into which you are translating, and all of its parameters have been correctly mapped using TemplateData. Templates and modules are currently stored on each wiki separately, so this has to be done for every template in every wiki, and in practice this doesn't scale. It must be possible to make templates shareable, or global. I wrote a proposal for this: mw:Global templates/Proposed specification, short version. Your opinion about this is very welcome.
    Lectonar, you are generally right, but here's the more nuanced approach to machine translation. If machine translation of a text from another language is posted as a Wikipedia article, this is worse than useless. This absolutely must not be done, ever. If, however, machine translation is used by a responsible human who corrects all the mistakes that it made and makes sure that the text reads naturally, has true information, and is well-adapted to the reader in the target language, and then this text is posted as a Wikipedia article, then it's indistinguishable from translation. If machine translation helped this human translator do it more quickly, then it was beneficial.
    Some people who translate texts find machine translation totally useless and prefer to translate everything from scratch. This is totally fine, too, but it's not absolute: There are also people who find that machine translation makes them more efficient. As long as they use it responsibly and post text only after verifying every word, this is okay. --Amir E. Aharoni (talk) 07:38, 28 April 2020 (UTC)[reply]
  • Responding more directly to the article, thanks again to denny vrandečić for mentioning global templates. The two ideas are indeed related, although I'd say that we should make it possible to have globally shareable modules and templates first, and then proceed to complete something like Wikilambda. Here's my rationale: The syntax for writing modules and templates is familiar to many editors. The global templates proposal specifically says that it doesn't suggest changing anything about the programming languages in which they are written: wiki syntax, parameters, parser functions, Lua, etc. It only suggests a change in where they are stored: transitioning from the current state, in which modules and templates can only be stored and used on a single wiki, to having the option of storing some of them, those that are useful to multiple wikis, on a shared repository (while preserving the option of having local modules and templates). This is similar to having the option to store images on Commons. I've read the whole Wikilambda paper, and my impression is that while it's probably not finalized, it's already clear enough that the proposed Wikilambda "renderer" language is a new thing that will be significantly different from wiki syntax and Lua. This is legitimate, but it makes a lot more sense to start a global repository from something familiar. In addition, developing a global repository will probably require updating some things deep in the core MediaWiki platform so that the performance of change propagation across sites will be better, and this will also improve the performance of some existing things, most importantly Commons and Wikidata. Once this core improvement is done for familiar things like images (Commons), structured data (Wikidata), and wiki syntax (modules and templates), it will be easy to build Wikilambda upon it. --Amir E. Aharoni (talk) 08:12, 28 April 2020 (UTC)[reply]
    This makes a lot of sense to me, but does not strike me as a bottleneck. It will take some time to set up the basic site + processes for Wλ, and we can pursue a "global sharing" framework at the same time which will be useful for Wλ once it gets underway. – SJ + 23:22, 5 May 2020 (UTC)[reply]
  • If I undertake to write an article for English Wikipedia, I generally have to find English-language sources. German-language sources aren't much help to my English-speaking readers.
In effect, notability is defined per-language. For any particular article in German Wikipedia, the topic may not be suitable for English Wikipedia, if there are not enough appropriate English-language sources. Bruce leverett (talk) 18:57, 1 May 2020 (UTC)[reply]
This is not an issue. According to WP:NOENG, English sources are preferred in the English Wikipedia, but non-English sources are allowed, and this is sensible. As for notability, my understanding is that this Multilingual Wikipedia / Abstract Wikipedia / Wikilambda proposal doesn't intend to force any article to appear in any language, but only to give an easier way to auto-generate basic articles in languages that are interested in them. --Amir E. Aharoni (talk) 12:16, 2 May 2020 (UTC)[reply]
Non-English sources are "allowed", but, to repeat what I said above, "German-language sources aren't much help to my English-speaking readers". Yes, I expect people to read the footnotes, and click on them. I understand that for some articles, including some that I have worked on, this isn't an issue. But the implication is that creating a non-stub English-language version of a foreign-language article is more than just running Google translate and fixing up the results -- much more. I'm not scoffing; in many cases of, for example, chess biographies, I have yearned to be able to transplant the knowledge from a foreign-language article to English. But it's only a little less work than writing a new article from scratch. Bruce leverett (talk) 18:42, 2 May 2020 (UTC)[reply]
@Bruce leverett: Agreed. There's also nothing that would stop us from using a cite mechanism in the Abstract Wikipedia that prefers sources in the language of the Wiki when available, and only falls back to sources in other languages if none is given. I guess it is still better to have a source in a foreign language than have no source at all, but I totally understand and agree with the idea that sources in the local language should be preferred on display. --denny vrandečić (talk) 20:48, 11 May 2020 (UTC)[reply]
"I expect people to read the footnotes, and click on them" You're going to be very disappointed. Not only do they not read the footnotes, sometimes they post questions on the talk page admitting that they didn't read the article. Hawkeye7 (discuss) 23:26, 5 July 2020 (UTC)[reply]
  • I have read both the Signpost article and the separate article. In theory, "Wikilambda" sounds like a good idea. In practice, however, I think it would be too complex to implement. I have been interested in, and have had some ability in, computer programming since I was first exposed to computers in the late 1970s. I have been a Wikipedia editor for more than a decade, and have translated hundreds of Wikipedia articles from another language (mostly either German or Italian) into English (usually with the assistance of Google translate). But I really doubt whether I would have the computing skills to contribute anything to "Wikilambda"; I found it difficult enough to draft the code necessary to transclude Wikidata information into Wikipedia infoboxes, to such an extent that I had to procure another Wikipedia editor to help me. "Wikilambda" seems so hard to grasp that I don't think I would even try to get involved in it. Maybe if the proposer could convince enough computer geeks who speak more than one language fluently to become contributors to "Wikilambda", then it might be able to get off the ground. But I have my doubts. Bahnfrend (talk) 13:07, 2 May 2020 (UTC)[reply]
@Bahnfrend: But isn't that true for Wikipedia in general? We have people with different skill sets working together. Bots written in Python, templates written with many curly braces, modules in Lua, tables, images, categories, and beautiful natural language text.
The important part is that the actual content can be contributed by many people, because that is where we must make sure that the barrier is low. This is what the project really needs to get right, and it devotes quite a few resources to this challenge.
For Wikilambda itself, yes, that's a very different kind of beast – and it will have a different kind of community with a different set of contributors. But they don't have to be the same contributors that contribute to the content of the Abstract Wikipedias. But again, as in Wikipedia, we will have volunteers with different skill sets working together and achieving more than they could alone. --denny vrandečić (talk) 20:53, 11 May 2020 (UTC)[reply]
  • Always a promising idea, and seems very doable now that related substructure is available. Let us! I suggest EO as an early target wiki, for all of the reasons. – SJ + 23:22, 5 May 2020 (UTC)[reply]
Thank you! --denny vrandečić (talk) 20:49, 11 May 2020 (UTC)[reply]
Answered there, thanks! --denny vrandečić (talk) 20:49, 11 May 2020 (UTC)[reply]
Wikipedia:Wikipedia Signpost/2020-04-26/In_focus