The Signpost

Op-ed

Wikipedia's lead sentence problem

Thomas Spencer Baynes, genius or pedant?

In the 9th edition of the Encyclopædia Britannica, editor Thomas Spencer Baynes introduced the convention of including a person's birth and death years after their name in all biographical articles:

CAMPBELL, John, LL.D. (1708–1775), a miscellaneous author, was born at Edinburgh, March 8, 1708.

This allowed a reader to more easily distinguish between the 100+ notable people named John Campbell (only one of whom was actually lucky enough to get an article in the 9th edition). Although this convention was a bit awkward and redundant, it served a useful purpose (in the absence of disambiguation pages), and was kept in all subsequent editions.

When Wikipedia was created in 2001, it sought to emulate the successful model of the Encyclopædia Britannica and many editors adopted the convention of including birth and death years in the lead sentence.[1] Here is the lead sentence for Christopher Columbus as it appeared on June 13, 2001:

Christopher Columbus (1451?–1506) was a probably Genovian sailor who crossed the Atlantic in service of Spain.

Little did Thomas Spencer Baynes realize that Wikipedia editors would eventually expand on his convention, including not only birth and death years, but entire birth and death dates, birth and death dates in alternate calendars, birth and death locations, alternate names, maiden names, foreign names, pronunciations, foreign pronunciations, and transliterations. Fifteen years later, here's what Christopher Columbus's lead sentence had become:

Christopher Columbus (/kəˈlʌmbəs/; Ligurian: Cristoffa Combo; Italian: Cristoforo Colombo; Spanish: Cristóbal Colón; Portuguese: Cristóvão Colombo; Latin: Christophorus Columbus; born between 31 October 1450 and 30 October 1451 in Genoa – died on 20 May 1506 in Valladolid) was an Italian explorer, navigator, colonizer, and citizen of the Republic of Genoa.

Flesch Reading Ease scores for the lead sentence of Christopher Columbus from 2002 to 2016

What began as a concise, encyclopedic sentence had slowly grown into a sprawling mess of multiplying metadata, a sentence so densely packed as to be unreadable.[2] This isn't just a subjective opinion. If you chart the Flesch Reading Ease score of the sentence over the years, you'll see an almost continuous decline since 2002. Nor is this an isolated example. The metadata virus has spread from biographical articles to other subjects as well, like geography:

Israel (/ˈɪzrəl/; Hebrew: יִשְׂרָאֵל Yisrā'el; Arabic: إِسْرَائِيل Isrāʼīl), officially the State of Israel (Hebrew: מְדִינַת יִשְׂרָאֵל Medīnat Yisrā'el [mediˈnat jisʁaˈʔel]; Arabic: دَوْلَة إِسْرَائِيل Dawlat Isrāʼīl [dawlat ʔisraːˈʔiːl]), is a country in the Middle East, on the southeastern shore of the Mediterranean Sea and the northern shore of the Red Sea.
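For readers who want to try the readability measurement mentioned above: the Flesch Reading Ease score is 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word), and higher scores mean easier reading. The following is only a minimal Python sketch. The function names are made up for illustration, and the syllable counter is a crude vowel-group heuristic, not the tool used for the chart, so absolute scores will differ from published figures.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption): one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    # Naive sentence split on ., ! and ?; a date like "1451?" would be miscounted as a sentence end.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Only ASCII letters count as words here, so IPA and non-Latin scripts are ignored.
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))


if __name__ == "__main__":
    simple = "Columbus was a sailor who crossed the Atlantic."
    cluttered = (
        "Columbus, an Italian explorer, navigator, colonizer, and citizen of "
        "the Republic of Genoa, crossed the Atlantic in service of Spain."
    )
    print(round(flesch_reading_ease(simple), 1))
    print(round(flesch_reading_ease(cluttered), 1))
```

Because this sketch skips anything that is not a plain ASCII word, it would actually understate how hard the IPA-laden sentences quoted here are to read.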

The problem has become so noticeable that many reusers of Wikipedia content (including the WMF itself) have started stripping out parenthetical phrases from the lead sentence in certain contexts. If you search for "Christopher Columbus" on Google, you'll see a much more digestible description, both in the Knowledge Graph and under the Wikipedia search result. If you turn on the Page Previews beta feature in your preferences and hover over Christopher Columbus, you'll also see a much shorter version. The Wikipedia apps even experimented with removing parenthetical phrases from the lead sentences in the articles themselves. This has led to heated debates about whether or not we are potentially removing important information (as some parenthetical phrases consist of content other than metadata). Without a clear way to identify which parenthetical phrases are useful and which are detrimental, I'm sure these issues will remain unresolved. What's really needed is a vigorous debate by the Wikipedia community about how to bring this problem under control and make our articles readable again.
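To make the trade-off concrete, here is a minimal Python sketch of the bluntest possible approach: delete everything inside parentheses. It is an illustration only. The function name is hypothetical, and this is not a description of how Google, Page Previews, or the Wikipedia apps actually process lead sentences.

```python
import re


def strip_parentheticals(sentence: str) -> str:
    """Drop every parenthesized span, including nested ones, then tidy the spacing."""
    kept = []
    depth = 0
    for ch in sentence:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(ch)
    text = "".join(kept)
    text = re.sub(r"\s+", " ", text)            # collapse the gaps left behind
    text = re.sub(r"\s+([,.;:])", r"\1", text)  # no space before punctuation
    return text.strip()


if __name__ == "__main__":
    print(strip_parentheticals(
        "Christopher Columbus (1451?–1506) was a sailor."
    ))
    # The blunt rule also discards a parenthetical that carries real content:
    print(strip_parentheticals(
        "Genghis Khan was the founder and Great Khan (emperor) of the Mongol Empire."
    ))
```

The second call shows why the debate is heated: a rule this simple cannot tell "(emperor)" from a pronunciation guide, so useful content is removed along with the metadata.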

If we don't take significant steps to address this problem, the metadata disease is only going to keep multiplying and spreading. If left unchecked, I fear this is what our future will look like:

[Excerpt from the Americapedia article about Wikipedia, copyright 2034, used with permission.]

...Like frogs in a pot of boiling water, the proliferation of lead sentence metadata happened so slowly that no one noticed until 2021 when John Seigenthaler's son published a devastating video on ClickNews in which he read aloud the lead sentence of his Wikipedia article, and then wept for 3 minutes.

John Michael SeigenthalerQ1701714 on Wikidata (English pronunciation: /ˈdʒɑn ˈmaɪkəl ˈsiːɡənθɔːlər/ ; German pronunciation: [ˈjuːˈan ˈmaɪkəl ˈziːkənθɔːlər] ; born December 21, 1955 in Nashville, TennesseeQ23197 on Wikidata, current resident of Weston, ConnecticutQ662537 on Wikidata (as of 2008), not yet deceased), also known as John Seigenthaler Jr. (English pronunciation: /ˈdʒɑn ˈsiːɡənθɔːlər ˈdʒunjəɹ/ ; German: John Seigenthaler jünger, pronounced [ˈjuːˈan ˈziːkənθɔːlər ˈdʒunjəɹ] ), is an American news anchor, most recently working for ClickNews.

Seigenthaler's video caught the attention of the recently re-elected Donald Trump, who only weeks before had dissolved The New York Times and Washington Post by executive order. Trump immediately posted a flurry of tweets eviscerating the venerable online encyclopedia. By the next day, Wikipedia was no more.

Let's avoid this sorry fate and make Wikipedia great again!

  1. ^ German Wikipedia also adopted the convention of preceding all death dates with a dagger (called a "Kreuz" in German), which has led to endless debates about whether or not the symbol is Christian and thus inappropriate to use for non-Christian biographies. Luckily, such a convention doesn't seem to exist in English encyclopedias!
  2. ^ Another famous example:
    Genghis Khan (English pronunciation:/ˈɡɛŋɡɪs ˈkɑːn/ or /ˈɛŋɡɪs ˈkɑːn/;[1][2]; Cyrillic: Чингис Хаан, Chingis Khaan, IPA: [tʃiŋɡɪs xaːŋ] ; Mongol script: , Činggis Qaɣan; Chinese: 成吉思汗; pinyin: Chéng Jí Sī Hán; probably May 31, 1162[3] – August 25, 1227), born Temujin (English pronunciation: /təˈmɪn/; Mongolian: Тэмүжин, Temüjin IPA: [tʰemutʃiŋ] ; Middle Mongolian: Temüjin;[4] traditional Chinese: 鐵木真; simplified Chinese: 铁木真; pinyin: Tiě mù zhēn) and also known by the temple name Taizu (Chinese: 元太祖; pinyin: Yuán Tàizǔ; Wade–Giles: T'ai-Tsu), was the founder and Great Khan (emperor) of the Mongol Empire, which became the largest contiguous empire in history after his death.

Discuss this story


Join the RfC in response to this article. A L T E R C A R I 06:59, 19 June 2017 (UTC)

Brilliant insight

Sometimes a problem is right in front of you, but you don't notice it until someone else points it out, at which point you see it everywhere. This essay is that sort of eye-opener. --Guy Macon (talk) 22:26, 8 May 2017 (UTC)

Exactly. Thank you Kaldari for bringing this up. --Saqib (talk) 06:24, 12 May 2017 (UTC)

Solutions

Some {{infobox medical condition}} introduced a |pronounce= a while ago, which I think is a good solution. Alternate names/languages could be handled the same way in articles with infoboxes, e.g., as documented at Template:Infobox settlement#Name and transliteration.

Etymology is an endless problem (e.g., in anatomy articles), with some editors wanting it to be the first thing that you read, others wanting it last, and others not wanting it included at all. WhatamIdoing (talk) 18:53, 12 May 2017 (UTC)

Moving information to infoboxes seems like the right way to go. Inline parentheticals should be limited to what helps disambiguate the subject from plausible alternatives. – SJ + 22:16, 7 June 2017 (UTC)
It is key that the first sentence of English Wikipedia be in English as much as possible. Not sure which language pronunciations are written in, but it is not one I can read. Doc James (talk · contribs · email) 19:50, 8 June 2017 (UTC)
Yeah it's unfortunate that we waste the most valuable real-estate in the article for information that only 0.01% of readers are both interested in and can understand (don't quote me on that statistic). Kaldari (talk) 20:14, 8 June 2017 (UTC)
This is described as a metadata explosion issue. Wasn't Wikidata created as a solution to metadata surfacing issues? Maybe original language pronunciations etc. could be toggled by the user, to go fetch them from Wikidata? By the way, I agree that this is a problem for many articles. - Bri (talk) 21:32, 8 June 2017 (UTC)
  • Are there two date typos in this article? 2034 and 2021 are not here yet. Otherwise, very good use of examples. Regards, JoeHebda • (talk) 02:34, 9 June 2017 (UTC)
  • I am in violent agreement with this op-ed. The fnords are out of control and we need to take what steps we can to rein them in. --David Eppstein (talk) 03:41, 9 June 2017 (UTC)
  • +1 @Kaldari. KISS principle! All the extra information would fit neatly into infoboxes for those who need/want it (and let it be pulled from Wikidata) rather than clogging up the lead. Make it happen number one. -- billinghurst sDrewth 04:39, 9 June 2017 (UTC)
    • +1. Given Wikidata's motive to "collect structured data", I guess it stands as a natural and great choice to stop this metadata issue. This does require a progressive change as it affects the way people write/edit articles. This does require the consensus of the community. Hope it works! - - Kaartic correct me, if i'm wrong 10:23, 12 June 2017 (UTC)
  • This op-ed should be deepsixed. I was following along until it went into political April Fool's prank territory. This is a formal request to have this op-ed oversighted; just as the Trump/Wales April Fool's prank article ended up at Arb. Thank you. Cheers! {{u|Checkingfax}} {Talk} 05:36, 9 June 2017 (UTC)
  • Total agreement. I don't have a solution, but recognize the same problems and have also heard 'casual' users around me mention this at times. Maybe it would be interesting to do some sandbox experiments with alternate presentation styles of the same article. --TheDJ (talk · contribs) 06:26, 9 June 2017 (UTC)
    • For years, I have diverted the top pronunciation as footnote "[p]" (where "p" means "pronounce") to explain spoken form, and we could also have footnote "[d]" for long dates and places beyond years "1510-1588" plus footnote "[aka]" for aliases of "Genghis Khan" in 4 other languages. So, this is a simple problem to fix, while adding extra details in footnotes or wikilinks for birthdate (like Jimbo's two birthdates), Elvis is "still alive" or nn% frequency of alias usage. We need to standardize pronunciation footnote "[p]" with "[d]" and "[aka]" (or such). -Wikid77 (talk) 22:13, 9 June 2017 (UTC)
  • We've had this problem for a long time on medical articles as well, and we've been trying to move this type of data into the infobox. It would be nice to look at some of those articles as well, to see if the trend is any different. Carl Fredrik talk 07:12, 9 June 2017 (UTC)
  • Couldn't agree more. You could also have mentioned the unsightly and patronising proliferation of the respell templates which means that even very obvious words often have their pronunciation glossed in two different ways. I would get rid of respell completely, as it serves no real purpose and clogs up the leads of articles. --John (talk) 11:21, 9 June 2017 (UTC)
  • Another +1 from me. If the purpose of this information is disambiguation, it should form part of the article title. If the purpose is to serve metadata then it belongs in the article's infobox. By using those two principles we can solve the problem completely. Waggers TALK 13:05, 9 June 2017 (UTC)
  • +1 I have been vigorously moving pronunciations of medications and diseases to the infobox. I have been moving etymology to the end of the lead or to the body. The first sentences of our article MUST be in English as much as possible on English Wikipedia. All pronunciations belong outside of the first sentence (in the infobox maybe) for all topics IMO. So do non English spellings. Only common alternative names and at most two should be allowed in the first sentence. Maybe we need a stronger policy about readable language? Doc James (talk · contribs · email) 15:11, 9 June 2017 (UTC)
  • This is described as a metadata explosion issue. Wasn't Wikidata created as a solution to metadata surfacing issues? Maybe original language pronunciations etc. could be toggled by the user, to go fetch them from Wikidata? I agree that this is a problem for many articles. - Bri (talk) 15:29, 9 June 2017 (UTC)
  • I don't see it as a problem - Once I see useless info I just skip to the next paragraph. As long as one paragraph doesn't have semi-useless info in the middle (but rather always at the end) it works OK. Ariel. (talk) 19:18, 9 June 2017 (UTC)
    • That seems like a dangerous assumption to make. All of the examples given in the article have material information in the first paragraph following the bracketed nonsense. And I'd argue that the fact that we have mostly-useless information that readers have to skip over is a problem in itself. Caeciliusinhorto (talk) 22:19, 9 June 2017 (UTC)
  • I sense this is an issue that deserves a more extensive discussion that the Talk page of a Signpost article really isn't the best place for. Would someone like to start a thread at WP:Village pump, where it will attract input from a wider group? -- llywrch (talk) 21:37, 9 June 2017 (UTC)
  • +1 Send to Infobox. Failing that, to a later paragraph. Jim.henderson (talk) 23:14, 9 June 2017 (UTC)
  • Strongly agree - Doc James' solution is correct. But often there isn't an infobox, or you don't want it there, and most leads are too short, so a quick para at the end of the lead can hold it. Johnbod (talk) 02:34, 10 June 2017 (UTC)
  • I would favour reducing this to years only for biographies. For towns, significant alternative names are fine, but pronunciation guides could be moved to infoboxes. All the best: Rich Farmbrough, 12:41, 10 June 2017 (UTC).
  • +1 infobox, -1 footnote. Keep it as structured as possible. Widefox; talk 17:42, 10 June 2017 (UTC)
  • I agree wholeheartedly with the sentiment expressed in the editorial. I also think Wikid77's idea of using explanatory footnotes is a great solution. -- Malik Shabazz Talk/Stalk 20:01, 10 June 2017 (UTC)
  • Also agree it's a problem. The dates of birth and death are useful (and an occasional "must have" tidbit) but otherwise it belongs elsewhere than the lead. Jason Quinn (talk) 10:02, 11 June 2017 (UTC)
  • Back in April, I went through all the FAs and GAs I'd worked on and moved much of the "clutter" to explanatory notes. No one's objected, and I think they look and read much better. Ealdgyth - Talk 22:32, 11 June 2017 (UTC)
  • Agreed with the sentiment here. I suggest a proper RfC and then we can start to trim the lead sentences back to shape. The metadata could be moved to infoboxes, or a section on, uh, 'alternative names and related details'. Btw, I support keeping day and month of the birth and place for bio articles, it seems fine. No need for all the alt names, pronunciation, etc. --Piotr Konieczny aka Prokonsul Piotrus| reply here 03:13, 12 June 2017 (UTC)
  • Spot on. Infoboxes are the place for this sort of thing. If only there weren't a minority opposed to using them in some articles. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 09:09, 12 June 2017 (UTC)
  • Glad to know I'm not the only one bugged by this. Move to infobox sounds like a good idea. - DavidWBrooks (talk) 17:13, 13 June 2017 (UTC)
  • Agree. Superfluous info in lead sentence can be placed in the infobox, in the second sentence/paragraph of the lead, or in the body of the article. Also, both infobox and body might be used, and the metadata in the body could be expounded upon when appropriate (to address the "growing metadata" concern), some of which might not be needed in the infobox. Paine Ellsworth put'r there 01:22, 15 June 2017 (UTC)
  • Seconding Doc James's comments earlier above. He and I have worked through this issue on WP:MED articles to the point where all you need is any sensible combination of (1) infobox parameter values (for example, Synonyms, OtherName, Pronunciation, Pronounce, various date parameters) and (2) a section down lower in the article (for example, Etymology and pronunciation, History, Nomenclature, or Society and culture, depending on the instance), and you can include *all* information without clogging the lede *and* without sacrificing anything. Quercus solaris (talk) 02:46, 15 June 2017 (UTC)
  • From a wide, reader-friendly (and sleeker metadata) outlook, I wholeheartedly and way thoroughly back the pith of this well-written op-ed. Give them a smooth shave by all means. But. I've always had a half-crazy, wonkish, fawnish fondness for those long and winding roads, those parenthesis-sliced, comma-delimited, italics-littered, IPA-riddled, hyperlink-hived lead sentences on en.WP, which I find even more fun in biographical articles. I'm down for 'em like nifty-quick little crossword puzzles! Gwen Gale (talk) 08:47, 16 June 2017 (UTC)
  • What this op-ed terms a "metadata virus" looks like a simple tendency to include as many translations of a name as possible, and is often easily solved using {{Infobox Chinese}} (misleadingly named, as it has options for all sorts of other languages). I'm ignoring the hypothetical example because that's never going to happen. Jc86035 (talk) Use {{re|Jc86035}} to reply to me 17:42, 16 June 2017 (UTC)
  • I have seen quite a lot of activity in the past year of editors cutting lead sentences down to size. In fact I imagined that we'd already reached a consensus to have nothing more than dates in the parenthesis, so I use infoboxes for other stuff. If that's not already the standard, it should be, as the article implies. Chiswick Chap (talk) 16:07, 19 June 2017 (UTC)
  • (copy of my RfC !vote) strongly support - more and more of our readers use mobile (up to 60% now) - clutter in the first sentences means absurd amounts of scrolling to get to anything meaningful. As a result the WMF started inserting the description field from Wikidata (unsourced, mostly unpatrolled) at the start of articles, to give readers a sense of what the article is about with reasonable efficiency. Vandalism that appeared in an article via the description field led to this ANI thread, which led to this RfC to ask WMF to stop using Wikidata this way (succeeded and done), which led User:Dank to open this thread at WT:FAC to make tight lead sentences part of FAR. This is important - we should not have clutter in the first 2.5 paragraphs - we have a responsibility to keep these sentences focused on content that summarizes the article. We have infoboxes and sections below for the details like etymology, pronunciation, alt names, etc. Jytdog (talk) 14:49, 3 July 2017 (UTC)

Wikipedia:Wikipedia Signpost/2017-06-09/Op-ed