The Signpost
Single-page Edition
3 September 2014

Arbitration report
Media viewer case is suspended
Featured content
1882 × 5 in gold, and thruppence more
Op-ed
Automated copy-and-paste detection under trial
Traffic report
Holding Pattern
WikiProject report
Gray's Anatomy (v. 2)
Recent research
A Wikipedia-based Pantheon; new Wikipedia analysis tool suite; how AfC hamstrings newbies
 



2014-09-03

Holding pattern

This week, three of the top ten articles held their positions for a second straight week: the Ice Bucket Challenge at #1, Amyotrophic lateral sclerosis at #2, and Islamic State of Iraq and the Levant at #5. The death of English actor Richard Attenborough was apparently the most notable of the week, as that article entered the list at #3. Top news subjects of recent weeks, including Ebola virus disease (#7) and Robin Williams (#9), also remained popular.

For the full top 25 list, see WP:TOP25. See this section for an explanation for any exclusions.

For the week of 24 to 30 August 2014, the 10 most popular articles on Wikipedia, as determined from the report of the 5,000 most viewed pages, were:

Rank Article Class Views Image Notes
1 Ice Bucket Challenge Start-class 1,773,522
Number 1 for the second week in a row. This global viral phenomenon to raise awareness and funding for research on ALS was not launched by any particular charity, but seems to have grown on its own. While it has certainly achieved its goals, some have criticized the whole movement as feeling more like an act of slacktivism by many participants. But most viral phenomena have absolutely no redeeming social value (has Grumpy Cat raised millions for disease research?), so things could be much worse. Wikipedia did its part to keep things focused on substance by deleting the celebrity-fest page "List of Ice Bucket Challenge participants" on 29 August, after a lengthy deletion debate.
2 Amyotrophic lateral sclerosis B-class 880,652
Like #1, this article holds its position for the second week in a row.
3 Richard Attenborough B-class 794,061
This popular English actor died on August 24, at age 90. Attenborough won two Academy Awards as director and producer for Gandhi in 1983. He also won four BAFTA Awards and four Golden Globe Awards during his career. As an actor, memorable appearances included roles in Brighton Rock (1947), The Great Escape (1963), 10 Rillington Place (1971), and Jurassic Park (1993). He is survived by his wife of almost 70 years, former actress Sheila Sim.
4 Ariana Grande C-class 589,596
Up from #19 last week. The popular singer released her second album, My Everything, on August 25.
5 Islamic State of Iraq and the Levant C-class 448,261
Holding steady at #5 for a second week. This almost absurdly brutal jihadist group proudly publicizes on Twitter the mass executions it carries out, and has been disowned even by al-Qaeda. The recent execution of journalist James Foley is among the reasons for the continued popularity of this article.
6 Deaths in 2014 List 361,006
The list of deaths in the current year is always a popular article. Deaths this week included Leonid Stadnyk (August 24), a Ukrainian formerly listed by Guinness as the tallest man in the world; Swedish comic strip artist Lars Mortimer (August 25); Nigerian pastor Samuel Sadela, an unverified claimant to the title of oldest living man (August 26); American particle physicist Victor J. Stenger (August 27); former Soviet spy John Anthony Walker (August 28); Singaporean comedian David Bala (August 29); and 18-year-old Belgian cyclist Igor Decraene, who died in a train accident (August 30).
7 Ebola virus disease B-class 356,594
The 2014 West Africa Ebola outbreak continues to draw attention to this horrific disease.
8 Pseudoscorpion C-class 334,956
Reddit noted this week that "tiny pseudoscorpions (about 4mm) live inside old books, effectively protecting them by eating booklice and dustmites", a hook exciting enough for Reddit users to push this article into the top 10 this week.
9 Robin Williams B-class 332,653
Down from #3 last week. The unexpected death by suicide of this iconic comic on August 11 led to one of the highest spikes in views since this project began.
10 Facebook B-class 328,386
Usually a fairly popular article; a slower news week allowed it to percolate back up into the Top 10 for the first time in a while.


2014-09-03

Automated copy-and-paste detection under trial

One of the problems Wikipedia faces is users who add content copied and pasted verbatim from sources. When we follow up on a person's work, we often don't check for this, and a few editors have managed to make thousands of edits over many years before concerns were raised. In the past year, I've picked up three or four editors who have made many thousands of edits to medical topics in which their additions contained text copied word for word from elsewhere. Those who make only a few edits of this nature are usually never detected.

After a user detects this kind of editing, clean-up involves going through all of that editor's contributions and occasionally reverting dozens of articles. Unfortunately, this sometimes means restoring an article to a version from years earlier, which wipes out the efforts of the many editors who came afterwards. Such contingency reverts can end up harming overall article quality and frustrating the core editing community. What is the point of contributing to Wikipedia if it's simply a collection of copyright-infringed text cobbled together, and even your own original contributions disappear in the cleanup? Worse, the fallout can cause editors to retire. If we could catch the editors responsible early and explain the issues to them, we'd not only save a huge amount of work later on, but might retain editors who are willing to put in a great deal of time.

So what is the solution? In my opinion, we need near real-time automated analysis and detection of copyright concerns. I'd been trying to find someone to develop such a tool for more than two years; then, at Wikimania in London, I managed to corner a pywikibot programmer, ValHallASW, and convinced him to do a little work. This was followed by meeting a wonderful Israeli instructor from the Sackler School of Medicine, Shani Evenstein, who knew two incredibly able programmers, User:Eran and User:Ravid ziv. By the end of Wikimania our impromptu team had produced a basic bot – User:EranBot – that does what I'd envisioned. It works by taking all edits over a certain size and running them through Turnitin / iThenticate. Edits that come back positive are listed for human follow-up. Development of this idea was begun back in March 2012 by User:Ocaasi and can be seen here.
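The workflow just described can be sketched roughly as follows. This is only an illustration, not EranBot's actual code: the `Edit` record, the two threshold constants, and the `check_similarity` callable (a stand-in for whatever the Turnitin / iThenticate service returns) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical values; the article only says "over a certain size".
MIN_ADDED_CHARS = 500
SIMILARITY_THRESHOLD = 0.5


@dataclass
class Edit:
    """Hypothetical record of one edit; not a real pywikibot type."""
    revision_id: int
    added_text: str
    is_revert: bool = False


def flag_for_review(
    edits: Iterable[Edit],
    check_similarity: Callable[[str], float],
) -> List[Edit]:
    """Return the edits that a human should follow up on.

    check_similarity stands in for the external plagiarism service and
    is assumed to return a score between 0.0 and 1.0.
    """
    flagged = []
    for edit in edits:
        if edit.is_revert:
            continue  # reverts only restore older text, so skip them
        if len(edit.added_text) < MIN_ADDED_CHARS:
            continue  # too small to be worth checking
        if check_similarity(edit.added_text) >= SIMILARITY_THRESHOLD:
            flagged.append(edit)
    return flagged
```

Passing the checker in as a function keeps the filtering logic testable without access to the external service.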

Why near real time?

Determining copy-and-paste issues becomes more difficult the longer one waits between the initial edit and the checking, as one then has to deal with mirroring of Wikipedia content across the Internet. As well, many reliable sources – including peer-reviewed journals and textbooks – have begun borrowing liberally from Wikipedia without attribution. So if we're looking at copyright issues six months or a year down the road, we need to look at publication dates and go back in the article history to determine who is copying from whom.

In short, it's far more difficult for both humans and machines.
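The date comparison involved can be pictured with a small sketch. The function name and its return labels are invented here for illustration, and the result is only a first hint: publication dates can be wrong or missing, so a human still has to check the article history.

```python
from datetime import date
from typing import Optional


def likely_direction(text_added_on: date,
                     source_published_on: Optional[date]) -> str:
    """Guess who may have copied from whom, based on dates alone.

    text_added_on: date the matching text first appeared in the article.
    source_published_on: publication date of the matching source, if known.
    """
    if source_published_on is None:
        return "unknown"
    if source_published_on < text_added_on:
        # The source existed first, so the article may have copied it.
        return "article_may_have_copied_source"
    if source_published_on > text_added_on:
        # The article had the text first, so the source may be a mirror.
        return "source_may_have_copied_article"
    return "unknown"
```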

Why Turnitin?

Turnitin is an Internet-based plagiarism-prevention service created by iParadigms, LLC, first launched in 1997; it is one of the strategies used by some universities and schools to minimise plagiarism in student writing. The company that developed and owns the product has agreed to give us free access to their tools and API. Even though it's a for-profit company, there won't be obtrusive links from Wikipedia to their site, and no advertising for them will ever appear on Wikipedia.

Why would they want to be involved with us? Letting us use their tools doesn't cost them anything and is no disadvantage to shareholders. Some companies are willing to help us just because they like what we do. We've had a number of publishers donate large numbers of accounts to Wikipedians for similar reasons. They have extra capacity just sitting there, so why not give it away? They also know we're volunteers and are not going to buy their capacity anyway. Other options could include Google, but they don't allow their services to be used in this way, and it appears that Yahoo is currently charging for use by User:CorenSearchBot, which checks new articles for issues.

Benefits

How many edits are we looking at? Currently the bot is running only on the English Wikipedia's medical articles. In 2013, there were 400,000 edits to medical content – around 1,100 edits per day. Only about 10% of these are of significant size and not a revert, so the bot checks an average of around 100 edits per day. If we assume that 10% of those raise genuine copyright concerns, and that there are three times as many false positives as true positives, we're looking at 40 flagged edits per day at most. Who would follow up? With the number of concerning edits in the range of 40 per day, members of WikiProject Medicine will be able to handle the load. This is much easier than catching 30,000 edits of copyright infringement after the fact, with clean-up taking many of us away from writing content for many days.
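The estimate above can be reproduced in a few lines. The function and parameter names are made up for this sketch; the input figures are the ones quoted in the paragraph.

```python
def estimate_daily_load(edits_per_year: float,
                        significant_share: float,
                        concern_rate: float,
                        false_positives_per_true: float):
    """Return (edits per day, edits checked per day, edits flagged per day)."""
    edits_per_day = edits_per_year / 365
    checked_per_day = edits_per_day * significant_share
    true_positives = checked_per_day * concern_rate
    # Each true positive is accompanied by some number of false positives.
    flagged_per_day = true_positives * (1 + false_positives_per_true)
    return edits_per_day, checked_per_day, flagged_per_day


per_day, checked, flagged = estimate_daily_load(400_000, 0.10, 0.10, 3)
print(round(per_day), round(checked), round(flagged))
```

Without intermediate rounding this gives roughly 1,096 edits per day, 110 checked and 44 flagged; rounding the checked edits down to 100 first, as the paragraph does, gives 10 true positives plus 30 false positives, or 40 flagged edits per day.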

The Wiki Education Foundation has expressed interest in the development of this tool, since edits by students have previously contained significant amounts of plagiarism, kindling much discontent with Wiki Education's predecessor. The Hebrew Wikipedia is also currently working with this bot, and we'd be happy to see other topic areas and WMF language sites use it.

There are still a few rough aspects to iron out. The parsing out of the new text added by an edit is not as good as it could be. Reverts should be ignored. These issues are fairly minor to address, and a number have already been dealt with. While there were initially about three false positives for every true positive, we should have this down to a more even 50–50 split by the end of the week. Already in its early stages, this has turned out to be an exceedingly useful tool.
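To give a feel for what parsing out the new text involves, here is a minimal word-level sketch using `difflib` from Python's standard library. It is not the bot's own parser, and the function name `added_text` is an assumption made for this example; real wikitext would also need markup, templates and references stripped first.

```python
import difflib


def added_text(old: str, new: str) -> str:
    """Return the words that appear in `new` but were not matched in `old`."""
    old_words = old.split()
    new_words = new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both introduce words on the new side.
        if tag in ("insert", "replace"):
            added.extend(new_words[j1:j2])
    return " ".join(added)
```

Only the text returned by such a step would need to be sent for checking, which is also why a revert, which reintroduces old text, should be filtered out beforehand.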

The views expressed in this opinion piece are those of the author only; responses and critical commentary are invited in the comments section. Editors wishing to propose their own Signpost contribution should email the Signpost's editor in chief.


2014-09-03

Media viewer case is suspended

On 1 September, the Arbitrators voted to suspend the Media Viewer case for 60 days. After the suspension period is up, the case is to be closed unless the committee votes otherwise. The case suspension comes in response to several new initiatives and policies announced by the Wikimedia Foundation that may make the case moot. In the same motion, the committee declared that Eloquence's resignation of the administrator right was "under a cloud" and that he can only regain the right through another RfA.

Audit Subcommittee appointments

The Arbitrators voted to appoint Callanecc (talk · contribs), Joe Decker (talk · contribs) and MBisanz (talk · contribs), with DeltaQuad (talk · contribs) as the alternate, to the 2014 Audit Subcommittee.
