A draft of a letter, submitted for publication, has been posted on arXiv.[1] The letter reports research on modeling the process of collaborative editing in Wikipedia and similar open-collaboration writing projects. The work builds on previous research by some of its authors on conflict detection in Wikipedia. The authors explore a simple agent-based model of opinion dynamics, in which editors influence each other either by direct communication or by successively editing a shared medium, such as a Wikipedia page. According to the authors, the model, although highly idealized, exhibits rich behavior that can reproduce, albeit only qualitatively, some key characteristics of conflicts over real-world Wikipedia pages. The authors show that, for a fixed editorial pool with one "mainstream" and two opposing "extremist" groups, consensus is always reached. However, depending on the values of the model's input parameters, achieving consensus may take an extremely long time, and the consensus does not always conform to the initial mainstream view. In the case of a dynamic group, where new editors replace existing ones, consensus may be reached only after a phase of conflict, depending on the rate at which new editors join the editorial pool and on the degree of controversy over the article's topic.
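To make the idea of editors influencing each other through a shared page more concrete, here is a minimal sketch of one possible mechanism. It is not the authors' model: the bounded-confidence update rule, the function name `simulate_editing`, and the parameters `tolerance` and `mu` are all assumptions chosen for illustration, and direct communication between editors is left out entirely.

```python
import random

def simulate_editing(opinions, article, tolerance=0.2, mu=0.3, steps=20000, seed=0):
    """Toy article-mediated opinion dynamics (illustrative assumption only).

    opinions: editor opinions on a 0..1 scale
    article:  the view currently expressed by the shared page
    """
    rng = random.Random(seed)
    opinions = list(opinions)
    for _ in range(steps):
        i = rng.randrange(len(opinions))  # pick a random editor
        gap = article - opinions[i]
        if abs(gap) <= tolerance:
            # The page is acceptable: the editor drifts toward it.
            opinions[i] += mu * gap
        else:
            # The page is unacceptable: the editor rewrites it toward their view.
            article -= mu * gap
    return opinions, article

# One "mainstream" group and two opposing "extremist" groups.
editors = [0.0] * 5 + [0.5] * 20 + [1.0] * 5
final_opinions, final_article = simulate_editing(editors, article=0.5, steps=2000)
```

Because every update is a weighted average of an opinion and the article, all values stay within the initial range; whether and how fast the pool converges depends on the assumed parameters.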
In a copyright panel at this month's Wikimania, Abhishek Nagaraj, a PhD student and economist from the MIT Sloan School of Management, presented early results from an econometric study of copyright law. The study used data from the English Wikipedia's WikiProject Baseball to examine how gains from digitization are moderated by the effects of copyright. Previous work on the economics of copyright has struggled to disentangle the effects of copyright from the effects of the increased access that often coincides with content entering the public domain.
The paper takes advantage of the fact that in 2008, Google digitized and published a large number of magazines as part of the Google Books project. Among the magazines published were 70 years of back issues of Baseball Digest, a magazine that publishes baseball stories, statistics, and photographs. Measuring the effect of digitization, Nagaraj found that the articles on baseball All-Stars from between 1944 and 1984 saw large increases in size (5,200) around the period that the digital Google Books version of Baseball Digest became available. However, because of the law governing copyright expiration, all the issues of Baseball Digest published before 1964 were in the public domain, while issues published afterwards were not. Using the econometric difference-in-differences technique, Nagaraj compared the effects of digitization for (1) players who began their professional baseball careers after 1964 and as a result had no new digitized public-domain material, and (2) players who had played before 1964 and were thus more likely to have digitized material about them enter the public domain.
In terms of the effect of copyright, Nagaraj found that public-domain status had no effect on the length of Wikipedia articles but a strong effect on images. Wikipedia writers could, presumably, simply rewrite copyrighted material, or may not have found the Baseball Digest format appropriate for the encyclopedia. However, Nagaraj found that the availability of public-domain material in Baseball Digest led to a strong increase in the number of images. Before Google Books published the material, the pre-1964 group had an average of 0.183 pictures on their articles and the post-1964 group had about 0.158 pictures. In the period after digitization, both groups increased, but the older group increased more: to 1.15 pictures per article, as opposed to 0.667 pictures for the more recent players whose Baseball Digest material was still under copyright. Nagaraj also found that players with public-domain material have more traffic to their articles. The study controls for a large number of variables related to players, their performance and talent, and their potential popularity, as well as for trends in Wikipedia editing.
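The difference-in-differences logic can be illustrated with the four group averages quoted above. This is only a back-of-the-envelope sketch: the function name `diff_in_diff` is made up for illustration, and the raw figure it produces is not the study's estimate, which additionally controls for player characteristics and editing trends.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus the change in the control group."""
    return (treated_after - treated_before) - (control_after - control_before)

# Average number of images per article, as reported in the summary above.
estimate = diff_in_diff(
    treated_before=0.183, treated_after=1.15,   # played before 1964 (public-domain material)
    control_before=0.158, control_after=0.667,  # debuted after 1964 (still under copyright)
)
print(round(estimate, 3))  # 0.458
```

Subtracting the control group's change removes the growth that both groups experienced after digitization, leaving roughly 0.46 additional images per article associated with public-domain status.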
The presentation slides are available on the Wikimania conference website[2] and a nice journalistic write-up was published by The Atlantic.
Field notes can be a valuable source of information about meteorological, geological and ecological aspects of the past, and making them accessible by way of Wikisource-based semantic annotation was the focus of a recent study[3] published in ZooKeys as part of a special issue on the digitization of natural history collections. The paper described how the field notes of Junius Henderson from the years 1905–1910 have been transcribed on Wikisource and then semantically annotated, as illustrated in the screenshot. Henderson was an avid collector of molluscs and, while trained as a judge, served as the first curator of the University of Colorado Museum of Natural History. His notebooks are rich in species occurrence records, but also contain occasional gems like this one from September 3, 1905:
“Train again so late as to afford ample opportunity for philosophic meditation upon the motives which inspire railroad people to advertise time which they do not expect to make except under rare circumstances”
The article provides a detailed introduction to the workflows on the English Wikisource in general and to WikiProject Field Notes in particular, which is home to transcriptions of other field notes as well. The data resulting from annotation of the field notes are available in Darwin Core format under a Creative Commons Public Domain Dedication (CC0). This work ties in with discussions that took place at Wikimania about the future of Wikisource, the technical prerequisites and existing tools and initiatives.
The quality of medical information in Wikipedia could be vastly improved, according to a recent study of 24 articles in pediatric otolaryngology[4] (more commonly referred to as "ear, nose, and throat" or ENT). The study compared coverage of common ENT diagnoses on Wikipedia, eMedicine, and MedlinePlus (the three most popular websites, by the authors' determination). It found that Wikipedia's ENT articles were the least accurate and contained the most errors of the three, and that they fell between the other two in readability.
Although it is one of the most referenced sources in this area, Wikipedia had poor content accuracy (46%). MedlinePlus had comparable accuracy (49%) but was missing 7 topics. The clear leader in accuracy, eMedicine, suffered from a higher reading level. The study provides specific criteria, in section 2.3, which could be considered for the evaluation of existing articles. One limitation of the study is that, while suggesting that Wikipedia "suffers from the lack of understanding that a physician-editor may offer", it does not point to information on how to get involved with Wikipedia. Engagement with the pediatric medicine community would be beneficial, especially since about 25% of parents made decisions about their children's care in part based on online information.
A forthcoming paper at this year's WikiSym conference investigates the emotions expressed in article and user talk pages.[5] It finds that "Administrators tend to be more positive than regular users", and suggests that "as women gain experience in Wikipedia they tend to adopt the emotional tone of administrators", for instance linking to policy at more than twice the rate of men. Because women are more likely to interact with other women, the authors suggest gender-aware recruiting to address the gender gap.
The authors point out the utility of positive emotion in keeping discussions on track, and suggest that experienced editors should be encouraged to maintain a positive climate. To determine users' gender, they ran a crowd-sourced study through Crowdflower. Emotions are measured using the ANEW wordlist, which captures the range of emotional variability along three dimensions: valence, arousal, and dominance. The paper notes that policy mentions tend to have "a remarkably positive and dominant tone, and with stronger emotional load than in the rest of the discussion".
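A lexicon-based emotion measure of this kind might look roughly like the sketch below. Everything specific in it is an assumption: the `TOY_LEXICON` ratings are invented placeholders rather than real ANEW values, and averaging the ratings of matched words is one common way of using such a wordlist, not necessarily the paper's procedure.

```python
import re

# Hypothetical (valence, arousal, dominance) ratings on a 1-9 scale.
# These numbers are made up for illustration and are NOT taken from ANEW.
TOY_LEXICON = {
    "thanks": (8.0, 5.0, 6.5),
    "good": (7.5, 5.0, 6.5),
    "policy": (5.0, 4.0, 6.0),
    "vandalism": (2.0, 6.0, 3.5),
    "wrong": (2.5, 5.5, 4.0),
}

def emotion_scores(text, lexicon=TOY_LEXICON):
    """Return the mean (valence, arousal, dominance) of rated words, or None."""
    words = re.findall(r"[a-z']+", text.lower())
    rated = [lexicon[w] for w in words if w in lexicon]
    if not rated:
        return None  # no word in the comment has a rating
    n = len(rated)
    return tuple(sum(r[k] for r in rated) / n for k in range(3))

print(emotion_scores("Thanks, good edit!"))  # (7.75, 5.0, 6.5)
```

Comments containing no rated words return `None` rather than a neutral score, so that they can be excluded instead of diluting the averages.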
A paper from the University of Alberta addresses the difficulty of analyzing edit histories, and of finding conflict in particular.[6] The authors use terms indicating content-based agreement (e.g. "add", "fix", "spellcheck", "copy", and "move") and disagreement ("uncited", "fact", "is not", "bias", "claim", "revert", and "see talk page"). They define conflicting interactions as those that revert or delete content, or that use more negative terms than positive terms. They find that this is a useful way to identify controversial articles.
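The definition of a conflicting interaction can be sketched as a simple rule. The term lists below are the examples quoted above, not necessarily the paper's full lists; the function names, the `is_revert` and `deletes_content` flags, and the naive substring matching are assumptions made for illustration.

```python
AGREEMENT_TERMS = ("add", "fix", "spellcheck", "copy", "move")
DISAGREEMENT_TERMS = ("uncited", "fact", "is not", "bias", "claim", "revert", "see talk page")

def count_terms(comment, terms):
    """Count occurrences of each term as a substring of the lowercased comment."""
    text = comment.lower()
    return sum(text.count(term) for term in terms)

def is_conflicting(comment, is_revert=False, deletes_content=False):
    """Apply the rule: revert, deletion, or more negative than positive terms."""
    if is_revert or deletes_content:
        return True
    negative = count_terms(comment, DISAGREEMENT_TERMS)
    positive = count_terms(comment, AGREEMENT_TERMS)
    return negative > positive

print(is_conflicting("fix typo, add reference"))       # False
print(is_conflicting("uncited claim, see talk page"))  # True
```

Substring matching is deliberately crude (for example, "fact" also matches "factual"); a real analysis would likely tokenize the edit summaries first.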
A student paper for a course on "Project in Mining Massive Data Sets" at Stanford University, titled "Wikipedia Mathematical Models and Reversion Prediction",[7] tries to use mathematical models "to explain why the amount of [editors on the English Wikipedia] stops increasing, whereas the amount of viewers keeps increase", and "to predict if an edit will be reverted." The researchers used Elastic MapReduce on Amazon's servers to carry out this research. The paper is a bit confused, since the researchers are more interested in models and validation than in explaining the phenomena.
The first part of the paper includes two models for examining the relation of visitors to editors in Wikipedia's community. The first model assumes that editors act as predators and articles have the role of prey. However, this model did not fit the data. The second model uses a linear regression over a number of factors, which allows the authors to model the community's statistics over time. The model is then tested using simulation and seems to produce accurate results.
In the second part of the paper, three models were used to predict which edits will be reverted. The models were trained using 24 features, classified as edit-based, editor-based or article-based; examples include an article's age, its edit count, the number of editors who have edited it, the number of articles the editor has edited, and the change in information compared to the previous state. The three machine learning algorithms achieved about 75% accuracy. Another interesting conclusion was that the ability to detect reversion has not changed much over time.
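For readers unfamiliar with predator-prey models, the sketch below steps the textbook Lotka-Volterra equations forward in time. The summary does not say which equations the students used, so the choice of Lotka-Volterra, the forward-Euler integration, and all parameter names and values are assumptions for illustration only.

```python
def lotka_volterra_step(prey, predators, alpha, beta, gamma, delta, dt):
    """Advance the classic predator-prey equations by one forward-Euler step.

    d(prey)/dt      = prey * (alpha - beta * predators)
    d(predators)/dt = predators * (delta * prey - gamma)
    """
    d_prey = prey * (alpha - beta * predators)
    d_predators = predators * (delta * prey - gamma)
    return prey + dt * d_prey, predators + dt * d_predators

def run_lotka_volterra(prey, predators, steps, **params):
    """Return the list of (prey, predators) states, including the initial one."""
    history = [(prey, predators)]
    for _ in range(steps):
        prey, predators = lotka_volterra_step(prey, predators, **params)
        history.append((prey, predators))
    return history

# Hypothetical parameters; "prey" stands for articles, "predators" for editors.
params = dict(alpha=1.0, beta=0.5, gamma=1.0, delta=0.25, dt=0.01)
trajectory = run_lotka_volterra(prey=10.0, predators=1.0, steps=1000, **params)
```

Such a system tends to oscillate around an equilibrium rather than level off, which may be one reason a model of this family would struggle to match an editor population that simply stops growing.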
Discuss this story
Why does the number of Wikipedia readers rise while the number of editors doesn't?
If Stanford University wanted to know "Why does the number of Wikipedia readers rise while the number of editors doesn't?" all they had to do was look at the nuclear power industry. Our site is like the power station, with the editors as the fuel rods and the guidelines, policies, bureaucracy, etc, as the control rods. Our problem on site is that the editors are increasingly frustrated by the control rods, which seem to sink further into the reactor each year and as a result of the control rod insertion more and more editors are experiencing the difficulty of having to work harder to get the article material heated to acceptable levels. Those at the top of the reactor have already experienced a total retardation of the nuclear fission process, while those at the bottom are unable or unwilling to pick up the slack. Despite this disturbing trend it does not affect the readers, who are outside the reactor's water loop and thus interact with the articles only in the heat exchanger, and as long as there is sufficient energy to boil the water - or in this case, to be more precise, maintain the articles and add new ones (even at a reduced rate) - the readers in the power loop will continue to power the machine that keeps Wikipedia moving. TomStar81 (Talk) 14:28, 31 July 2012 (UTC)
Why would anyone expect any correlation between the number of Wikipedia readers and editors, or their respective rates of change? Individuals read Wikipedia to obtain information. Individuals edit Wikipedia for a wide variety of reasons. There is no causal relationship between the numbers of readers and editors, and therefore no reason to expect numerical correlation. Finell 19:24, 1 August 2012 (UTC)
bug
It looks like something's broken in the mediawiki handling of English->Thai interwiki links, because the wikicode
disappears entirely, causing the fuzzy logic paragraph in this article to have this mangled phrase:
-R. S. Shaw (talk) 21:30, 31 July 2012 (UTC)
Thai Featured Article study and overfitting
There are only 91 featured articles in the Thai wikipedia, 88 of which were used by the study. I'm not sure that's really a good enough sample size to get good results. (Okay, sure, there are 75,000 Thai WP articles total, which is a good sample set, but they picked only 100 "normal" articles.) The fact that their algorithm caught *all* the FAs makes me a bit suspicious too - it's easy to make a model catch everything with lots of specific hacks, but it's not clear if you get a good model going forward - overfitting. (Think of weird edge case FAs in English WP promoted in 2007 with cleanup tags in the middle of a FAR - it'd be weird for a non-overfit / non-super-generous model to mark it as featured, so some error rate is "good.") If they'd had, say, 400 featured articles to play with, and fed 300 of them into the corpus + 10K non-featured articles, and then had to guess on the remaining 100 FAs mixed in with a different 10K normal articles, then the results might have been interesting. As it stands, alas. I'd also want to see a very low rate of false positives ideally since so many "normal" articles are easy to rule out just on basis of, say, footnote count; an algorithm that could tell the difference between articles with lots of footnotes because they're radioactively controversial recent events vs. ones with lots of footnotes because they're featured.
Obligatory Nate Silver link: http://fivethirtyeight.blogs.nytimes.com/2011/03/24/models-can-be-superficial-in-politics-too/ . (Nice & simple overfitting explanation with examples, although presidential elections have an even tinier sample size.) SnowFire (talk) 17:58, 2 August 2012 (UTC)
Admin, emotion, women..what?