An article[1] in sociology journal The Information Society looks at interactions between Wikipedia editors and the project's governance, visible in the articles on stem cells and transhumanism, and in the analysis of Wikipedia's discussion of userboxes, all through the prism of Jürgen Habermas' universal pragmatics and Mikhail Bakhtin's dialogism theories.
The authors use qualitative analysis of the language used by editors to argue that Wikipedia has elements of a democracy and is an example of a Web 2.0 tool that empowers discourse. They stress that some forms of discourse found online (including on Wikipedia) may be highly irrational, a point that earlier arguments for Web 2.0 as a democratic space have often ignored, but they argue that this is in fact not as much of a hindrance as previously expected. Cimini and Burr remark that discourse can develop between Wikipedians of widely differing points of view, and that some editors will engage in "repeated, strategic, and often highly manipulative attempts" to assert personal authority. Such discussions may be very lively, involving "personal, emotional, or humour-based arguments", yet the authors argue that such comments may not be a hindrance; instead, "on many occasions, there is thus a clearer exposition of views that is achieved, in spite of, or perhaps because of, these personal [and] sometimes vulgar methods of argumentation."
In the end, the authors are positive about the success of Wikipedia's deliberation in reaching consensus, although they say that it can be "fleeting and transitory" on occasion. Unfortunately, the paper does not touch on Wikipedia policies such as Wikipedia:Civility and Wikipedia:No personal attacks, which would certainly have added to their analysis.
Although the paper states that the research was approved by a university research ethics committee, it critically discusses the postings of specifically named editors ("[Editor A's] claim to authority and ad hominem attacks were met with derision by [Editor B]"; names replaced by the Signpost), which may raise eyebrows. Not all editors are 100% anonymous, which raises the question of whether the researchers did enough to protect the identity and reputation of the editors they cite. At the very least, why weren't the editors' usernames changed in the quotes? Their direct identification adds nothing to the article, and may expose the users to attack. (Similar questions have been discussed in the past by members of the Wikimedia Foundation Research Committee.)
In a paper presented at the 4th International Conference on Intercultural Collaboration (ICIC),[2] Kulkarni et al. offer a simple approach to support the work of Wikipedia editors who maintain articles concerning the same topic in multiple language versions. The long-term goal is to implement a bot that supports these specialized users by highlighting missing attributes and content inconsistencies.
The analysis focused on a pairwise comparison of infoboxes in different languages. First, the attribute-value pairs were extracted from the infoboxes and translated into English via Google Translate. Matching attribute names were identified through direct text comparison, supplemented by a set of synonyms obtained from WordNet (this step was included to handle mismatches caused by translation errors and variations). In a second step (the matching of attribute values), the authors again used direct text comparison, and also checked whether the values could be identified as homophones, to exclude mismatches caused by spelling mistakes in the text.
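The two matching steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `SYNONYMS` table is a hypothetical stand-in for WordNet lookups, `phonetic_key` is a deliberately crude sound-alike key rather than a real homophone detector, and the infoboxes are assumed to be already extracted and translated into plain dictionaries.

```python
import re

# Hypothetical synonym table standing in for WordNet lookups (assumption).
SYNONYMS = {
    "population": {"inhabitants", "residents"},
    "founded": {"established"},
}


def normalize(text):
    """Lowercase and collapse punctuation/whitespace for direct comparison."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def names_match(a, b):
    """Step 1: direct text comparison of attribute names, then synonyms."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return True
    return b in SYNONYMS.get(a, set()) or a in SYNONYMS.get(b, set())


def phonetic_key(text):
    """Crude sound-alike key (assumption): keep the first letter, drop later
    vowels and h/w/y, then collapse repeated letters."""
    s = re.sub(r"[^a-z]", "", text.lower())
    if not s:
        return ""
    key = s[0] + re.sub(r"[aeiouyhw]", "", s[1:])
    return re.sub(r"(.)\1+", r"\1", key)


def values_match(a, b):
    """Step 2: direct text comparison of values, then the sound-alike check."""
    if normalize(a) == normalize(b):
        return True
    ka, kb = phonetic_key(a), phonetic_key(b)
    return ka != "" and ka == kb


def compare_infoboxes(box_a, box_b):
    """Classify each attribute of box_a as matched, conflicting, or missing in box_b."""
    matched, conflicting, missing = [], [], []
    for name_a, value_a in box_a.items():
        partner = next((n for n in box_b if names_match(name_a, n)), None)
        if partner is None:
            missing.append(name_a)
        elif values_match(value_a, box_b[partner]):
            matched.append(name_a)
        else:
            conflicting.append(name_a)
    return matched, conflicting, missing
```

In this sketch, attributes reported as missing or conflicting are the ones a supporting bot would highlight for an editor; the made-up sample values used to exercise it should not be read as data from the paper.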
The evaluation data set for these analyses and for the whole pipeline included articles from the English, German, Chinese and Hindi Wikipedias concerning two restricted domains: Indian cities and US-based companies. The evaluation revealed "a significant increase in recall after the concepts of homophones and synonyms were applied in addition to the direct text comparison." But the overall result was very weak, mainly due to translation errors. The authors noticed syntactic and semantic differences between the infoboxes, such as paraphrasing or different fact representations. "Also, abbreviations, unit conversion and geographic location matching [was not handled by their system]." The researchers plan to improve the system by addressing all of these issues in turn.
An undergraduate computer science honors thesis at Trinity University (Texas) constructs a semantic graph from 451 articles, linked to from the World War II article.[3] Ryan Tanner's goal is to produce a visualization "which allows one to quickly find and examine connections between the people, places and things described in Wikipedia". The process is as follows:
Originally the goal was to visualize the whole of Wikipedia; however, due to problems with the dump, only 250,000 articles out of about 1.5 million were imported. An even smaller subset was ultimately usable, since the Stanford NLP library crashed on many of the remaining articles due to markup issues and the need for manual cleanup. To ensure a dense graph, tests were focused on the network of the World War II article. Some brief examples of the resulting graph are given in Chapter 10, which notes false positives as one problem requiring further investigation. The author makes suggestions for future research, such as using the Simple English Wikipedia or more complex relations.
A paper titled "Leading the Collective: Social Capital and the Development of Leaders in Core-Periphery Organizations"[4] looks at how leaders emerge in Wikipedia and similar crowd-based organizations. While often seen as egalitarian and with little hierarchy, such projects always have a group of leaders who have emerged from the community (the "crowd"), involved in planning, mediation, and policy development. The authors treat Wikipedia and similar organizations as instances of the core–periphery network model developed by Steve Borgatti, a system with a deeply interconnected center and a poorly connected periphery. In Wikipedia, the leaders ("core") comprise the most active contributors, and the authors assume they produce the most social capital. Using social network analysis, the paper looks at the interpersonal ties between the editors, focusing on the ties between leaders and periphery. The hypothesis is that specific types of ties will have a greater influence on advancement to leadership.
The authors collected data from RfA pages, and the ties were measured through user-talk-page interactions. Leaders were defined as admins, and periphery editors as non-administrators; this operationalization may raise some doubts about its validity, since some very active and prominent members of the community are not admins, something the authors do not address. The authors find that the most important ties are the early ones to the periphery, and later, ties to the leaders. Overall, strong ties are not as important as weak ties, although Simmelian ties (between pairs of leader groups) are among the most important.
Collier and Kraut conclude that leaders in projects such as Wikipedia do not suddenly appear; instead, they evolve over time through their immersion in the project's social network. Early in their experience, those leaders gain a deeper understanding of the community, developing a network of contacts through their weak ties to the periphery; later, their most important ties are to the leaders, particularly in the form of a strong connection to a leader group.
A paper[5] presented at an international conference on intercultural collaboration aims "to identify the type of community interaction needed for successfully creating or amending an article via Wikipedia translation activities", and proposes new software tools to facilitate these interactions. To this end, the researchers from Kyoto University analyzed 1694 talk-page comments from three Wikipedias, belonging to articles in categories marking (partial or complete) translations (e.g. fr:Catégorie:Projet:Traduction/Articles_liés): 228 articles from the Finnish, 93 from the French, and 94 from the Japanese Wikipedia. They attempted to categorize (code) each comment according to which "activity" it referred to (either editing the article or translating it), which "context" it referred to (using the categories "content", "layout", "sources", "naming", "significance" and "wording"), and which action was intended (requesting or providing help, requesting an edit, announcing an edit that the user had made, criticizing the article without a direct request for action, coordinating actions between users, or referring to an established Wikipedia policy).
Regarding comments focused on the activity of editing, the "results were consistent with previous research, with a high frequency of discussion contributions about content and layout". The authors found that "the Japanese Wikipedia was the only one with more discussion contributions about layout than content when the discussion was about editing activities (40.18%)" and speculate that this is because "in the older, or larger, Wikipedias, practices and policies are likely to be better established than in the younger, or smaller, Wikipedias leading to a lower frequency of discussions about layout." (However, they later point out that the Finnish Wikipedia, rather than the Japanese, is the smallest and youngest among the three examined ones, noting that it shows a much higher frequency of discussion about policy—15.0%, versus 6.0% on the French and 3.3% on the Japanese Wikipedia.) In this class of comments, "discussions about citing sources were relatively common in the Finnish and French Wikipedias (18.8% and 12.4%, respectively). In the Japanese Wikipedia, sources were less common with 7.1% of all discussion contributions regarding editing activities."
Most discussions about translation activities were about naming, that is, "resolving the proper form for the title of the article, section or sub-section, names or proper nouns, and transliteration in the corresponding article". This contradicts the researchers' initial hypothesis that such discussion would "have a high frequency of contributions regarding translation of specific words and expressions" (their "naming" category "does not include phrasing or resolving proper translation of individual words or expressions"). As one reason, they identify "the diversity in naming practices of events between different language sources, such as mass media. Especially in the Finnish Wikipedia, discussion about sources was common (16.15%). These two topics are loosely related, as direct translations of the names of well-known events are often not acceptable in the target language Wikipedia."
Having identified naming issues and the search for suitable sources in the target language as "key problems" emerging in the translation discussions, the authors conclude that "the current approaches for supporting Wikipedia translation are not necessarily solving the main problems in Wikipedia translation" and proceed to suggest two "directions for designing supporting tools for Wikipedia translation, especially through open source development of MediaWiki extensions":
The paper makes references to previous work on Wikipedia translation (including the authors' own), but does not mention the EU-supported CoSyne project, which aims to integrate tools with MediaWiki that "automate the dynamic multilingual synchronization process of Wikis" and would seem to have a lot of overlap with the kind of tools discussed in the paper.
A paper[6] by three researchers affiliated with the EU-supported RENDER project (to be presented at next month's "Hypertext 2012" conference) promises "accurate revert detection in Wikipedia". The article starts by describing the detection of reverts as "a foundational step for many (more elaborated) research ideas, [whose] purposeful handling leads to a superior understanding of wiki-like systems of collaboration in general", giving an overview of such research. (Revert detection has also been used in tools built for the editing community, such as this one, which identifies articles on the German Wikipedia that are currently controversial.)
Reviewing the "state-of-the-art in revert detection", the authors criticize the prevalent "identity revert detection method" (SIRD), which relies on finding identical revisions using MD5 hashes, arguing that it does not fully match the definition of a revert in the (English) Wikipedia's policies at Wikipedia:Reverting: The SIRD method "does not require the reverting edit to actually undo the actions of an edit identified as reverted ... [Furthermore, it] is not possible to indicate if the reverting edit fully, partly or not at all undid the actions of the reverted edit ... It also does not require the intention of the reverting edit to revert any other edit." (Still, mainly due to requests by researchers, MD5 hashes were recently integrated directly into the revision table stored by MediaWiki, which required considerable technical effort to update the existing databases of Wikimedia projects.)
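To make the criticized approach concrete, identity-based revert detection can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not code from the paper: revisions are given as a chronological list of wikitext strings, and the function name `identity_reverts` is invented for this example.

```python
import hashlib


def md5_of(text):
    """MD5 hex digest of a revision's full text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def identity_reverts(revisions):
    """Find revisions whose text is identical to an earlier, non-adjacent revision.

    Returns a list of (reverting_index, restored_index, reverted_indices).
    Every revision between the two identical ones is counted as reverted,
    regardless of what it changed or what the reverting editor intended.
    """
    last_seen = {}  # hash -> index of the most recent revision with that hash
    found = []
    for i, text in enumerate(revisions):
        h = md5_of(text)
        if h in last_seen and last_seen[h] < i - 1:
            j = last_seen[h]
            found.append((i, j, list(range(j + 1, i))))
        last_seen[h] = i
    return found
```

The sketch shows why the authors' objection applies: the only evidence used is that two revisions hash to the same value, so nothing records whether the later edit actually undid, or meant to undo, the edits in between.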
The paper then presents the authors' new method for revert detection, which still aims to detect full reverts and to avoid false positives, while coming closer to the Wikipedia community's definition. It is implemented as an algorithm based on splitting the revisions' wikitext into word tokens (and made available online as a Python script). Also, MD5 hashes are still used on a paragraph level to be able to detect unchanged paragraphs easily and speed up computation. The algorithm was then evaluated by a panel of Wikipedians recruited on the English Wikipedia in comparison with the existing SIRD method.
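The role of the paragraph-level hashes can be illustrated with a short sketch. It is not the authors' published script: splitting paragraphs on blank lines, tokenizing on whitespace, and the function names are all assumptions made for this example, and the word-level comparison is reduced to a simple multiset difference.

```python
import hashlib
from collections import Counter


def paragraph_hashes(wikitext):
    """Split wikitext on blank lines and pair each paragraph with its MD5 digest."""
    paragraphs = [p for p in wikitext.split("\n\n") if p.strip()]
    return [(hashlib.md5(p.encode("utf-8")).hexdigest(), p) for p in paragraphs]


def changed_tokens(old_text, new_text):
    """Return (removed, added) word-token counts between two revisions.

    Paragraphs whose hash appears in both revisions are skipped entirely,
    so tokenization only happens where something actually changed.
    """
    old = paragraph_hashes(old_text)
    new = paragraph_hashes(new_text)
    old_hashes = {h for h, _ in old}
    new_hashes = {h for h, _ in new}
    old_tokens = Counter(t for h, p in old if h not in new_hashes for t in p.split())
    new_tokens = Counter(t for h, p in new if h not in old_hashes for t in p.split())
    return old_tokens - new_tokens, new_tokens - old_tokens
```

A revert detector along these lines would then check whether the tokens added by one edit are exactly the tokens removed by a later one; that comparison step is left out here.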
As summarized by the authors, this user study found the new method to be "more accurate in identifying full reverts as understood by Wikipedia editors. More importantly, our method detects significantly fewer false positives than the SIRD method [27% in the sample, which however was somewhat small]". As a drawback, the authors note "the increased computational cost. As [the new algorithm] is quadratic over the number of words in the DIFFs [the changed text between subsequent revisions], in its current implementation it might not be the tool of choice if larger amounts of articles are to be analyzed; especially in the case of complete history dumps of the large Wikipedias, e.g., English, German or Spanish."