News from the Wikipedia project

jakob — Tue, 02 Oct 2007 10:23:59 +0000

The first KIM-DINI Technology Watch Report (ISSN 1865-3839) has just been published. On 35 pages it summarizes current developments in standards and standardization for digital information provision. I contributed a piece on recent news from the Wikipedia project, which I would also like to publish here. It covers quality assurance measures and the extraction of data from Wikipedia for the Semantic Web:

While Wikipedia, the free encyclopedia written by volunteers, keeps growing steadily (the English edition passed the mark of two million articles in early September), the topic of quality is becoming increasingly important to the people involved in the project. To that end, the association Wikimedia Deutschland held the second Wikipedia Academy on 24 and 25 August at the Akademie der Wissenschaften und der Literatur in Mainz, with the aim of establishing Wikipedia more firmly in academia.

A more far-reaching development is on the horizon with the Wikipedia extension for so-called "stable versions", which has been expected for more than a year. The Flagged Revisions extension makes it possible to mark individual article versions and thus, for example, highlight them as reviewed. The plan is to involve scientists and other experts specifically in reviewing articles. At http://quality.wikimedia.org the planned feature can be discussed and will shortly be available for trial use.

As Wikipedia increasingly contains factual knowledge, the encyclopedia is also becoming interesting for Semantic Web applications. In early September a new version of the DBpedia database was published, containing nearly 2 million objects extracted from Wikipedia and more than 100 million statements and links in RDF format. In addition, the data is being linked with a growing number of other open databases and can be queried as one shared fact database.

Quality studies at Wikimania2007

jakob — Sat, 04 Aug 2007 03:38:12 +0000

I just participated in a Wikimania 2007 session with two very smart talks about quality studies in Wikipedia. Both were examples of the rare (but hopefully growing) number of scientific studies that combine knowledge of Wikipedia internals with relevance to the practical needs of Wikipedia. Last but not least, both include working implementations instead of ideas only.

First there is "Using Natural Language Processing to determine the quality of Wikipedia articles" by Brian Mingus, Trevor Pincock and Laura Rassbach. Brian, an undergraduate student at Colorado, presented a rating system that was trained on existing quality assessments of Wikipedia articles and a large set of features that may possibly be related to quality, mainly computed by methods of natural language processing. Machine classification could predict ratings very well. Brian believes (and convinced me) that the best approach to determining article quality is a combination of human ratings and machine classifications. Human ratings serve as training data and algorithms can reverse engineer the human ratings. You should not think that binary, top-down ratings like the upcoming stable versions done by experts are the definitive solution, but they may provide additional information to predict quality and to train automatic systems that reproduce and summarize quality ratings. I will publish links to Brian’s slides, paper and code as soon as I get them (we collect all slides). A preliminary paper is in the Wikimania Wiki.
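
To make the "human ratings as training data" idea concrete, here is a minimal sketch of my own, not the classifier from the talk. It assumes two made-up features per article (length in thousands of words and number of references) and uses a plain nearest-centroid classifier as a stand-in for whatever learning method the authors actually used. All names and numbers are hypothetical.

```python
from math import dist


def train_centroids(examples):
    """Average the feature vectors of each human rating class.

    examples: list of (feature_list, human_rating) pairs.
    """
    sums, counts = {}, {}
    for features, rating in examples:
        acc = sums.setdefault(rating, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[rating] = counts.get(rating, 0) + 1
    return {r: [v / counts[r] for v in acc] for r, acc in sums.items()}


def predict(centroids, features):
    """Return the rating whose centroid is closest to the given features."""
    return min(centroids, key=lambda r: dist(centroids[r], features))


# Hypothetical training data: [thousands of words, number of references]
examples = [
    ([0.3, 1], "stub"),
    ([0.5, 0], "stub"),
    ([5.0, 40], "featured"),
    ([7.0, 60], "featured"),
]
centroids = train_centroids(examples)
print(predict(centroids, [6.5, 55]))
print(predict(centroids, [0.2, 2]))
```

A real system would need many more features, feature scaling, and far more training data. The point is only that the human ratings define the classes and the machine learns to reproduce them.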

In the second talk Luca De Alfaro presented "A Content-Driven Reputation System for the Wikipedia". The basic idea of his clever algorithm for predicting the trustworthiness of Wikipedia text is that authors of long-lived contributions gain reputation and authors of reverted contributions lose reputation. The detection and rating of remaining, changed, and reverted contributions is similar to WikiBlame but much more elaborate. Based on an analysis of the whole English, Italian, and French Wikipedias, paragraphs and words can be coloured: text that is likely to be trustworthy is white, while text that should be checked is shaded more orange. See the demo and the paper presented at WWW2007 (slides will follow).
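
The basic idea can be sketched in a few lines. This is my own heavily simplified illustration, not De Alfaro's algorithm. It assumes that a text is just a set of distinct lowercase words, that surviving an edit by another author earns a fixed gain per word, and that removal by another author costs a fixed penalty per word. The function names, the parameters `gain`, `penalty` and `threshold`, and the example revisions are all hypothetical.

```python
from collections import defaultdict


def score_authors(revisions, gain=1.0, penalty=1.0):
    """Return (reputation, origin) for a list of (author, text) revisions.

    Simplification: each text is treated as a set of distinct lowercase words.
    """
    reputation = defaultdict(float)
    origin = {}  # word -> author who introduced it
    for editor, text in revisions:
        tokens = set(text.lower().split())
        for token in list(origin):
            author = origin[token]
            if token in tokens:
                # The word survived an edit by someone else: its author gains.
                if author != editor:
                    reputation[author] += gain
            else:
                # The word was removed by someone else: its author loses.
                if author != editor:
                    reputation[author] -= penalty
                del origin[token]
        for token in tokens:
            # Words not seen before are attributed to the current editor.
            origin.setdefault(token, editor)
    return dict(reputation), origin


def colour_tokens(origin, reputation, threshold=0.0):
    """Colour words white if their author's reputation exceeds the threshold."""
    return {
        token: "white" if reputation.get(author, 0.0) > threshold else "orange"
        for token, author in origin.items()
    }


revisions = [
    ("alice", "Wikipedia is a free encyclopedia"),
    ("bob", "Wikipedia is a free bad encyclopedia"),
    ("alice", "Wikipedia is a free encyclopedia"),
]
reputation, origin = score_authors(revisions)
print(reputation)  # {'alice': 5.0, 'bob': -1.0}
print(colour_tokens(origin, reputation))
```

In this toy history, the five words by alice survive bob's edit and earn her reputation, while bob's inserted word is reverted and costs him reputation. The real system tracks text through moves and rewrites and weights the judgement by the reputation of the later editor, which this sketch ignores.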

I hope that Wikipedians and local Wikimedia chapters will pick up these efforts to make the tools usable in practice, for instance on the Wikimedia Toolserver. Wikipedians, developers, Wikimedia organizations, and scientists need to work together closely to turn smart ideas from Wikipedia quality research into real applications.
