Goethe explains the Semantic Web

20 May 2012 at 15:49, 2 comments

Ever since Google introduced the “Knowledge Graph” a few days ago, the Semantic Web community has been grumbling. After all, Google is simply stealing ideas and techniques that have been developed for years under the names “Linked Data” and “Semantic Web”, and selling the whole thing under a new name! I find both the outrage and the thoughtless use of words like “Knowledge” and “Semantic” on both sides silly.

Pipe dreams of thinking machines that present “facts” as if they were objective judgments without social origin and context have simply become mainstream. Yet even with artificial intelligence it is, and will remain, people who decide what computers link and present. As Frank Rieger just wrote in the FAZ:

They are “our machines”, not “the machines”. They have [...] no consciousness, no will, no intentions. They are designed, built and deployed by people who pursue intentions and goals with them: following the zeitgeist, mostly the maximization of profit and positions of power.

In a weaker form, the mistaken belief in knowing computers shows up in the focus on “information”, while in most cases it is data that gets processed. Instead of a “Knowledge Graph” I would therefore prefer a “Document Graph”, in which the origin of statements and the changes made to them can be traced. Ted Nelson, the inventor of hypertext, coined the term “Docuverse” for this. As he writes in his correction of Tim Berners-Lee: “not ‘all the world’s information’, but all the world’s documents.” This transparency, however, is not in Google’s interest; and to the Semantic Web community, handling statements about statements is simply too much effort.

I therefore had to laugh out loud when Google opened another blog post, on the publication of weighted word lists, with a quotation from Goethe’s Faust:

Yet in each word some concept there must be…

In the “Docuverse” this quotation would be embedded by transclusion, so that the path back to the original could be traced. Here is the context of the quotation, in the German original from Wikisource:

Mephistopheles: [...] Im Ganzen – haltet euch an Worte! Dann geht ihr durch die sichre Pforte Zum Tempel der Gewißheit ein.

Schüler: Doch ein Begriff muß bey dem Worte seyn.

Mephistopheles: Schon gut! Nur muß man sich nicht allzu ängstlich quälen; Denn eben wo Begriffe fehlen, Da stellt ein Wort zur rechten Zeit sich ein. Mit Worten läßt sich trefflich streiten, Mit Worten ein System bereiten, An Worte läßt sich trefflich glauben, Von einem Wort läßt sich kein Jota rauben.

On closer inspection, the answer that Google (and not only Google) gives to the student’s quoted objection resembles the devil’s answer: the “system” that is “prepared” for us here (“Mit Worten ein System bereiten”) is an algorithmic one, built not on concepts but on word lists and other statistical methods.

In the Zeitschrift für kritische Theorie, Marcus Hawel comments on this very quotation of Goethe’s (or rather Google’s) that concepts remain uncritical as long as they merely “duplicate” what exists in a positivist manner, without regard to the “Seinsollen des Dings” (what the thing ought to be) (cf. Adorno). But when Google, the Semantic Web or any other computer system is granted normative power, the fun stops (and not only because of the paradoxes of deontic logic). It seems to me that the semantic knowledge world lacks language criticism, semiotics and critical theory.

HTTP Content negotiation and format selection

9 March 2012 at 15:33, 1 comment

On the University of Southampton’s WebTeam Blog, Christopher Gutteridge just complained that HTTP content negotiation could do better. And he is right. In a nutshell, content negotiation lets you retrieve different forms, versions or representations of a digital document at the same URI.

Content negotiation is an important part of the Web architecture, but it sucks for several reasons. The main problem is that there is no consensus about what defines a format, version or representation, or about how to refer to selected forms of a digital document. At least for selection by different versions in time, there is a good specification with Memento and its Accept-Datetime header. Content negotiation by language (Accept-Language) works similarly well, because language tags are clearly defined and precise, unless you want to query by other aspects of language (slang, readability…). But the concept of “content types” (also known as MIME types, requested with the Accept header) is rather fuzzy.

Earlier I wrote about the difficulties of defining “publication types”; it’s the same problem: there is no disjoint set of content types, document types, publication types, or formats! Each document and each representation can belong to multiple types, and types may overlap. The easiest case, described by Christopher, is a subset relation between two types: for instance every XML document is also a Unicode string (be warned: there is no hierarchy of types, but maybe a Directed Acyclic Graph!). Not even this simple relationship between content types can be handled by HTTP content negotiation.

Anyway, the real reason for writing this post is yet another CPAN module I wrote: Plack::App::unAPI implements unAPI. unAPI is a kind of “poor man’s content negotiation”. Instead of sending additional HTTP headers, you directly select a format with format=... parameters. In contrast to HTTP Content negotiation, an unAPI server also returns a list of all formats it supports. This is a strong argument for unAPI in my opinion. I also added a method to combine unAPI with HTTP content negotiation.
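The difference between the two selection mechanisms can be sketched in a few lines of Python. This is only an illustration and not how Plack::App::unAPI works internally: the format registry, the function names and the fallback order are assumptions, and wildcards such as `*/*` in the Accept header are ignored.

```python
# Hypothetical format registry: format name -> MIME type (illustrative values only).
FORMATS = {
    "json": "application/json",
    "xml": "application/xml",
    "ttl": "text/turtle",
}


def parse_accept(header):
    """Return the MIME types of an Accept header, ordered by descending q value."""
    ranked = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        ranked.append((q, fields[0]))
    # The sort is stable, so entries with equal q keep their header order.
    ranked.sort(key=lambda pair: -pair[0])
    return [mime for q, mime in ranked if q > 0]


def select_format(params, accept=""):
    """Pick a format name, or return None to signal 'send the list of formats'."""
    requested = params.get("format")
    if requested is not None:
        # An explicit format=... parameter always wins over the Accept header.
        return requested if requested in FORMATS else None
    for mime in parse_accept(accept):
        for name, registered in FORMATS.items():
            if mime == registered:
                return name
    return None
```

When `select_format` returns `None`, a server in this sketch would answer with the names in `FORMATS`, which is exactly the information that plain HTTP content negotiation never gives the client.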

BibSonomy usability disaster

8 March 2012 at 12:01, 1 comment

In addition to other citation management systems, I use BibSonomy. It’s one of the lesser-known social cataloging tools. Not as commercial and shiny as Mendeley, but useful, especially if you happen to collect many papers in computer science. The BibSonomy team is open and friendly and the project is connected to the local university in Kassel (by the way, they are hiring students!). I still like BibSonomy, which is why I wrote this rant.

Since last week BibSonomy has a new layout. Sure, people always complain when you change the interface they are used to, but this change is a usability disaster, at least if you work on a netbook with a small screen, like me. To illustrate the problems, here is a screenshot with my quick notes in red:

The screen is crowded with irritating and ugly user interface elements. The actual usage area is put into a little frame. Yes, a frame, in 2012! You cannot just scroll down to get rid of the header, because the frame is fixed. There are more flaws, for instance the duplication of account elements, scattered over no fewer than four places (logged in as…, logout, user:…, myBibSonomy) and the ugly custom icons (there are several great icon sets that could be used instead, for instance Silk, GCons, Glyphicons, Picol…). It does not get better if you try to edit content. I am sorry, but this is a usability disaster.

P.S: The BibSonomy team has responded and they fixed part of the problem with a nasty hack, based on actual screen size.

Embedded diagrams and pandoc

24 January 2012 at 13:02, no comments

If you don’t know John MacFarlane’s Pandoc, the “Swiss army knife of document formats”, you should definitely give it a try! Pandoc’s abstract document model and its serialization in an extended variant of Markdown let you focus on the structure and content of a text instead of dealing with formats and user interfaces. In my opinion pandoc is the best tool for document creation invented since (La)TeX (moreover, pandoc is a good argument to finally learn programming in Haskell). Images in pandoc markdown documents, however, are only referenced by file name. This requires some preprocessing if you want to create different files for different document formats, especially bitmap images and vector images. So I hacked a little preprocessing script that lets you embed images in pandoc’s markup language. For instance you write

~~~~ {.dot .Grankdir:LR}
digraph {
A -> B -> C;
A -> C;
}
~~~~

and you get

or based on rdfdot you write

~~~~ {.rdfdot}
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@base <http://example.com/> .
<alice> foaf:name "Alice" ;
    foaf:knows [ foaf:name "Bob" ] .
~~~~

and you get

A detailed description is included in the manual which has been transformed automatically to HTML and to PDF. Compare both documents to see that HTML includes PNG images and PDF contains vector images!

Feel free to reuse and modify the script, for instance by adding more diagram types! For instance how about ASCII tabs and ABC notation if you write about music?
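The preprocessing idea can be sketched as follows. This is not the script mentioned above, just a minimal Python illustration under stated assumptions: the function name, the generated file names and the regular expression are made up, only `.dot` blocks are handled, and the actual rendering (for instance with Graphviz) is left to the caller.

```python
import re

# Matches a fenced block opened by "~~~~ {.dot ...}" and closed by "~~~~".
DOT_BLOCK = re.compile(
    r"^~~~~ *\{\.dot\b[^}]*\}\n(.*?)^~~~~ *$",
    re.MULTILINE | re.DOTALL,
)


def extract_diagrams(markdown, extension="png"):
    """Replace each dot block with an image reference.

    Returns the rewritten text and a dict {file name: dot source}.
    """
    sources = {}

    def replace(match):
        # Hypothetical naming scheme: diagram-1.png, diagram-2.png, ...
        name = "diagram-%d.%s" % (len(sources) + 1, extension)
        sources[name] = match.group(1)
        return "![](%s)" % name

    return DOT_BLOCK.sub(replace, markdown), sources
```

Calling it once with `extension="png"` and once with a vector format such as `"pdf"` gives two variants of the same document, one per output format, which is the reason the preprocessing step is needed in the first place.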

Linked local library data simplified

10 January 2012 at 14:53, 1 comment

A few days ago Lukas Koster wrote an article about local library linked data. He argues that bibliographic data published by libraries as linked data is not “the most interesting type of data that libraries can provide”. Instead “library data that is really unique and interesting is administrative information about holdings and circulation”. So libraries “should focus on holdings and circulation data, and for the rest link to available bibliographic metadata as much as possible.” I fully agree with these statements, but not with the exact method he proposes for publishing local library data.

Among other projects, Koster points to LibraryCloud to aggregate and deliver library metadata, but it looks like they reinvent yet more wheels in the form of their own APIs and formats for search and for bibliographic description. Maybe I am wrong about this project, as they have only just started to collect holding and circulation data.

At the recent Semantic Web in Bibliotheken conference, Magnus Pfeffer gave a presentation about “Publishing and consuming library loan information as linked open data” (see slides) and I talked about a Simplified Ontology for Bibliographic Resources (SOBR), which is mainly based on the DAIA data model. We are going to align both data models, and I hope that the next libraries will first look at these existing solutions instead of inventing yet another data format or ontology. Koster’s proposal, however, looks like yet another such solution: he argues that “we need an extra explicit level to link physical Items owned by the library or online subscriptions of the library to the appropriate shared network level” and suggests introducing a “holding” level. So there would be five levels of description:

  • Work
  • Expression
  • Manifestation
  • Holding
  • Item

Apart from the fact that at least one of Work, Expression, Manifestation is dispensable, I disagree with a Holding level above the Item. My current model consists of at most three levels of documents:

  • document as abstract work (frbr:Work, schema:CreativeWork…)
  • bibliographic document (frbr:Manifestation, sobr:Edition…)
  • item as concrete single copy (frbr:Item…)

The term “level” is misleading because these classes are not disjoint. I depicted their relationship in a simple Venn diagram:

For local library data, we are interested in single items, which are copies of general documents or editions. Where do Koster’s “holding” entities fit into this model? He writes “a specific Holding in this way would indicate that a specific library has one or more copies (Items) of a specific edition of a work (Manifestation), or offers access to an online digital article by way of a subscription.” The core concepts as I read them are:

  • “one or more copies (Items)” = frbr:Item
  • “specific edition of a work (Manifestation)” = sobr:Edition or frbr:Manifestation
  • “has one [...] or offer access to” = ???

Instead of creating another entity for holdings, you can express the ability “to have one or offer access to” by DAIA Services. The class daia:Service can be used for an unspecified service and more specific subclasses such as loan, presentation, and openaccess can be used if more is known. Here is a real example with all “levels”:


@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix daia: <http://purl.org/ontology/daia/> .

# document as abstract work
<http://dbpedia.org/resource/Tractatus_Logico-Philosophicus>
    a bibo:Book ;
    daia:edition <urn:nbn:de:eki:GBV128382600> ;
    daia:exemplar
        <http://uri.gbv.de/document/opac-de-23:epn:266449999> .

# bibliographic document (edition)
<urn:nbn:de:eki:GBV128382600> a bibo:Book ;
    daia:exemplar
        <http://uri.gbv.de/document/opac-de-23:epn:266449999> .

# item as concrete single copy, with the service it is available for
<http://uri.gbv.de/document/opac-de-23:epn:266449999>
    a bibo:Book, daia:Item ;
    daia:heldBy <http://uri.gbv.de/organization/isil/DE-23> ;
    daia:availableFor [
        a daia:Service ;
        daia:providedBy <http://uri.gbv.de/organization/isil/DE-23>
    ] .

I have only made up the RDF property daia:edition from the SOBR proposal because FRBR relations are too strict. If you know a better relation to directly relate an abstract work to a concrete edition, please let me know.



image created with rdfdot

Request for comments: final specification of DAIA

6 January 2012 at 12:13, 2 comments

When I started to create an API for availability lookup of documents in libraries in 2008, I was surprised that such a basic service was so poorly defined. The best I could find was the just-published recommendation of the Digital Library Federation (DLF-ILS). Even there, availability status was basically a plain text message (section 6.3.1 and appendix 4 and 5). Other parts of the DLF-ILS GetAvailability response were more helpful, so they are all part of the Document Availability Information API (DAIA). Here is a simple mapping from DLF-ILS to DAIA:

  • bibliographicIdentifier (string) → document (URI)
  • itemIdentifier (string) → item (URI)
  • dateAvailable (dateTime) → expected (xs:dateTime or xs:date or “unknown”) or delay (xs:duration or “unknown”)
  • location (string) → storage (URI and/or string, plus optional URL)
  • call number (string) → label (string)
  • holdQueueLength (int) → queue (xs:nonNegativeInteger)
  • status (string) and circulating (boolean) → available/unavailable (with service type and additional information)

So you could say that DAIA implements the abstract GetAvailability function from DLF-ILS. I like abstract, language-independent specifications, but they must be precise and testable (see Meek’s forgotten paper The seven golden rules for producing language-independent standards). DAIA is more than an implementation: it provides both an abstract standard and bindings to several data languages (XML, JSON, and RDF). The conceptual DAIA data model defines some basic concepts and relationships (document, items, organisations, locations, services, availabilities, limitations…) independent of whether they are expressed in XML elements, attributes, RDF properties, classes, or any other data structuring method. The only reference to specific formats is the requirement that all unique identifiers must be URIs. Right now there is an XML Schema if you want to express DAIA in XML and an OWL ontology for RDF.
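The field mapping listed above can be written down as a small lookup table. The following Python sketch only renames fields and produces a flat dictionary; it is not one of the DAIA bindings, the function name is hypothetical, value conversion (strings to URIs) is left out, and treating a missing date as “unknown” is an assumption. The status and circulating fields are skipped because a free-text status cannot be mapped mechanically.

```python
# Field names follow the DLF-ILS to DAIA mapping list; everything else is illustrative.
DLF_TO_DAIA = {
    "bibliographicIdentifier": "document",
    "itemIdentifier": "item",
    "location": "storage",
    "call number": "label",
    "holdQueueLength": "queue",
}


def to_daia(record):
    """Rename the DLF-ILS fields of one availability record to DAIA names."""
    result = {}
    for field, value in record.items():
        if field in DLF_TO_DAIA:
            result[DLF_TO_DAIA[field]] = value
    # dateAvailable becomes "expected"; an empty date is reported as "unknown".
    if "dateAvailable" in record:
        result["expected"] = record["dateAvailable"] or "unknown"
    return result
```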

In its fourth year of development (see my previous posts from 2009) DAIA seems to have enough momentum to finally get accepted in practice. We use it in the GBV library union (public server at http://daia.gbv.de/), there are independent implementations such as in Doctor-Doc, there is client support in VuFind, and I have heard rumors that DAIA capabilities will be built into EBSCO and Summon Discovery Services. Native support in Integrated Library Systems, however, is still lacking. I have already given up hope and prefer a clean DAIA wrapper over a broken DAIA implementation anyway. If you are interested in creating your own DAIA server/wrapper or client, have a look at my reference implementation DAIA and Plack::App::DAIA at CPAN and Oliver Goldschmidt’s PHP implementation in our common github repository. A conceptual overview as tree (DAIA/JSON, DAIA/XML) and as graph (DAIA/RDF) can be found here.

Still, there are some details to be defined, and I’d like to solve these issues to arrive at DAIA version 1.0. These are:

  • How to deal with partial publications (you requested an article but only get the full book, or you requested a series but only get a single volume).
  • How to deal with digital publications (especially their possible service types: is “download” a service distinct from “loan”, or is “presentation” similar to online access restricted to the library’s intranet?).
  • Final agreement on service types (now there are presentation: item can be used in the institution, loan: item can be used outside of the institution for a limited time, interloan: item can be sent to another institution, openaccess: item can be accessed without restriction, just get a free copy). Some extensions have been proposed.
  • A set of common limitation types (for instance IP-based access restriction, permission-based access etc.).

I’d be happy to get some more feedback on these issues, especially concrete use cases. We are already discussing them on the daia-devel mailing list, but you can also comment in your own blog, at public-lld, code4lib, ils-di etc.

Can SOBR help publishing library holdings?

2 December 2011 at 01:08, 5 comments

I just participated in the German conference Semantic Web in Bibliotheken which took place in Hamburg this week. This year there were two slots for lightning talks, but unfortunately participants did not catch on, so we only had four of them. Lightning talks are a good chance to present something unfinished that you need input for, so I presented the Simplified Ontology for Bibliographic Resources (SOBR) as “FRBR light”. You can find the current specification of SOBR at github, which means the specification is still evolving and I’d like to get more feedback.

SOBR grew out of a discussion on the Library Linked Data mailing list about the (disputed) disjointedness of FRBR classes. SOBR has a history in the Document Availability Information API (DAIA), which SOBR might be merged into. The use case of both is publishing information about holdings from library catalogs as Linked Open Data. The information most requested is probably connected to holdings: library users only ask “where is it?” and “how can I get it?”. In these questions, the little word “it” refers to a specific publication. In the answers, however, “it” usually refers to some holding or copy of this publication. Sometimes the holding contains more than the publication (for instance if you ask for an article in a book) and sometimes you get multiple holdings (for instance if you ask for a large work that is split into multiple volumes). Sometimes there are multiple holdings with different content to choose from, because there are different editions, forms, translations etc. of the requested publication.

A long time ago, some librarians thought about similar questions and answers and came up with the Functional Requirements for Bibliographic Records (FRBR). I tried hard to accept FRBR (I even drew those ugly diagrams that people find when they look up FRBR in Wikipedia). But FRBR does not help me to publish existing library catalogs as Linked Open Data. In our catalog databases we have records that refer to editions, connected with records that refer to holdings (I’ll ignore the little exceptions and nasty special cases such as multiple holdings described by one look-like-a-holding-record). In addition there are some records that refer to series, works, and other types of abstract documents without direct holdings, which are connected to records that refer to editions.

Maybe we can simplify this to two entities: general documents (bibo:Document) and items (with frbr:Item) as a special kind of document. The current design of SOBR also contains a class for editions, but I am not sure whether this class is also needed. At least we need three properties: to relate documents to items (daia:exemplar), to relate documents to editions (daia:edition?) and to relate documents to their parts (dcterms:hasPart). To avoid the need for blank nodes, I’d also define properties that relate documents to partial items (daia:extract = dcterms:hasPart + daia:exemplar) and documents to partial editions (daia:editionPart?).
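One way to read the “+” in daia:extract = dcterms:hasPart + daia:exemplar is as a property chain: if a document has a part and that part has an exemplar, the document has an extract. The Python sketch below works on plain tuples of prefixed names; the function name and the `ex:` identifiers in the usage note are made up, and this reading of the “+” is an assumption rather than something the SOBR specification states.

```python
def derive_extracts(triples):
    """Return the triples plus every derived (document, 'daia:extract', item)."""
    triples = list(triples)
    # Index: part -> set of documents that contain it.
    wholes = {}
    for subject, predicate, obj in triples:
        if predicate == "dcterms:hasPart":
            wholes.setdefault(obj, set()).add(subject)
    derived = set(triples)
    for subject, predicate, obj in triples:
        if predicate == "daia:exemplar":
            # subject is a part here; link each containing document to the item.
            for whole in wholes.get(subject, ()):
                derived.add((whole, "daia:extract", obj))
    return derived
```

Given `("ex:book", "dcterms:hasPart", "ex:chapter")` and `("ex:chapter", "daia:exemplar", "ex:copy")`, the result also contains `("ex:book", "daia:extract", "ex:copy")`, so the intermediate part no longer has to be named or modelled as a blank node.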

Feedback on SOBR is welcome, especially if you provide examples with existing URIs (or at least local identifiers of already existing data) instead of theoretical FRBR-like made-up examples. The best way to find a good ontology for publishing library holdings is to actually publish data that describes library holdings! The following image is based on an example that connects a work from LibraryThing and from DBpedia with a partial edition from WorldCat, a full edition from the German National Library, and a holding from Hamburg University:

@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix daia: <http://purl.org/ontology/daia/> .
@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://www.librarything.com/work/70394> a bibo:Document ;
  owl:sameAs <http://dbpedia.org/resource/Living_My_Life> ;
  daia:edition <http://d-nb.info/1001703464> , [
    a bibo:Collection # , daia:Document
    ; dct:hasPart <http://www.worldcat.org/oclc/656754414>
  ]
    ;
  # daia:exemplar <http://d-nb.info/1001703464> ; ?
  daia:editionPart <http://www.worldcat.org/oclc/656754414> .

<http://d-nb.info/1001703464> a frbr:Item , bibo:Document ;
   daia:exemplar <http://uri.gbv.de/document/opac-de-18:epn:1220640794> .



image created with rdfdot

URI namespace lookup with prefix.cc and RDF::NS

3 November 2011 at 17:13, no comments

Probably the best feature of RDF is that it forces you to use Uniform Resource Identifiers (URIs) instead of private, local identifiers which only make sense in some context. URIs are long and cumbersome to type, so popular URIs are abbreviated with namespace prefixes. For instance foaf:Person is expanded to http://xmlns.com/foaf/0.1/Person, once you have defined prefix foaf for namespace http://xmlns.com/foaf/0.1/. In theory URI prefixes in RDF are arbitrary (unlike in XML, where prefixes can matter, contrary to popular belief). In practice people prefer to agree on one or two known prefixes for common URI namespaces (unless they want to confuse human readers of RDF data).

So URI prefixes and namespaces are vital for handling RDF. However, you still have to define them in almost every file and application. In the end people have to copy & paste the same prefix definitions again and again. Fortunately Richard Cyganiak created a registry of popular URI namespaces, called prefix.cc (it’s open source), so people at least know where to copy & paste from. I had enough of copying the same URI prefixes from prefix.cc over and over again, so I created a Perl module that includes snapshots of the prefix.cc database. It includes a simple command line client that is installed automatically:

  $ sudo cpanm RDF::NS
  $ rdfns rdf,foaf.ttl
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

In your Perl code, you can use it like this:

  use RDF::NS;
  my $NS = RDF::NS->new('20111102');
  $NS->foaf_Person; # returns "http://xmlns.com/foaf/0.1/Person"

If you miss a URI prefix, just add it at http://prefix.cc and it will be included in the next release.
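For readers who do not use Perl, the lookup itself is easy to illustrate. The following Python sketch is not a port of RDF::NS: the two-entry prefix table is hand-made instead of being loaded from a prefix.cc snapshot, and the function names are hypothetical.

```python
# A tiny hand-made prefix table; a real one would come from a prefix.cc snapshot.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(name, prefixes=PREFIXES):
    """Expand 'prefix:local' or 'prefix_local' to a full URI, or return None."""
    for separator in (":", "_"):
        prefix, found, local = name.partition(separator)
        if found and prefix in prefixes:
            return prefixes[prefix] + local
    return None


def turtle_prefixes(names, prefixes=PREFIXES):
    """Format Turtle @prefix lines for the known names, sorted alphabetically."""
    return [
        "@prefix %s: <%s> ." % (name, prefixes[name])
        for name in sorted(names)
        if name in prefixes
    ]
```

Both `expand("foaf:Person")` and `expand("foaf_Person")` yield the full FOAF URI, and `turtle_prefixes(["rdf", "foaf"])` produces the kind of `@prefix` lines that otherwise get copied and pasted by hand.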

TPDL 2011 Doctoral Consortium – part 3

25 September 2011 at 17:36, no comments

See also part 1 and part 2 of conference-blogging and #TPDL2011 on twitter.

My talk about general patterns in data was received well and I got some helpful input. I will write about it later. Steffen Hennicke, another PhD student of my supervisor Stefan Gradmann, then talked about his work on modeling Archival Finding Aids, which are possibly expressed in EAD. The structure of EAD is often not suitable to answer user needs. For this reason Hennicke analyses EAD data and reference questions to develop better structures that users can follow to find what they are looking for in archives. This is done in CIDOC-CRM as a high-level ontology, and the main result will be an expanded EAD model in RDF. To me the problem of “semantic gaps” is interesting, and I am thinking about using some of Hennicke’s data as an example to explain data patterns in my work.

The last talk, by Rita Strebe, was about the aesthetic user experience of websites. One aim of her work is to measure the significance of aesthetic perception. In particular, her hypotheses to be evaluated by experiments are:

H1: On a high level, the viscerally perceived visual aesthetics of websites effects approach behaviour.
H2: On a low level, the viscerally perceived visual aesthetics of websites effects avoidance behaviour.

Methods and preliminary results look valid, but the relation to digital libraries seems weak, and so was the participants’ expertise in Strebe’s motivation and methods. I suppose her work fits better into Human-Computer Interaction.

After the official part of the program Vladimir Viro briefly presented his music search engine peachnote.com, which is based on scanned musical scores. If I were working in or with music libraries, I would not hesitate to contact Viro! I also thought about a search for free musical scores within the Wikimedia framework. The Doctoral Consortium ended with a general discussion about dissertations, science, libraries, users, and everything, as it should be :-)

TPDL 2011 Doctoral Consortium – part 2

25 September 2011 at 12:42, no comments

The TPDL 2011 Doctoral Consortium, which I already blogged about in part 1, continued with a 15-minute delay: Christopher Gibson also talked about eBooks, and I wonder why his talk was not combined with Luca Colombo’s work on eBook reading experiences. Gibson’s specific topic is eBook lending services in UK public libraries. To quote the research questions from his paper:

Q1. How have public libraries addressed ebook service provision in the UK?
Q2. What challenges and opportunities exist in incorporating ebook lending into other reader services?
Q3. Is it feasible to lend ebook reading devices from public libraries?
Q4. How can the effectiveness of ebook lending services be measured?
Q5. How do library users view the provision of ebook lending services?
Q6. How can effective ebook lending services be developed?

To me an interesting aspect of his methodology was the use of targeted FOI (freedom of information) requests to gather data about eBook lending services. I cannot imagine this in Germany, where “Informationsfreiheit” is still in its infancy. One result from another survey done by Gibson: most eBooks are not included in library catalogs. I think this failure is found in German libraries too. In summary, the PhD project looked well-founded, with real practical value for libraries. On the other hand, the theoretical contribution, for instance the question of what “lending” can mean in the digital library world, was only added in the discussion afterwards.

The next presenting PhD student was Adam Sofronjievic. I am sorry that I could not fully concentrate on his talk about a New Paradigm of Library Collaboration although it seemed very interesting. My talk is next :-)
