Identifier in RDF considered harmful

18 June 2013 at 11:31, 7 comments

I am currently analyzing the RDF data of the ZDB's Linked Data Service in order to use it directly in the GBV's RDF library directory. In doing so I noticed some differences in how identifiers are handled. Here is an example from the data of the
Stabi Berlin (I have replaced the RDF subject with $BIB for brevity):

GBV-RDF

$BIB
  dc11:identifier "DE-1a" ;
  foaf:phone <tel:+49-30-2-66-333501> , <tel:+49-30-2-66-433888> .

ZDB-RDF

$BIB
  dc11:identifier "(ISIL)DE-1a" ;
  vcard:tel [
     a vcard:Pref, vcard:Voice ;
     rdf:value "+49 30 2 66-433888 (Auskunft)"
  ], [
     a vcard:Fax, vcard:Pref ;
     rdf:value "+49 30 2 66-333501"
  ] .

Such divergent encodings become particularly problematic when RDF data from several sources is to be merged. In this example the Stabi Berlin suddenly has two identifiers and four phone numbers. Incidentally, phone numbers can also be encoded as URIs according to RFC 3966; unfortunately this is not possible for ISILs, because the international ISIL agency has failed to take care of it. Fundamentally this example reinforces my conviction that identifiers in RDF data are garbage as long as they are not encoded as URIs, and in many cases preferably not as HTTP URIs in multiple variants, as is common practice in Linked Data!
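Such divergences can be reduced by normalizing identifiers before graphs are merged. A minimal sketch in Python: the `(ISIL)` prefix convention and the phone number formats are taken from the example above, while the function names and normalization rules are my own simplification.

```python
import re

def normalize_isil(value):
    """Strip a '(ISIL)' qualifier so both encodings yield the same key."""
    return re.sub(r'^\(ISIL\)', '', value)

def tel_uri(number):
    """Build a tel: URI (RFC 3966) from a display string:
    drop annotations like '(Auskunft)' and unify separators to '-'."""
    number = re.sub(r'\([^)]*\)', '', number)  # remove parenthetical notes
    number = number.strip()
    number = re.sub(r'[\s-]+', '-', number)    # spaces and hyphens become '-'
    return 'tel:' + number

print(normalize_isil('(ISIL)DE-1a'))             # → DE-1a
print(normalize_isil('DE-1a'))                   # → DE-1a
print(tel_uri('+49 30 2 66-433888 (Auskunft)'))  # → tel:+49-30-2-66-433888
```

After this kind of normalization both sources contribute the same identifier and the same two phone URIs, so the merged graph no longer doubles them.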

Modeling is difficult

21 September 2011 at 00:33, 2 comments

Yesterday Pete Johnston wrote a detailed blog article about the difficulties of “right” modeling with SKOS and FOAF in general, and about the proposed RDF property foaf:focus in particular. As Dan Brickley wrote in a recent mail, “foaf:focus describes a link from a skos:Concept to ‘the thing itself’. Not every SKOS concept (in a thesaurus or classification scheme) will have such a direct “thing”, but many do, especially concepts for people and places.”

Several statements in this discussion made me laugh and smile. Don’t get me wrong – I honor Pete, Dan, and the whole Semantic Web community, but there is a regular lack of philosophy and information science. There is no such thing as ‘the thing itself’, and all SKOS concepts are equal. Even the distinction between an RDF ‘resource’ and a SKOS ‘concept’ is artificial. The problem originates not from wrong modeling, which could be solved by the right RDF properties, but from different paradigms and cultures. There will always be different ways to describe the same ideas with RDF, because neither RDF nor any other technology will ever fully capture our ideas. These technologies are not about things but only about data. As William Kent wrote in Data and Reality (1978): “The map is not the territory” (by the way, last year Chris Rusbridge quoted Kent in the context of linked data). As Erik Wilde and Robert J. Glushko wrote in a great article (2008):

RDF has succeeded beyond the wildest expectations as a convenient format for encoding information in an open and easily computable fashion. But it is just a format, and the difficult work of analysis and modeling information has not and will never go away.

Ok, they referred not to “RDF” but to “XML”, so the quotation is wrong. But the statement is right for both data structuring methods. No matter if you put your data in XML, in RDF, or carve it in stone – there will never be a final model, because there’s more than one way to describe something.

Query parts in Linked Data URIs considered harmful

29 May 2011 at 01:40, no comments

The generic syntax of a URI (and of an IRI, with slightly different definitions of the elements) is:

<scheme name> : <hierarchical part> [ ? <query> ] [ # <fragment> ]

RDF is based on “URI references”, which have a different syntax, but all practical RDF data should follow the generic URI syntax. This is also implied by the Linked Data Principles, which encourage you to use HTTP URIs. Furthermore it is good advice not to include fragments in your URIs if you care about the coexistence of the Web and the Semantic Web. Yes, there is RDF data with fragment parts, and the so-called hash URIs are even given as one design pattern, but you can avoid a lot of trouble if you use URIs without them. By the way, fragment identifiers are also used a lot in JavaScript-based web applications that spoil the whole concept of REST, as described in this recent in-depth article on hash URIs.
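For illustration, the components of the generic syntax can be inspected with Python's standard library, which splits a URI into exactly the parts shown above:

```python
from urllib.parse import urlsplit

# decompose a URI into scheme, hierarchical part, query, and fragment
parts = urlsplit('http://example.org/path?foo=1#sec')
print(parts.scheme)              # → http
print(parts.netloc, parts.path)  # → example.org /path
print(parts.query)               # → foo=1
print(parts.fragment)            # → sec
```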

I would even go further and say that well-designed URIs for Linked Data should also forgo the query part. A clean URI to be used as an identifier in Linked Data should stick to this limited generic syntax:

http: <hierarchical part>

I do not argue to disallow URIs with a query part, but I wonder how often they are really used and needed for resources other than Web resources. Especially URIs for non-information resources should not have a query part. Most web applications will not distinguish between these two:

http://example.org?foo=1&bar=2

http://example.org?bar=2&foo=1

These are different URIs but equivalent URLs. Choosing URIs that are not equivalent to other URIs in common non-RDF applications is good advice, isn’t it? If you choose a cleaner URI like http://example.org/foobar you could use additional query parts to refer to different representations (information resources) and services around the resource that is referenced by the base URI.
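The distinction is easy to demonstrate: parsed as key-value mappings the two query strings are equal, while as URI strings (and thus as RDF identifiers) they differ. A sketch using Python's standard library:

```python
from urllib.parse import urlsplit, parse_qs

a = 'http://example.org?foo=1&bar=2'
b = 'http://example.org?bar=2&foo=1'

# As plain strings, and therefore as RDF URIs, they are distinct ...
print(a == b)  # → False

# ... but almost any web application will treat them alike:
print(parse_qs(urlsplit(a).query) == parse_qs(urlsplit(b).query))  # → True
```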

Proposed changes in VIAF RDF

13 April 2011 at 13:42, 2 comments

The Virtual International Authority File (VIAF) is one of the distinguished showcases of international library community projects. For more than five years, name authority files from different countries have been mapped in VIAF. With VIAF you can look up records about authors and other people, and see which identifiers are used for the same person in different national library catalogs. For some people there are also links to biographical articles in Wikipedia (I think only the English Wikipedia, but you can get some mappings to other Wikipedias via the MediaWiki API), and I hope that there will be links to LibraryThing author pages, too.

However, VIAF is not used as much as it could be, for two reasons: first, a lack of easy-to-understand documentation, examples, and simple APIs; and second, the difficulty for potential users to adopt the technologies. Unfortunately the second reason is the larger barrier: many libraries cannot even provide a simple way to directly link to publications from and/or about a specific person once you have the right person identifier from VIAF. If you cannot even provide such a fundamental method to link to your database, how should you be able to integrate VIAF for better retrieval? VIAF can do little about this lack of technical skills in libraries; it can only help integrate VIAF services into library software to some degree. This brings me back to the first reason: you can always further improve documentation, examples, the design of your APIs, etc. to simplify the use of your services. As a developer I found VIAF well documented and not very difficult to use, but there are many small things that could be made better. This is natural and a good thing, as long as you communicate with your users and adopt suggested changes, as VIAF does.

For instance, yesterday Jeffrey A. Young, one of the developers behind VIAF at OCLC, published a blog article about proposed changes to the RDF encoding of VIAF. I hope that other people will join the discussion so we can make VIAF more usable. There is also a discussion about the changes at the library linked data mailing list. And earlier this month, at the Code4Lib mailing list, there was a controversial thread about the problems of mapping authority records that are not about people (see my statement here).

I appreciate the simplification of VIAF RDF and only disagree in some details. The current proposal is illustrated in this picture (copied from Jeffrey’s original article):

This looks straightforward, doesn’t it? But it only suits simple one-to-one mappings. Any attempt to put more complex mappings into this scheme (as well as into the existing VIAF RDF scheme) will result in a disaster. There is nothing wrong with simple one-to-one mappings, and with SKOS you can even express different kinds of mappings (broader, narrower, exact, close), but you should not expect too much precision and detail.

I wonder why on one side of the diagram links are expressed via foaf:focus and on the other side via owl:sameAs. In my opinion, as VIAF is about mapping authority files, all mapping links should use SKOS mapping properties. There is nothing wrong in declaring a URI like http://viaf.org/viaf/39377930/ to stand for a foaf:Person, an rdaEnt:Person, and a skos:Concept at the same time. And the web page that gives you information about the person can also get the same URI (see this article for a good defense of the HTTP-303 mess).

Sure, Semantic Web purists, who still dream of hard artificial intelligence, will disagree. But in the end RDF data is always about something instead of the thing itself. For practical use it would help much more to think about how to map complex concepts at the level of concept schemes (authority records, classifications, thesauri, etc.) instead of trying to find the one “right” model of reality. As soon as we use language (and data is a specific kind of language), all we have is concepts. In terms of RDF: using owl:Thing instead of skos:Concept is in most cases an illusion of control.
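In Turtle, the kind of mapping statement I argue for could look like this (the VIAF URI is real, the authority record URI on the right-hand side is made up for illustration):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://viaf.org/viaf/39377930/>
    a skos:Concept, foaf:Person ;
    skos:exactMatch <http://example.org/authority/record/123> .
```

Where a mapping is less certain, skos:closeMatch, skos:broadMatch, or skos:narrowMatch can express the degree of correspondence, which owl:sameAs cannot.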

Linked Data/RDF/SPARQL Documentation Challenge

23 January 2011 at 13:49, 4 comments

In a blog posting earlier in January, Richard Pop described the difficulty of getting started with Linked Data for the average programmer: “there is a new format (RDF), new query language (SPARQL) and new code libraries to learn and install (if you can find them in the first place)”. For this reason Pop created a challenge to come up with documentation for the basic steps of developing a Linked Data/RDF/SPARQL application. You can find contributions by Bill Roberts, Christopher Gutteridge, Pezholio, Gunnar Aastrand Grimnes, Tom Morris, Jeni Tennison (and here), Niklas Lindström, Felix Ostrowski, and John Goodwin. Meanwhile there are £50 prizes for solutions in PHP, Ruby, and Python.

I have engaged in RDF for years, but mostly from a theoretical point of view. This year I started creating Linked Data applications with RDF::Trine, a powerful programming library by Gregory Williams for handling RDF data in Perl (see more at perlrdf.org). I contributed some minor patches to further simplify its usage. Next month I will write a tutorial on how to create an RDF-based application in Perl. Meanwhile I am working on a Perl module (limited preview) to create classifications and thesauri in SKOS/RDF.
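The “basic steps” really are small. As a rough sketch of the kind of thing such a tutorial covers, here is a minimal N-Triples reader in plain Python (the example URIs are made up; a real application should of course use a proper RDF library such as RDF::Trine or rdflib, which handle escaping, blank nodes, and datatypes):

```python
import re

# one simplified N-Triples line: subject and predicate as <URI>,
# object as <URI> or a plain "literal", terminated by ' .'
TRIPLE = re.compile(r'<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.')

def parse_ntriples(text):
    triples = []
    for line in text.splitlines():
        m = TRIPLE.match(line.strip())
        if m:
            s, p, o_uri, o_lit = m.groups()
            triples.append((s, p, o_uri if o_uri is not None else o_lit))
    return triples

data = '''<http://example.org/work> <http://purl.org/dc/elements/1.1/creator> "Some Author" .
<http://example.org/work> <http://purl.org/dc/elements/1.1/title> "A Title" .'''

for s, p, o in parse_ntriples(data):
    print(p, '->', o)
```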

Class or Property? Objectification in RDF and data modeling

14 August 2009 at 00:23, 4 comments

A short twitter statement, in which Ross Singer asked about encoding MARC relator codes in RDF, reminded me of a basic data modeling question that I have been thinking about for a while: When should you model something as a class and when should you model it as a property? Is there a need to distinguish at all? The question is not limited to RDF but is fundamental in data/information modeling. In Entity-relationship modeling (Chen 1976) the question is whether to use an entity or a relation. Let me give an example with two subject-predicate-object statements in RDF Notation3:

:Work dc:creator :Agent .
:Agent rdf:type :Creator .

The first statement says that a specific agent (:Agent) has created (dc:creator) a specific work (:Work). The second statement says that :Agent is a creator (:Creator). In the first statement dc:creator is a property, while in the second :Creator is a class. You could define that the one implies the other, but you still need two different concepts because classes and properties are disjoint (at least in OWL DL; I am not sure about plain RDF). In Notation3 the implications may be written as:

@forAll X1, X2. { X1 dc:creator X2 } => { X2 a :Creator } .
@forAll Y1. { Y1 a :Creator } => { @forSome Y2. Y2 dc:creator Y1 } .

If you define two URIs for the class and the property of the same concept (the concept of a creator and of creating something), then the two things are tightly bound together: everyone who ever created something is a creator, and to be a creator you must have created something. This logical rule sounds rather rude if you apply it to other concepts, like to lie and to be a liar, or to sing and to be a singer. Think about it!
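In a closed setting the first implication rule is easy to emulate. A toy sketch with triples as plain Python tuples (the qnames are abbreviated strings, not resolved URIs):

```python
# toy triple store: triples as plain tuples, qnames as strings
DC_CREATOR, RDF_TYPE, CREATOR = 'dc:creator', 'rdf:type', ':Creator'

triples = {(':Work', DC_CREATOR, ':Agent')}

# rule 1: whoever created something is a Creator
inferred = {(o, RDF_TYPE, CREATOR) for (s, p, o) in triples if p == DC_CREATOR}
triples |= inferred

print((':Agent', RDF_TYPE, CREATOR) in triples)  # → True
```

The second rule cannot be materialized the same way: it asserts that some work exists without naming it, which triples can only express with a blank node. That existential is part of what makes the tight coupling feel so rude.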

Besides the lack of fuzzy logic on the Semantic Web, I miss an easy way to do “reification” (there is another concept called “reification” in RDF but I have never seen it in the wild) or “objectification”: you cannot easily convert between classes and properties. In a closed ontology this is less of a problem because you can just decide whether to use a class or a property. But the Semantic Web is about sharing and combining data! What if Ontology A has defined a “Singer” class and Ontology B a “sings” property which refer to the same real-world concept?
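One direction of a bridge between the two ontologies can at least be written down in the same rule notation as above (both ontology prefixes are made up):

```n3
@prefix oa: <http://example.org/ontologyA#> .
@prefix ob: <http://example.org/ontologyB#> .

# everyone who sings something is a Singer
@forAll X, Y. { X ob:sings Y } => { X a oa:Singer } .
```

The reverse direction needs an existential (@forSome) for the unnamed song, which is exactly where objectification gets hard.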

Other data modeling languages support objectification (more or less). Terry Halpin, the creator and evangelist of Object-Role Modeling (ORM), wrote a detailed paper about objectification in ORM without failing to mention the underlying philosophical questions. My (doubtful) philosophical intuition makes me think that properties are more problematic than classes, because the latter can easily be modeled as sets. I think the need for objectification, and for bringing together classes and properties with similar meaning, will increase the more “semantic” data we work with. In many natural languages you can use a verb or adjective as a noun by nominalization. The meaning may slightly change, but it is still very useful for communication. Maybe we should rely more on natural language instead of dreaming of defining without ambiguity?

Dublin Core conference 2008 started

23 September 2008 at 12:20, 2 comments

Yesterday the Dublin Core Conference 2008 (DC 2008) started in Berlin. The first day I spent with several Dublin Core tutorials and with running after my bag, which I had forgotten in the train. Luckily the train terminated in Berlin, so I only had to get to the other part of town to recover it! The rest of the day I attended the DC tutorials by Pete Johnston and Marcia Zeng (slides are online as PDF). The tutorials were sound but somehow got a bit lost between theory and practice (see Paul’s comment); I cannot tell the details, but there must be a way to better explain and summarize Dublin Core in short. The problem may lie in a fuzzy definition of Dublin Core: to my taste there are far too many “cans”, “shoulds”, and “mays” instead of formal “musts”. I would also put more stress on the importance of publishing stable URIs for everything and of using syntax schemas.

What really annoys me about DC is the low commitment of the Dublin Core community to RDF. RDF is not promoted as the base but only as one possible way to encode Dublin Core. In the same way you could have argued in the early 1990s that HTTP/HTML is just one framework to build on. That’s right, and of course RDF is not the final answer to metadata issues, but it is the state of the art for encoding structured data on the web. I wonder when the Dublin Core community lost its tight connection with the W3C/RDF community (which for its part was spoiled by the XML community). In official talks you don’t hear these hidden stories of the antipathies and self-interests in standardization.

The first keynote that I heard on day 2 was given by Jennifer Trant about results of steve.museum, one of the best projects analyzing tagging in real-world environments. Data, software and publications are available to build upon. The second talk, “Encoding Application Profiles in a Computational Model of the Crosswalk” by Carol Jean Godby (PDF slides), was interesting as well. In our library service center we deal a lot with translations (aka mappings, crosswalks, etc.) between metadata formats, so the crosswalk web service by OCLC and its description language may be of great use, if it is properly documented and supported. After this talk Maria Elisabete Catarino reported with “Relating Folksonomies with Dublin Core” (PDF slides) on a study of the purposes and usage of social tagging and whether/how tags could be encoded with DC terms.

On Friday we will hold a first Seminar on User-Generated Metadata with OpenStreetmap, Wikipedia, BibSonomy and The Open Library. I am looking forward to it!

P.S.: Pete Johnston’s slides on DC basic concepts are now also available at slideshare [via his blog]
