Who identifies the identifiers?

10 May 2009 at 16:39, 6 comments

A few days ago, after a short discussion on Twitter, Ross Singer posted a couple of open questions about identifiers for data formats on code4lib and other related mailing lists. He outlined the problem that several APIs like Jangle, unAPI, SRU, OpenURL, and OAI-PMH use different identifiers to specify the format of the data that is transported (MARC-XML, Dublin Core, MODS, BibTeX etc.). It is remarkable that all these APIs are more or less relevant only in the library sector, while the issue of data formats and their identifiers is also relevant in other areas. It looks like the ivory tower of library standards is still being built.

The problem Ross raised is that there is little coordination and each standard governs its own registry of data format identifiers. An unofficial registry for unAPI [archive] disappeared (that’s why I started the discussion), there is a registry for SRU, a registry for OpenURL, and a list for Jangle. In OAI-PMH and unAPI each service hosts its own list of formats; OAI-PMH also includes a method to map local identifiers to global identifiers.
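As a rough illustration of that OAI-PMH mapping: a `ListMetadataFormats` response pairs each local `metadataPrefix` with a schema URL and an XML namespace. The sketch below reads such a response with Python's standard library. The sample response is hand-written for illustration (not fetched from a real repository), and the function name `global_identifiers` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH 2.0 response envelope.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample of a ListMetadataFormats response (illustration only).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>marc21</metadataPrefix>
      <schema>http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd</schema>
      <metadataNamespace>http://www.loc.gov/MARC21/slim</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>
"""


def global_identifiers(xml_text):
    """Map each local metadataPrefix to its (schema URL, namespace URI) pair."""
    root = ET.fromstring(xml_text.strip())
    mapping = {}
    for fmt in root.iterfind(".//oai:metadataFormat", OAI_NS):
        prefix = fmt.findtext("oai:metadataPrefix", namespaces=OAI_NS)
        schema = fmt.findtext("oai:schema", namespaces=OAI_NS)
        namespace = fmt.findtext("oai:metadataNamespace", namespaces=OAI_NS)
        mapping[prefix] = (schema, namespace)
    return mapping


if __name__ == "__main__":
    for prefix, (schema, namespace) in global_identifiers(SAMPLE_RESPONSE).items():
        print(prefix, "->", namespace, "/", schema)
```

The prefix (`oai_dc`, `marc21`) is only meaningful within one repository, whereas the schema URL and namespace can be compared across services.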

On code4lib several arguments and suggestions were raised which almost provoked me to a rant on library standards in general (everyone wants to define but no one likes to implement and reuse. Why do librarians ignore W3C and IETF?). Identifiers for data formats should not be defined by the creators of transport protocols, nor do we need yet another über-registry. In my view the problem is less technical than social. As Douglas Campbell writes in Identifying the identifiers, one of the rare papers on identifier theory: it’s not a technology issue but a commitment issue.

First, there is a misconception about registries of data format identifiers. You should distinguish between descriptive registries, which only list identifiers and formats that are defined elsewhere, and authoritative registries, which define identifiers and formats. Yes: and formats. It makes no sense to define an identifier and say that it stands for data format X if you don’t provide a specification of format X (either via a schema or via a pointer to a schema). This already implies that the best actor to define a format identifier is the creator of the format itself.

Second, local identifiers that depend on context are always problematic. There is a well-established global identifier system called Uniform Resource Identifier (URI), and there is no excuse for not using URIs as identifiers other than incapability, dullness, laziness, or ignorance. The same reasons apply if you create a new identifier for a data format that already has one. One good thing about URIs is that you can always find out who was responsible for creating a given identifier: you start with the URI scheme and drill down through the namespaces and standards. I must admit that this process can be laborious, but at least it makes registries of identifiers descriptive for all identifiers except the ones in their own namespace.
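The drill-down can be sketched mechanically, at least as far as the string itself goes. The following Python snippet splits a URI into its scheme, its authority, and each successive path prefix. The function name `responsibility_chain` is hypothetical, and the code only decomposes the identifier; finding out who actually governs each level still means reading the relevant standards.

```python
from urllib.parse import urlsplit


def responsibility_chain(uri):
    """List the URI's scheme, authority, and successive path prefixes."""
    parts = urlsplit(uri)
    chain = [parts.scheme + ":"]
    if parts.netloc:
        # Hierarchical URI: walk from the authority down the path.
        base = f"{parts.scheme}://{parts.netloc}"
        chain.append(base)
        path = ""
        for segment in (s for s in parts.path.split("/") if s):
            path += "/" + segment
            chain.append(base + path)
    else:
        # No authority component: only the scheme can be separated generically.
        chain.append(uri)
    return chain


if __name__ == "__main__":
    for level in responsibility_chain("http://www.w3.org/2005/Atom"):
        print(level)
```

For the Atom namespace this yields the `http:` scheme, then `http://www.w3.org`, then `http://www.w3.org/2005`, and finally the full namespace URI.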

Third, you must be clear on the definition of a format. For instance, the local identifier „MARC“ does not refer to a single format but to many variants (USMARC, UNIMARC, MARC21…) and encodings (MARCXML/MARC21). This is not unusual if you consider that many formats are specializations of other formats. For instance, Atom (defined by RFC4287 and RFC5023, and identified either by its Mime Type „application/atom+xml“, which could be expressed as the URI http://www.iana.org/assignments/media-types/application/atom%2Bxml, or by its XML Namespace „http://www.w3.org/2005/Atom“)* is an extension of XML (specified in http://www.w3.org/TR/xml [XML 1.0] and http://www.w3.org/TR/xml11 [XML 1.1], and identified by these URLs or by the Mime Type „application/xml“, which is the URI http://www.iana.org/assignments/media-types/application/xml)*.
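To make the Mime Type URIs above concrete, here is a minimal Python sketch that builds them following the pattern used in this post. The function names `media_type_uri` and `format_identifier` are made up for illustration, and the preference for an existing namespace over the Mime Type URI is simply the rule of thumb from the footnote, not an established convention.

```python
from urllib.parse import quote

# URI prefix used in this post for Internet Media Types.
IANA_BASE = "http://www.iana.org/assignments/media-types/"


def media_type_uri(media_type):
    """Build the IANA-based URI for a media type such as 'application/atom+xml'."""
    top_level, _, subtype = media_type.strip().lower().partition("/")
    if not top_level or not subtype:
        raise ValueError(f"not a media type: {media_type!r}")
    # Percent-encode reserved characters such as '+' in each part.
    return IANA_BASE + quote(top_level, safe="") + "/" + quote(subtype, safe="")


def format_identifier(media_type, namespace=None):
    """Prefer an existing namespace URI; otherwise fall back to the media type URI."""
    return namespace if namespace else media_type_uri(media_type)


if __name__ == "__main__":
    print(media_type_uri("application/atom+xml"))
    print(format_identifier("application/atom+xml", "http://www.w3.org/2005/Atom"))
    print(format_identifier("application/xml"))
```

The only subtle step is the percent-encoding: the `+` in `atom+xml` becomes `%2B`, which matches the Atom URI given above.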

The problem of identifying the right identifiers for data formats can be reduced to two fundamental rules of thumb:

1. reuse: don’t create new identifiers for things that already have one.

2. document: if you have to create an identifier, describe its referent as openly, clearly, and in as much detail as possible to make it reusable.

If multiple identifiers exist for one thing, choose the one that is best documented and most widely adopted. There will always be multiple identifiers for the same thing – don’t make it worse.

*Footnote: The identification of Internet Media Types with URIs that start with http://www.iana.org/assignments/media-types/ is neither widely used nor documented well but it’s the most official URI form that I could find. If for a particular format there is a better identifier – like an XML or RDF namespace – then you should use that, but if there is nothing but a Mime Type then there is no reason to create a new URI on your own.



  1. „If for a particular format there is a better identifier – like an XML or RDF namespace – then you should use that“

    This is an unfortunately common misconception about the usage and meaning of XML namespaces. XML namespaces do NOT identify formats. As a demonstration of this, consider one namespace that defines two sets of elements which are never used together. Yes this is very poor design, but it is quite possible. Also consider MODS, which defines two top level elements, MODS and MODSCollection. If you used the namespace to ‚identify‘ the MODS format, you would not know whether you identified a collection or a single MODS instance. Finally, consider a namespace with many thousands of elements defined in it. Even if the top level tag were the same, the utility of such a ‚format‘ is minimal.

    Therefore, there must be a second identifier which is neither the schema location (as this is not unique) nor the XML namespace. As IETF and W3C have NO interest in this sort of thing, standards are forced to build their own registries.

    Comment by Rob Sanderson, 11 May 2009

  2. An XML namespace identifies the set of all elements that are defined in this namespace. If the creator of a format did not define another identifier for the format but provided one or more XML schemas with a common XML namespace, then you should reuse this namespace as the format identifier instead of inventing something on your own. Of course the namespace does not identify your favorite undocumented subset of elements but all elements in the namespace – but it’s still a format. If this format does not suit your needs, you don’t need another identifier but another format (which of course can be a subset of an existing format). If you need a MODS variant that excludes modsCollection, then you should ask the LOC whether they can clarify the use of URI fragment identifiers for subsets of an XML Schema, so that http://www.loc.gov/standards/mods/v3#mods could identify the MODS variant without modsCollection (this is common practice at least in RDF Schemas and OWL).

    Comment by Jakob, 11 May 2009

  3. Actually an even better example of „namespace does not indicate format“ would be Dublin Core, which is about as descriptive as your analogy about „MARC“.

    For many, many developers, the „surprise“ at getting back a response to a request for Dublin Core in, say, application/x-turtle would be noticeable.

    I like your suggestion of the IANA registry for formats without namespaces, but how would it deal with a situation like application/x-foobar when it gets approved as application/foobaz?

    Comment by Ross, 11 May 2009

  4. Ok, I got the point: The Dublin Core namespace for the core element set is http://purl.org/dc/elements/1.1/. This does not identify a concrete encoding schema (RDF/XML, RDF/Turtle, DCSV etc.) but an abstract data model. Luckily there must be 1-to-1 mappings between the encoding schemas, so they are interchangeable. In practice this is solved with HTTP Accept headers, so it’s less of a problem, but in general you are right about this. However, I fear that the ambiguity of encoding schemas in contrast to abstract data models cannot be solved once and for all, because data exchange always relies on some implicit context. On a higher level, encoding schemas are also abstract models. It’s an interesting issue, and I will think more about it. I don’t really understand the question about application/x-foobar and IANA.

    Comment by jakob, 11 May 2009

  5. Right, ok, so take my above example with application/x-turtle. So the Wikipedia article on Turtle makes the claim:

    „The mime type of Turtle is application/x-turtle (if registered, application/turtle will be sought).“

    This of course means they may not have their wish granted (in this case it’s unlikely, but I suppose anything can happen during the RFC process). What if IANA decides, instead, that the mime type should be application/rdf+turtle (honestly, I don’t know why it wouldn’t be that anyway)? What happens to all of the resources that have described themselves as http://www.iana.org/assignments/media-types/application/x-turtle?

    Also, going back to the DC example, you may not always have HTTP headers to glean the content-type from (again, taking the „other format as data transport“ approach: Atom, METS, SRU, OpenURL, etc.).

    Although maybe including the serialization as a mandatory attribute is overloading the role of the identifier.

    Comment by Ross, 11 May 2009

  6. […] one in a machine-readable format. But I wouldn’t like to invent my own format when there are dozens to choose from. Could anybody point me to a (preferably semantic web-compatible) format suitable […]

    Pingback by Reinouts’ Nerdy Notes » Bibliographic metadata formats, 4 February 2010
