Mein letzter längerer Artikel für das Lexikon der Bibliotheks- und Informationswissenschaft (LBI), dessen letzter Band dieses Jahr erscheinen wird, beschreibt die Extensible Markup Language (XML). Wie bei den anderen Artikeln (vgl. Ontologie und Ontologiesprache sowie Metadaten) besteht die Kunst darin, sich auf das wesentliche zu Beschränken und sinnvoll mit den bereits festgelegten Artikeln zu verlinken.
Much of our work at GBV library network has to do with record formats and APIs. We harvest or get metadata records in a wide range of formats (with many different interpretations and misconstructions of these formats), convert records to a wide range of formats (with many special request how to interpret this formats), and provide records through various APIs. Some of these APIs allow you to select different record formats, for instance OAI-PMH (first published 2001), SRU (2003), and unAPI (2006). These APIs are based on HTTP for transport and XML for encoding of the records. There are also older APIs and encoding formats like Z39.50 and newer APIs like pure Linked Data and SPARQL for RDF. unAPI also supports non-XML formats, but in this article I will concentrate on XML-based formats.
The basic question (that I deal with since years) is „what exactely is a format and how do you refer to it?“. All three APIs provide a method for listing of all formats that are supported by a particular server. unAPI provides a „list of object formats“. Each format has a „name“, a „type“ (which must be an official Internet media type), and an optional documentation URL („docs“), which may refer to some human-readable documentation, or to an XML Schema (XSD) file. Here are three examples:
<format name="oai_dc" type="application/xml" docs="http://www.openarchives.org/OAI/2.0/oai_dc.xsd" /> <format name="pubmed" type="application/xml" docs="http://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html" /> <format name="mods" type="application/xml" docs="http://www.loc.gov/standards/mods/" /> <format name="marcxml" type="application/xml" docs="http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd" />
To avoid the uncertainty whether „docs“ references a formal schema or a plain document, there should have been a „schema“ attribute (first problem). To refer to a format in an unAPI request, you use the format’s „name“. In OAI-PMH you refer to a format by its „metadataPrefix“. You can get a list of supported formats with the ListMetadataFormats request. In addition to the „metadataPrefix“ each format has the location of an XML Schema („schema“) and an XML Namespace URI („metadataNamespace“). In theory the latter is dispensable, because each XSD document declares a namespace URI in its „targetNamespace“ attribute: Given a format with a schema that defines namespace „http://example.org/“ like this
I would expect records in this format to use this namespace, at least for the XML root element:
The OAI-PMH specification does not explicitly say that the „metadataNamespace“ must match the namespace in the schema file „schema“. What does it mean if they differ? (second problem).
In SRU a format is known as „schema“. A list of supported formats is contained in an explain request. Each schema has an optional „title“, a „name“ (used to refer to schemas in the „recordSchema“ HTTP parameter when doing a search query), an „identifier“, and an optional „location“. The „identifier“ contains an additional URI, and the „location“ contains a link to an XML Schema file or to some human-readable documentation (like the „docs“ attribute in unAPI). There is a list of known schemas at the SRU page, for instance:
|title and location||name||identifier|
|MODS Schema Version 3.0||mods||info:srw/schema/1/mods-v3.0|
|MODS Schema Version 3.3||mods||info:srw/schema/1/mods-v3.3|
Note that one name (for instance „mods“) can refer to several schemas, but one particular SRU server can only provide one particular format under this name. The additional identifier neither refers to a particular XML Schema (Third problem). The identifier may only give a hint which particular version or interpretation of a format is provided.
Does anyone really need this diverse methods to refer to formats? I found in practice you cannot rely on the claimed format anyway, unless you can automatically validate it. That’s what XML Schema can be used for. I don’t say that XML Schema is the best or only method to formally describe an XML-based format (personally I much bettter like RELAX NG), but if there is an XML Schema – shouldn’t this schema be enough to identify the format?. Is there really a need of four independent identifiers to refer to an XML-based format? In the worst case we have:
- Schema Name (e.g. mods)
- Schema Location (e.g. http://www.loc.gov/standards/mods/v3/mods-3-3.xsd)
- Schema Identifier (e.g. info:srw/schema/1/mods-v3.3)
- Schema Namespace (e.g. http://www.loc.gov/mods/v3)
This is bad design, because you cannot say which of the four is the right one and how they relate to each other. A clean solution would only have two identifiers for XML-based formats:
- The local name, which is only unique for a particular API and a particular server
- The global schema Location, which is a cool URI that resolves to an XML Schema file.
The Schema Namespace is included as „targetNamespace“ in the XML Schema, and the Schema Identifier is delusion anyway. Either you can identify a format by a formal schema (that can also be used to validate records) or you just cannot guarantee which format your records will be in. Sure you can give some hints by linking to documentations, examples, and guidelines. But adding more identifiers is a fakery of control. You are still allowed to provide more specific formats, variants, application profiles, and interpretations under different names. But these formats don’t get more clear or usable if you give them a „Schema Identifier“. Does anyone uses SRU’s Schema Identifiers anyway? I think for XML we can better live with XML Schemas that the XML namespaces can be extracted from. An application can identify a format by its schema location, by the XML namespace, and/or by other information contained in the schema. Additional pointers to human-readable documentation are great. But don’t confuse description with identification if you need to refer to a data format.