XML Schema vs. Library APIs (OAI-PMH/SRU/unAPI…)

24 February 2011, 18:33, 2 comments

Much of our work at the GBV library network has to do with record formats and APIs. We harvest or receive metadata records in a wide range of formats (with many different interpretations and misconstructions of these formats), convert records to a wide range of formats (with many special requests on how to interpret these formats), and provide records through various APIs. Some of these APIs allow you to select different record formats, for instance OAI-PMH (first published 2001), SRU (2003), and unAPI (2006). These APIs are based on HTTP for transport and XML for encoding of the records. There are also older APIs and encoding formats like Z39.50 and newer APIs like pure Linked Data and SPARQL for RDF. unAPI also supports non-XML formats, but in this article I will concentrate on XML-based formats.

The basic question (that I have been dealing with for years) is „what exactly is a format and how do you refer to it?“. All three APIs provide a method for listing all formats that are supported by a particular server. unAPI provides a „list of object formats“. Each format has a „name“, a „type“ (which must be an official Internet media type), and an optional documentation URL („docs“), which may refer to some human-readable documentation or to an XML Schema (XSD) file. Here are four examples:

<format name="oai_dc" type="application/xml"
<format name="pubmed" type="application/xml" 
<format name="mods" type="application/xml"
<format name="marcxml" type="application/xml" 
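
As a rough illustration of how a client sees such a list, the following Python sketch parses a made-up unAPI format list. The sample response and its „docs“ URLs under example.org are assumptions for illustration only, not taken from a real server, and the check on the file extension is merely a heuristic: without fetching the URL, a client cannot know what „docs“ points at.

```python
import xml.etree.ElementTree as ET

# Hypothetical unAPI "list of object formats"; the docs URLs are invented.
SAMPLE = """<formats>
  <format name="oai_dc" type="application/xml" docs="http://example.org/schemas/oai_dc.xsd"/>
  <format name="mods" type="application/xml" docs="http://example.org/docs/mods.html"/>
  <format name="marcxml" type="application/xml"/>
</formats>"""


def list_formats(xml_text):
    """Return (name, type, docs) tuples; docs is None if the attribute is absent."""
    root = ET.fromstring(xml_text)
    return [(f.get("name"), f.get("type"), f.get("docs")) for f in root.iter("format")]


def guess_docs_kind(docs):
    """Heuristic only: guess from the URL whether docs might be an XML Schema."""
    if docs is None:
        return "none"
    return "schema?" if docs.lower().endswith(".xsd") else "unknown"


formats = list_formats(SAMPLE)
for name, media_type, docs in formats:
    print(name, media_type, guess_docs_kind(docs))
```

That the client has to guess at all is exactly the ambiguity of the „docs“ attribute.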

To avoid the uncertainty whether „docs“ references a formal schema or a plain document, there should have been a „schema“ attribute (first problem). To refer to a format in an unAPI request, you use the format’s „name“.

In OAI-PMH you refer to a format by its „metadataPrefix“. You can get a list of supported formats with the ListMetadataFormats request. In addition to the „metadataPrefix“ each format has the location of an XML Schema („schema“) and an XML Namespace URI („metadataNamespace“). In theory the latter is dispensable, because each XSD document declares a namespace URI in its „targetNamespace“ attribute. Given a format with a schema that defines namespace „http://example.org/“ like this:

<xs:schema targetNamespace="http://example.org/">

I would expect records in this format to use this namespace, at least for the XML root element:

<record xmlns="http://example.org/">

The OAI-PMH specification does not explicitly say that the „metadataNamespace“ must match the namespace declared in the schema file referenced by „schema“. What does it mean if they differ (second problem)?
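
Such a mismatch can at least be detected mechanically. The sketch below is a minimal example under stated assumptions: the ListMetadataFormats response and the schema document are invented (example.org), and the schema text is kept in a dictionary instead of being downloaded. The function name `namespace_mismatches` is hypothetical as well.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical ListMetadataFormats response with a deliberately wrong namespace.
RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>example</metadataPrefix>
      <schema>http://example.org/schema.xsd</schema>
      <metadataNamespace>http://example.org/other/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""

# Stand-in for fetching the schema files: schema location -> XSD text.
SCHEMAS = {
    "http://example.org/schema.xsd": (
        '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" '
        'targetNamespace="http://example.org/"/>'
    ),
}


def target_namespace(xsd_text):
    """Read the targetNamespace attribute from the root of an XSD document."""
    return ET.fromstring(xsd_text).get("targetNamespace")


def namespace_mismatches(response_text, schemas):
    """List (prefix, declared, actual) where metadataNamespace != targetNamespace."""
    mismatches = []
    for fmt in ET.fromstring(response_text).iter(OAI + "metadataFormat"):
        prefix = fmt.findtext(OAI + "metadataPrefix")
        location = fmt.findtext(OAI + "schema")
        declared = fmt.findtext(OAI + "metadataNamespace")
        actual = target_namespace(schemas[location])
        if declared != actual:
            mismatches.append((prefix, declared, actual))
    return mismatches


mismatches = namespace_mismatches(RESPONSE, SCHEMAS)
print(mismatches)
```

The check tells you that the two values differ, but not which of them the server actually means.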

In SRU a format is known as „schema“. A list of supported formats is contained in the response to an explain request. Each schema has an optional „title“, a „name“ (used to refer to schemas in the „recordSchema“ HTTP parameter when doing a search query), an „identifier“, and an optional „location“. The „identifier“ contains an additional URI, and the „location“ contains a link to an XML Schema file or to some human-readable documentation (like the „docs“ attribute in unAPI). There is a list of known schemas at the SRU page, for instance:

title and location       name     identifier
MODS Schema Version 3.0  mods     info:srw/schema/1/mods-v3.0
MODS Schema Version 3.3  mods     info:srw/schema/1/mods-v3.3
MARCXML                  marcxml  info:srw/schema/1/marcxml-v1.1

Note that one name (for instance „mods“) can refer to several schemas, but one particular SRU server can only provide one particular format under this name. The additional identifier does not refer to a particular XML Schema either (third problem). The identifier may only give a hint about which particular version or interpretation of a format is provided.
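
The ambiguity of the name can be shown with the three rows listed above. This is only a small sketch: the rows are copied from the list of known schemas, while the helper `identifiers_by_name` is a hypothetical name introduced for illustration.

```python
from collections import defaultdict

# (title, name, identifier) rows as listed above.
KNOWN_SCHEMAS = [
    ("MODS Schema Version 3.0", "mods", "info:srw/schema/1/mods-v3.0"),
    ("MODS Schema Version 3.3", "mods", "info:srw/schema/1/mods-v3.3"),
    ("MARCXML", "marcxml", "info:srw/schema/1/marcxml-v1.1"),
]


def identifiers_by_name(rows):
    """Group schema identifiers by the short name used in recordSchema."""
    index = defaultdict(list)
    for _title, name, identifier in rows:
        index[name].append(identifier)
    return dict(index)


index = identifiers_by_name(KNOWN_SCHEMAS)
# Names that stand for more than one schema identifier.
ambiguous = sorted(name for name, ids in index.items() if len(ids) > 1)
print(ambiguous)
```

A client that asks for recordSchema=mods therefore cannot tell from the name alone which of the two MODS versions it will get.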

Does anyone really need these diverse methods to refer to formats? I found that in practice you cannot rely on the claimed format anyway, unless you can automatically validate it. That’s what XML Schema can be used for. I don’t say that XML Schema is the best or only method to formally describe an XML-based format (personally I like RELAX NG much better), but if there is an XML Schema, shouldn’t this schema be enough to identify the format? Is there really a need for four independent identifiers to refer to an XML-based format? In the worst case we have:

  • Schema Name (e.g. mods)
  • Schema Location (e.g. http://www.loc.gov/standards/mods/v3/mods-3-3.xsd)
  • Schema Identifier (e.g. info:srw/schema/1/mods-v3.3)
  • Schema Namespace (e.g. http://www.loc.gov/mods/v3)

This is bad design, because you cannot say which of the four is the right one and how they relate to each other. A clean solution would only have two identifiers for XML-based formats:

  • The local name, which is only unique for a particular API and a particular server
  • The global schema Location, which is a cool URI that resolves to an XML Schema file.

The Schema Namespace is included as „targetNamespace“ in the XML Schema, and the Schema Identifier is a delusion anyway. Either you can identify a format by a formal schema (that can also be used to validate records) or you just cannot guarantee which format your records will be in. Sure, you can give some hints by linking to documentation, examples, and guidelines. But adding more identifiers only fakes control. You are still allowed to provide more specific formats, variants, application profiles, and interpretations under different names. But these formats do not become clearer or more usable if you give them a „Schema Identifier“. Does anyone use SRU’s Schema Identifiers anyway? I think for XML we are better off with XML Schemas from which the XML namespaces can be extracted. An application can identify a format by its schema location, by the XML namespace, and/or by other information contained in the schema. Additional pointers to human-readable documentation are great. But don’t confuse description with identification if you need to refer to a data format.
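
To make the two-identifier idea a bit more concrete, here is a minimal sketch, not a finished design. It assumes that a format is described only by its local name and its schema location, derives the namespace from the schema, and matches a record by the namespace of its root element. The class `Format`, the helper names, and the example.org schema are hypothetical; the schema text is stored in a dictionary instead of being fetched.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Format:
    name: str             # local name, unique only per API and server
    schema_location: str  # global identifier: URI of the XML Schema file


# Stand-in for resolving schema locations: location -> XSD text (invented).
SCHEMA_TEXTS = {
    "http://example.org/schema.xsd": (
        '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" '
        'targetNamespace="http://example.org/"/>'
    ),
}


def schema_namespace(fmt):
    """The namespace is not stored separately but read from the schema."""
    return ET.fromstring(SCHEMA_TEXTS[fmt.schema_location]).get("targetNamespace")


def root_namespace(record_text):
    """Namespace URI of the record's root element, or None if it has none."""
    tag = ET.fromstring(record_text).tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else None


def matching_formats(record_text, formats):
    """Formats whose schema targetNamespace equals the record's root namespace."""
    namespace = root_namespace(record_text)
    return [f for f in formats if schema_namespace(f) == namespace]


formats = [Format("example", "http://example.org/schema.xsd")]
matches = matching_formats('<record xmlns="http://example.org/"/>', formats)
print([f.name for f in matches])
```

Matching by namespace is only a first filter; several schemas can share one namespace, so validating the record against the schema is what finally confirms the format.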

P.S. On the Code4lib mailing list Rob Sanderson pointed to a discussion we had about the same topic in 2009, and one of my earlier postings on XML4Lib also deals with SRU and namespaces.



  1. Jakob,

    Several of the OCLC Web Services use the SRU schemas, most notably the WorldCat Search API and Terminology Services. This allows a service to return records in different formats. The response is always at its heart an SRU response, but what type of records are retrieved changes based on the recordSchema parameter in SRU. OCLC creates its own Schema Identifiers for this or refers to existing ones.

    For me the whole concept of format is really thorny because it can encompass lots of things. It can refer to a metadata schema (MODS, MARC, SKOS) or a serialization (XML, JSON).

    Furthermore you can have nested „formats“. For example, you can technically have an Atom feed with MODS or MARCXML embedded in it.

    RDF is a great example of both of these. When you say you’re getting back RDF you can be getting multiple metadata formats. Plus it can be serialized different ways.

    I’m not sure what the best solution to this is; I just know it’s extremely complex.

    Comment by Karen Coombs, 24 February 2011

  2. Yes, the concept of a „format“ is difficult. Most formats are based on a nested chain of specifications, e.g. your specific application of MODS 3.3 as a subset of MODS 3.3 as a special case of MODS as an application of XML based on a string of Unicode characters. There is often confusion about which level you actually identify with a format identifier. Basically each particular SRU server provides its particular non-standard formats. Only these local formats are identified by the „recordSchema“ parameter. The SRU explain response contains additional information about a lower level of the format. But only the XML Schema is of use to identify an underlying format. In contrast the SRU identifier (info:srw/…) does not hold or link to any relevant information about a format. It does not give you more than the local format name, so it is a simulation of an identifier. But maybe these identifiers were only meant to have local meaning anyway? Maybe the SRU schema identifier is meant for looking up the information contained in this table. But then it’s an open question what to do if the SRU explain response contains different information than the authoritative table. This problem is similar to the „metadataNamespace“ element in OAI-PMH.

    Comment by jakob, 24 February 2011
