XML Schema vs. Library APIs (OAI-PMH/SRU/unAPI…)
24 February 2011 at 18:33 (2 comments)

Much of our work at the GBV library network has to do with record formats and APIs. We harvest or receive metadata records in a wide range of formats (with many different interpretations and misconstructions of these formats), convert records to a wide range of formats (with many special requests on how to interpret these formats), and provide records through various APIs. Some of these APIs allow you to select different record formats, for instance OAI-PMH (first published 2001), SRU (2003), and unAPI (2006). These APIs are based on HTTP for transport and XML for encoding the records. There are also older APIs and encoding formats like Z39.50, and newer APIs like pure Linked Data and SPARQL for RDF. unAPI also supports non-XML formats, but in this article I will concentrate on XML-based formats.
The basic question (one I have been dealing with for years) is „what exactly is a format and how do you refer to it?“. All three APIs provide a method for listing all formats supported by a particular server. unAPI provides a „list of object formats“. Each format has a „name“, a „type“ (which must be an official Internet media type), and an optional documentation URL („docs“), which may refer to some human-readable documentation or to an XML Schema (XSD) file. Here are some examples:
<format name="oai_dc" type="application/xml" docs="http://www.openarchives.org/OAI/2.0/oai_dc.xsd" /> <format name="pubmed" type="application/xml" docs="http://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html" /> <format name="mods" type="application/xml" docs="http://www.loc.gov/standards/mods/" /> <format name="marcxml" type="application/xml" docs="http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd" />
To avoid the uncertainty of whether „docs“ references a formal schema or a plain document, there should have been a separate „schema“ attribute (first problem). To refer to a format in an unAPI request, you use the format’s „name“. In OAI-PMH you refer to a format by its „metadataPrefix“. You can get a list of supported formats with the ListMetadataFormats request. In addition to the „metadataPrefix“, each format has the location of an XML Schema („schema“) and an XML Namespace URI („metadataNamespace“). In theory the latter is dispensable, because each XSD document declares a namespace URI in its „targetNamespace“ attribute: given a format with a schema that defines the namespace „http://example.org/“ like this
<xs:schema targetNamespace="http://example.org/">
I would expect records in this format to use this namespace, at least for the XML root element:
<record xmlns="http://example.org/">
The OAI-PMH specification does not explicitly say that the „metadataNamespace“ must match the targetNamespace declared in the schema file referenced by „schema“. What does it mean if they differ? (second problem)
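How often the two actually differ can be checked automatically. Here is a minimal Perl sketch, assuming a reachable OAI-PMH endpoint (the base URL is a placeholder, error handling is omitted), that compares each „metadataNamespace“ with the „targetNamespace“ of the referenced schema:

```perl
# Sketch: compare metadataNamespace with the schema's targetNamespace.
# The endpoint URL is a placeholder; error handling is omitted.
use strict;
use warnings;
use LWP::Simple qw(get);
use XML::LibXML;

my $base = 'http://example.org/oai';   # hypothetical OAI-PMH base URL
my $doc  = XML::LibXML->load_xml( string => get("$base?verb=ListMetadataFormats") );
my $xpc  = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( oai => 'http://www.openarchives.org/OAI/2.0/' );

foreach my $f ( $xpc->findnodes('//oai:metadataFormat') ) {
    my $prefix = $xpc->findvalue( 'oai:metadataPrefix',    $f );
    my $ns     = $xpc->findvalue( 'oai:metadataNamespace', $f );
    my $schema = $xpc->findvalue( 'oai:schema',            $f );
    my $xsd    = XML::LibXML->load_xml( string => get($schema) );
    my $target = $xsd->documentElement->getAttribute('targetNamespace') // '';
    print $prefix, ( $ns eq $target ? ": ok\n" : ": mismatch ($ns vs. $target)\n" );
}
```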
In SRU a format is known as a „schema“. A list of supported formats is contained in the response to an explain request. Each schema has an optional „title“, a „name“ (used to refer to the schema in the „recordSchema“ HTTP parameter when doing a search query), an „identifier“, and an optional „location“. The „identifier“ contains an additional URI, and the „location“ contains a link to an XML Schema file or to some human-readable documentation (like the „docs“ attribute in unAPI). There is a list of known schemas at the SRU page, for instance:
| title and location | name | identifier |
|---|---|---|
| MODS Schema Version 3.0 | mods | info:srw/schema/1/mods-v3.0 |
| MODS Schema Version 3.3 | mods | info:srw/schema/1/mods-v3.3 |
| MARCXML | marcxml | info:srw/schema/1/marcxml-v1.1 |
Note that one name (for instance „mods“) can refer to several schemas, but one particular SRU server can only provide one particular format under this name. Nor does the additional identifier refer to a particular XML Schema (third problem). The identifier only gives a hint as to which particular version or interpretation of a format is provided.
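The corresponding check for SRU would read the explain response. A short sketch, assuming the usual ZeeRex explain document and a placeholder base URL, that lists the schemas a server claims to support:

```perl
# Sketch: list the schemas announced in an SRU explain response.
# Base URL is a placeholder; the ZeeRex namespace is an assumption.
use strict;
use warnings;
use LWP::Simple qw(get);
use XML::LibXML;

my $doc = XML::LibXML->load_xml(
    string => get('http://example.org/sru?operation=explain&version=1.1')
);
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( zr => 'http://explain.z3950.org/dtd/2.0/' );

foreach my $s ( $xpc->findnodes('//zr:schemaInfo/zr:schema') ) {
    printf "%s\t%s\t%s\n",
        $s->getAttribute('name')       // '',
        $s->getAttribute('identifier') // '',
        $s->getAttribute('location')   // '';
}
```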
Does anyone really need these diverse methods to refer to formats? I found that in practice you cannot rely on the claimed format anyway, unless you can automatically validate it. That is what XML Schema can be used for. I don’t say that XML Schema is the best or only method to formally describe an XML-based format (personally I much prefer RELAX NG), but if there is an XML Schema, shouldn’t this schema be enough to identify the format? Is there really a need for four independent identifiers to refer to an XML-based format? In the worst case we have:
- Schema Name (e.g. mods)
- Schema Location (e.g. http://www.loc.gov/standards/mods/v3/mods-3-3.xsd)
- Schema Identifier (e.g. info:srw/schema/1/mods-v3.3)
- Schema Namespace (e.g. http://www.loc.gov/mods/v3)
This is bad design, because you cannot say which of the four is the right one and how they relate to each other. A clean solution would only have two identifiers for XML-based formats:
- The local name, which is only unique for a particular API and a particular server
- The global schema Location, which is a cool URI that resolves to an XML Schema file.
The Schema Namespace is included as „targetNamespace“ in the XML Schema, and the Schema Identifier is a delusion anyway. Either you can identify a format by a formal schema (that can also be used to validate records) or you simply cannot guarantee which format your records will be in. Sure, you can give some hints by linking to documentation, examples, and guidelines, but adding more identifiers only feigns control. You are still allowed to provide more specific formats, variants, application profiles, and interpretations under different names. These formats do not become clearer or more usable just because you give them a „Schema Identifier“. Does anyone use SRU’s Schema Identifiers anyway? I think for XML we are better off with XML Schemas from which the XML namespaces can be extracted. An application can identify a format by its schema location, by the XML namespace, and/or by other information contained in the schema. Additional pointers to human-readable documentation are great. But don’t confuse description with identification if you need to refer to a data format.
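To illustrate the point about validation, here is a minimal Perl sketch, assuming a local record file, that reads a record, reports its root namespace, and validates it against one of the schema locations listed above:

```perl
# Sketch: identify a record by its root namespace and validate it
# against a known schema location. The record file name is hypothetical.
use strict;
use warnings;
use XML::LibXML;

my $record = XML::LibXML->load_xml( location => 'record.xml' );   # hypothetical record file
my $ns     = $record->documentElement->namespaceURI // '';
print "root namespace: $ns\n";

my $schema = XML::LibXML::Schema->new(
    location => 'http://www.loc.gov/standards/mods/v3/mods-3-3.xsd'
);
eval { $schema->validate($record) };                               # dies if the record is invalid
print $@ ? "invalid: $@" : "valid MODS 3.3\n";
```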
P.S. On the Code4Lib mailing list Rob Sanderson pointed to a discussion we had about the same topic in 2009, and one of my earlier postings on XML4Lib also deals with SRU and namespaces.
Named Entity Recognition with DBPedia
15 February 2011 at 14:55 (5 comments)

Yesterday the DBpedia team released DBpedia Spotlight, a named entity recognition service based on structured data extracted from Wikipedia. You can access the service via a web API or download the software as open source. I could not resist feeding Spotlight its own description:
DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. Text annotation has the potential of enhancing a wide range of applications including search, faceted browsing and navigation. By connecting text documents with DBpedia, our system enables a range of interesting use cases. For instance, the ontology can be used as background knowledge to display complementary information on web pages or to enhance information retrieval tasks. Moreover, faceted browsing over documents and customization of web feeds based on semantics become feasible. Finally, by following links from DBpedia into other data sources, the Linked Open Data cloud is pulled closer to the Web of Documents.
Pretty cool, isn’t it? Natural Language Processing (NLP) for information extraction seems to be the next hype after Web 2.0 and the Semantic Web. I don’t dispute the innovative capabilities of DBpedia Spotlight and similar tools, but you should never forget that these are just tools, which won’t automatically solve information problems or replace all other tools. Given the example above, there is little chance that an automatic system will extract the exact topic of the text for you (for instance „named entity recognition based on data extracted from Wikipedia“), because this requires a lot of background knowledge combining domain-specific expertise with common sense. By the way: as long as both Wikipedia and NLP software are mainly written by white males, the results will always mirror a limited world-view.
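If you want to try this yourself, here is a rough sketch of a call to the Spotlight annotation API. The endpoint URL, parameters, and JSON field names are assumptions from memory and may differ from the current service, so treat it as a sketch only:

```perl
# Rough sketch of calling the Spotlight web API. Endpoint URL, parameters,
# and JSON field names are assumptions and may have changed.
use strict;
use warnings;
use LWP::UserAgent;
use URI;
use JSON qw(decode_json);

my $uri = URI->new('http://spotlight.dbpedia.org/rest/annotate');   # assumed endpoint
$uri->query_form(
    text       => 'DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text.',
    confidence => 0.2,
);

my $res  = LWP::UserAgent->new->get( $uri, Accept => 'application/json' );
my $data = decode_json( $res->decoded_content );

# Field names '@surfaceForm' and '@URI' are assumptions about the response.
printf "%s -> %s\n", $_->{'@surfaceForm'}, $_->{'@URI'}
    for @{ $data->{Resources} || [] };
```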
You can compare the results of Spotlight with those of similar open services.
I found little overlap between the different services. Spotlight seems to provide more results (depending on the text) at an error rate between 10% and 30%. You could use such tools for automatic subject indexing based on abstracts and use the result at least for ranking. Unfortunately, in library metadata we often have no full text or abstract to annotate. Furthermore, many library entities have no DBpedia entry, whereas catalogers create new authority records if needed. What do you think named entity recognition and other NLP techniques can be used for in metadata land? Can we give up controlled subject indexing in libraries in favour of automatic NLP-based indexing on the one side and social tagging on the other? Or is there room for all of these approaches, and how can you successfully combine them?
Linked Data/RDF/SPARQL Documentation Challenge
23 January 2011 at 13:49 (6 comments)

In a blog posting earlier in January, Richard Pop described the difficulty of getting started with Linked Data for the average programmer: „there is a new format (RDF), new query language (SPARQL) and new code libraries to learn and install (if you can find them in the first place)“. For this reason Pop created a challenge to come up with documentation for the basic steps of developing a Linked Data/RDF/SPARQL application. You can find contributions by Bill Roberts, Christopher Gutteridge, Pezholio, Gunnar Aastrand Grimnes, Tom Morris, Jeni Tennison (and here), Niklas Lindström, Felix Ostrowski, and John Goodwin. Meanwhile there are £50 prizes for solutions in PHP, Ruby, and Python.
I have been engaged with RDF for years, but mostly from a theoretical point of view. This year I started creating Linked Data applications with RDF::Trine, a powerful programming library by Gregory Williams for handling RDF data in Perl (see more at perlrdf.org). I contributed some minor patches to further simplify its usage. Next month I will write a tutorial on how to create an RDF-based application in Perl. Meanwhile I am working on a Perl module (limited preview) to create classifications and thesauri in SKOS/RDF.
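As a small appetizer for the tutorial, here is a minimal sketch of a typical RDF::Trine workflow, with made-up example data: parse some Turtle into a model and serialize the model again as RDF/XML.

```perl
# Minimal RDF::Trine sketch: parse Turtle into a model, serialize as RDF/XML.
use strict;
use warnings;
use RDF::Trine;

my $turtle = '@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://example.org/work> dc:creator "Jane Doe" .';          # made-up example data

my $model  = RDF::Trine::Model->temporary_model;
my $parser = RDF::Trine::Parser->new('turtle');
$parser->parse_into_model( 'http://example.org/', $turtle, $model );

my $serializer = RDF::Trine::Serializer->new('rdfxml');
print $serializer->serialize_model_to_string($model);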
Data is not meaning – but a web badge
6 January 2011 at 00:57 (3 comments)

I am sure that Douglas Adams and John Lloyd had a word for it: you know exactly what you mean, but not what to call it. Recently I tried to find information about a particular kind of „web banner“, „button“, or „badge“ with a specific size, border, and two parts. I finally found out that it is an 80×15 web badge as introduced by Antipixel in 2002. A helpful description of the format is given by ZwahlenDesign, who also points to two online badge creation tools: Brilliant Button Maker and Button Maker. Note that the tools use „Button“ instead of „Badge“ to refer to the same thing.
I created a web badge to promote a simple philosophical web standard: data is not meaning.* Here is the data as a 177-byte hexdump:
89 50 4E 47 0D 0A 1A 0A 00 00 00 0D 49 48 44 52 00 00 00 50 00 00 00 0F 01 03 00 00 00 49 07 DA CC 00 00 00 01 73 52 47 42 00 AE CE 1C E9 00 00 00 06 50 4C 54 45 FF FF FF 00 00 00 55 C2 D3 7E 00 00 00 59 49 44 41 54 08 D7 63 F8 0F 07 0C 0D 0C 50 C0 C8 B0 FF FF FF 0F D8 99 0D 10 E6 8E CF 7D 05 2D 7E 86 42 2E 85 0C BB 73 EF 6E 7E 76 C2 73 52 4A 23 C3 EE C4 3B 06 AD 7E 95 41 21 1B C1 A2 4F 3C 3C 8D 7C 26 32 EC 78 7B 77 43 8B 9F A7 90 4B 22 B2 09 D8 AD 40 72 03 C2 65 00 CA 67 45 A7 86 69 B7 81 00 00 00 00 49 45 4E 44 AE 42 60 82
If data were meaning, that would be all there is to say. But data is just a stream of bits, bytes, numbers, characters, strings, nodes, triples, or sometimes even words. You have to make use of it in a meaningful way. For instance, you could give the data above to a specific piece of software like your web browser. Here comes the data again:
This PNG image was the smallest I was able to create with an optimized palette, compression etc. There is another piece of data, only eight bits more (178 bytes), that looks the same as a GIF image:
I could also express the monochrome badge with one bit per pixel. That makes 80×15 = 1200 bits = 150 bytes uncompressed. The meaning could be the same, but not when looking only at your browser (because this piece of software cannot handle my „ad-hoc monochrome 80×15 web badge format“).
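Making use of the data is a small exercise in itself: the following Perl sketch decodes the hexdump above back into a binary PNG file that any browser can display (the string below only hints at the full hexdump):

```perl
# Sketch: turn the hexdump above back into a PNG file a browser can display.
use strict;
use warnings;

my $hex = '89 50 4E 47 0D 0A 1A 0A ...';         # paste the full 177-byte hexdump here
( my $bytes = $hex ) =~ s/\s+//g;                # strip whitespace between hex pairs

open my $fh, '>:raw', 'badge.png' or die $!;
print {$fh} pack( 'H*', $bytes );                # decode hex pairs to raw bytes
close $fh;
```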
I also created a version with color. Feel free to use and modify as you like. In this case the PNG with 198 bytes is slightly larger than the GIF with 196 bytes.
PNG:
GIF:
* I was surprised that there were only seven Google hits for this phrase, none of them with the same meaning (sic!) that I try to express in this article. The badge was inspired by this important warning sign.
P.S.: Enough data philosophy, time for music. There is so much more than one and zero and one and zero and one!
Is data a language? In search of the new discipline Data Linguistics
13 September 2010 at 01:24 (33 comments)

Yesterday Jindřich asked me for a reason to treat data as a language. I really appreciate these little conversations in data philosophy, but after a while 140 characters get too limiting. Half a year ago a similar discussion with Adrian turned into a small series of blog articles (in German). I doubt that you can find a simple and final answer to fundamental questions about data and meaning, because these questions touch the human condition. This is also the reason why we should never stop asking, unless we give up being human beings.
Jindřich’s question first made me wonder, because for me data obviously is a language. All data is represented as a sequence of bits, which can easily be defined as a formal language. But this argument is stupid and wrong. Although language can be described by formal languages (as introduced by Noam Chomsky), this description only covers syntax and grammar. Above all, a description of language must not be confused with language itself: the map is not the territory.
But data is used to communicate just like natural (written) language. The vast amount and heterogeneity of data sometimes makes us forget that all data is created by humans for humans. Let me start with a simple argument against the view of data as language. A lot of data is created by measuring nature. As nature is not language, measured data is not language. This argument is also wrong. We (humans) design measuring devices and define their language (sic!) in terms of units like length, duration, blood pressure, and so forth. These units do not exist independently of language, but are only communicated via it. And many units describe much more complex and fuzzy concepts like „name“ and „money“, which only exist as social constructs. A piece of data is a statement that can be false, true, nonsense, or all of it, depending on context. Just like language.
Nowadays we create a lot of data for machines. Is this an argument against data as language? I don’t think so. We may say that a piece of data made a machine perform some task, but the machine was designed to act in a specific way. Machines do not „understand“ data, they just react. If I use an axe to cut a tree, I do not send the tree a message of data that it interprets to cut itself. Of course computers are much more complex than trees (and much simpler from another point of view). The chain of reaction is much more subtle. And most times there are more participants. If I create some data for a specific program, I do not communicate with the program itself, but with everyone involved in creating the program and its environment. This may sound strange, but compare the situation with legal systems: a law is a piece of language, used to communicate to other people: „don’t step on the grass“. Unfortunately, society makes us think that laws are static and independent from us. In the same way people think that data is shaped by computers instead of people. Next time you get angry about a program, think about the vendor and programmer. Next time you get angry about a law, think about the lawmakers.
What follows from treating data as language? I think we need a new approach to data, a dedicated study of data. I would call this discipline data linguistics. Linguistics has many sub-fields concerned with particular aspects of natural language. The traditional division into syntax, semantics, and pragmatics describes only one way to look at language. Anthropological linguistics and sociolinguistics study the relation between language and society, and historical linguistics studies the history and evolution of languages, to mention only a few disciplines. Surprisingly, the study of data is much more limited: up to now there is no data linguistics that studies data as language. The study of data is mainly focused on its form, for instance the study of formal languages in computer science, the study of digital media in cultural studies and media studies, or the study of forms and questionnaires in graphic design and public administration (forms could be a good starting point for data linguistics).
There are some other fields that combine data and linguistics, but from different viewpoints: computational linguistics studies natural language by computational means, similar to digital humanities in general. In one branch of data analysis, linguistic summaries of data are created based on fuzzy set theory. They provide natural language statements that capture the main characteristics of data sets. Natural language processing analyses textual data by algorithmic methods. But a data linguistics that analyses data in general is still waiting to be discovered. We can only conjecture possible reasons for this lack of research:
- Data is not seen as language.
- Digital data is a relatively new phenomenon. The creation of data on a large scale mainly began in the 20th century, so there has not been enough time to explore the topic historically.
- In contrast to natural language, data is too heterogeneous to justify a combined look at data in general.
- Data seems to be well-defined, so no research is needed.
What do you think?
What is Semantic Information Retrieval?
19 August 2010 at 00:45 (no comments)

The most fun part of my dissertation is when I can procrastinate by digging deeply into the foundations of computer and information science. Lately I tried to find out when the terms „file“ and „directory“ were coined in their current sense. The first commercial disk drive was the IBM 350, introduced in 1956. It had the size of a wardrobe, stored 4.4 megabytes of 6-bit characters, and could be leased for $3,200 per month. Instances of it were also called „files“. But user files first appeared in the early 1960s with the Compatible Time-Sharing System (CTSS), the earliest ancestor of Unix. You should watch this great video from 1964 in which Robert Fano talks about making computers accessible to people. A wonderful demonstration of one of the very first command lines of a multi-user system! The explicit aims and concepts of computer systems were very similar to today’s. The more I read about the history of computing, the more it seems that all important concepts were developed in the 1960s and 1970s. The rest is just reinvention and application on a broader scale.
Robert Fano was director of Project MAC, a laboratory that brought together pioneers in operating systems, artificial intelligence, and other areas of the emerging discipline of computer science. I browsed the historical publications of the laboratory at MIT, where you can find a report on CTSS. I also stumbled upon Bertram Raphael’s PhD thesis, published at MAC in 1964. It is titled SIR: A COMPUTER PROGRAM FOR SEMANTIC INFORMATION RETRIEVAL and its abstract sounds like today’s Semantic Web propaganda:
This system demonstrates what can reasonably be called an ability to „understand“ semantic information. SIR’s semantic and deductive ability is based on the construction of an internal model, which uses word associations and property lists, for the relational information normally conveyed in conversational statements. […] The system has some capacity to recognize exceptions to general rules, resolve certain semantic ambiguities, and modify its model structure in order to save computer memory space.
The SIR expert system even seems to go beyond current RDF techniques in supporting exceptions. By the way, Bertram Raphael was at MAC at the same time as Joseph Weizenbaum. Weizenbaum fooled expectations of artificial intelligence with his program ELIZA, which he created between 1964 and 1966. He later became an important critic of artificial intelligence and of the application of computer technology in general. By the way, we need more like him instead of well-meaning, megalomaniac technology evangelists. See the documentary Rebel at Work about Weizenbaum or, even better, the promising film Plug & Pray!
So what is Semantic Information Retrieval? In short: bullshit. The term is also used independently for search indices on graph-structured data (2009), digital libraries (1998), and more. But why bother with words, meaning, and history if computers will surely „understand“ soon?
An impression of the OPDS/OpenPub catalog data model
27 May 2010 at 00:05 (7 comments)

A few days ago Ed Summers pointed me to the specification of the Open Publication Distribution System (OPDS), which was just released as version 0.9. OpenPub (an alias for OPDS) is part of the Internet Archive’s BookServer project to build an architecture for vending and lending digital books over the Internet. I wonder why I have not heard more of BookServer and OpenPub at recent library conferences, on discussion lists, and in journals, but maybe current libraries prefer to stay in the physical world and become museums and archives. Anyway, I had a look at OpenPub, so here are my public notes of my first impressions, and my answer to the call for comments. Please comment if you have corrections or additions (or create an issue in the tracker)!
OPDS is a syndication format for electronic publications based on Atom (RFC 4287). Therefore it is fully based on HTTP and the Web (that place that current libraries have yet to discover). Conceptually OPDS is somehow related to OAI(-ORE) and DAIA, but it is purely based on XML, which makes it difficult to compare with RDF-based approaches. I tried to re-engineer the conceptual data model to better separate model and serialization, like I did with DAIA. The goal of OPDS catalogs is „to make Publications both discoverable and straightforward to acquire on a range of devices and platforms“.
OPDS uses a mix of DCMI Metadata Terms (DC) elements and Atom elements, enriched with some new OPDS elements. Furthermore it interprets some DC and Atom elements in a special way (this is common in many data formats although frequently forgotten).
Core concepts
The core concepts of OPDS are Catalogs, which are provided as Atom feeds (like Jangle, which should fit nicely for library resources), Catalog Entries, each of which refers to one publication, and Acquisition Links. There are two disjoint types of Catalogs: Navigation Feeds provide a browseable hierarchy and Acquisition Feeds contain a list of Publication Entries. I will skip the details on Navigation Feeds and search facilities (possible via OpenSearch) and focus on Catalog Entries and Acquisition.
Catalog Elements
The specification distinguishes between Partial and Complete Catalog Entries, but this is not relevant on the conceptual level. There we have two concepts that are not clearly separated in the XML serialization: the Catalog Record and the Publication it describes are mixed into one Catalog Entry. The properties of a Catalog Record are:
- atom:id: identifier of the catalog entry (MANDATORY)
- atom:updated: modification timestamp of the catalog entry (MANDATORY)
- atom:published: timestamp of when the catalog entry was first accessible
The properties of a Publication are:
- dc:identifier: identifier of the publication
- atom:title: title of the publication (MANDATORY)
- atom:author: creator of the publication (possibly with sub-properties)
- atom:contributors: additional contributors to the publication (ditto)
- atom:category: publication’s category, keywords, classification codes etc. (with sub-properties scheme, term, and label)
- dc:issued: first publication date of the publication
- atom:rights: rights held in and over the publication
- atom:summary and atom:content: description of the publication (as plain text or some other format for atom:content)
- dc:language: language(s) of the publication (any format?)
- dc:extent: size or duration of the publication (?)
- dc:publisher: publisher of the publication
Moreover, each publication may link to related resources. Unfortunately you cannot just use arbitrary RDF properties, but only the following relations (from this draft):
- alternate: alternative description of the publication
- copyright: copyright statement that applies to the catalog entry
- latest-version: more recent version of the publication
- license: license associated with the catalog entry
- replies: comment on or discussion of the catalog entry
I consider these relation types one of the weakest points of OPDS. The domain and range of the links are not clear, and there are much better vocabularies for links between publications, for instance FRBR, the Bibliographic Ontology, the Citation Typing Ontology, Memento, and SIOC (which also overlaps with OPDS in other places).
In addition, each publication must contain at least one atom:link element, which is used to encode an Acquisition Link.
Acquisition Links
OPDS defines two acquisition types: „Direct Acquisition“ and „Indirect Acquisition“. Direct Acquisition links must lead directly to the publication (in some format) without any login, meta, or catalog page in front of it (!), while Indirect Acquisition links lead to such a portal page that then links to the publication. There are five acquisition types (called „Acquisition Relations“), similar to DAIA service types:
- opds:acquisition: a complete representation of the publication that may be retrieved without payment
- opds:acquisition/borrow: a complete representation of the publication that may be retrieved as part of a lending transaction
- opds:acquisition/buy: a complete representation of the publication that may be retrieved as part of a purchase
- opds:acquisition/sample: a representation of a subset of the publication
- opds:acquisition/subscribe: a complete representation of the publication that may be retrieved as part of a subscription
opds:acquisition can be mapped to daia:Service/Openaccess and opds:acquisition/borrow can be mapped to daia:Service/Loan (and vice versa). opds:acquisition/buy is not defined in DAIA but could easily be added, while daia:Service/Presentation and daia:Service/Interloan are not defined in OPDS. At least the first should be added to OPDS to indicate publications that require you to become a member and log in, or to physically walk into an institution to get a publication (strictly limiting OPDS to purely digital publications accessible via HTTP is stupid if you allow indirect acquisition).
The remaining two acquisition types somehow do not fit in with the others: opds:acquisition/sample and opds:acquisition/subscribe should be orthogonal to the other relations. For instance, you could subscribe to a paid or a free subscription, and you could buy a subset of a publication.
In addition, Acquisition links may or must contain some other properties, such as opds:price (consisting of a currency code from ISO 4217 and a value).
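As a small illustration of how these pieces fit together, here is a Perl sketch that extracts acquisition links from an OPDS feed and maps two of them to DAIA service types. The feed URL is a placeholder, and the full relation URIs (expanded from the opds:acquisition shorthand above) are an assumption:

```perl
# Sketch: extract acquisition links from an OPDS feed and map them to DAIA
# service types. Feed URL and the exact relation URIs are assumptions.
use strict;
use warnings;
use LWP::Simple qw(get);
use XML::LibXML;

my %daia_service = (                                   # assumed mapping, see discussion above
    'http://opds-spec.org/acquisition'        => 'openaccess',
    'http://opds-spec.org/acquisition/borrow' => 'loan',
);

my $doc = XML::LibXML->load_xml( string => get('http://example.org/opds/catalog.atom') );
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( atom => 'http://www.w3.org/2005/Atom' );

foreach my $entry ( $xpc->findnodes('//atom:entry') ) {
    my $title = $xpc->findvalue( 'atom:title', $entry );
    foreach my $link ( $xpc->findnodes( 'atom:link', $entry ) ) {
        my $rel = $link->getAttribute('rel') // '';
        next unless $rel =~ m{^http://opds-spec\.org/acquisition};
        my $service = $daia_service{$rel} // 'unknown';
        printf "%s: %s (DAIA service: %s)\n", $title, $link->getAttribute('href'), $service;
    }
}
```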
Cover and artwork links
Besides Acquisition links, the relations opds:cover and opds:thumbnail can be used to relate a Publication to its cover or some other visual representation. The thumbnail should not exceed 120 pixels in height or width, and images must be either GIF, JPEG, or PNG. Thumbnails may also be embedded directly via the „data“ URL scheme from RFC 2397.
Final thoughts
OPDS looks very promising and it is already used to good effect in practice. There are some minor issues that can easily be fixed. The random selection of relation types is surely a flaw that can be repaired by allowing arbitrary RDF properties (come on XML fanboys, you should notice that RDF is good at least at link types!), and the list of acquisition types should be cleaned up and enhanced, at least to support „presentation“ without lending like DAIA does. A typical use case for this are National Licenses that require you to register before accessing the publications. For more details I would like to compare OPDS in more depth with models like DAIA, FRBR, SIOC, OAI-ORE, Europeana etc. – but not now.
First complete draft of DAIA Ontology
7 January 2010 at 19:06 (6 comments)

I just finished the first complete draft of an OWL ontology for the DAIA data model. Until the final URI prefix is settled, the ontology is available in the GBV Wiki in Notation3 syntax, but you can also get RDF/XML. There is also a browsable HTML view created with OWLDoc (I only wonder why it does not include URI prefixes like the same view of the Bibliographic Ontology does).
It turned out that mapping the XML format DAIA/XML to RDF is not trivial, although I kept this in mind when I designed DAIA. XML is mostly based on a closed-world tree data model, but RDF is based on an open-world graph model. Last month Mike Bergman wrote a good article about the clash of the Open World Assumption and the Closed World Assumption. I think as long as you only view data in the form of tables, lists, and trees, you will not grasp the concept of the Semantic Web. I don’t know whether I have fully grasped the concept of document availability with DAIA, and the ontology surely needs some further review, but it’s something to start with – just have a look!
How to encode the availability of documents
23 October 2009 at 12:50 (2 comments)

For almost a year I have been working on a simple encoding format and API to get just the current (!) availability status of documents in libraries. Together with Reh Uwe (hebis network) and Anne Christensen (beluga project) we created the Document Availability Information API (DAIA), which is defined as a data model with encodings in XML and JSON (whichever you prefer).
This week I finished and published a reference implementation of the DAIA protocol as an open source Perl module on CPAN. The implementation includes a simple DAIA validator and converter. A public installation of this validator is also available. The next tasks include implementing server and client components for several ILS systems. Every library has its own special rules and schemas – Jonathan Rochkind already wrote about the problems of implementing DAIA because of ILS complexity. We cannot erase this complexity by magic (unless we refactor and clean up the ILS), but at least we can try to map it to a common data model – which DAIA provides.
With the DAIA Perl package you can concentrate on writing wrappers from your library systems to DAIA and easily consume and evaluate DAIA-encoded information. Why should everyone write their own routines to, for instance, scrape the HTML OPAC output and parse the availability status? One mapping to DAIA should fit most needs, so others can build upon it. DAIA can be helpful not only to connect different library systems, but also to create mashups and services like „Show me on a map where a given book is currently held and available“ or „Send me a tweet when a given book in my library is available again“. If you have more cool ideas for client applications, just let me know!
In the context of the ILS Discovery Interface Task Force and its official recommendation, DAIA implements the GetAvailability method (section 6.3.1). There are numerous APIs for various tasks in library systems (SRU/SRW, Z39.50, OpenSearch, OAI-PMH, Atom, unAPI etc.), but there was no open, usable, standard way to simply query whether a copy of a given publication – for instance a book – is available in a library, in which department, whether you can loan it or only use it in the library, whether you can get it online directly, or how long it will probably take until it is available again (yes, I looked at alternatives like Z39.50, ISO 20775, NCIP, SLNP etc., but they are hardly defined, documented, implemented, and freely usable on the Web). I hope that DAIA is easy enough for non-librarians to make use of it if libraries provide a DAIA API to their systems. Extensions to DAIA can be discussed, for instance in the Code4Lib Wiki, but I’d prefer to start with these basic, predefined services:
- presentation: an item can be used inside the institution (in their rooms, in their intranet etc.).
- loan: an item can be used outside of the institution (by lending or online access).
- interloan: an item can be used mediated by another institution. That means you do not have to interact with the institution that was queried for this item. This includes interlibrary loan as well as public online resources that are not hosted or made available by the queried institution.
- openaccess: an item can be used immediately and without any restrictions by the institution; you don’t even have to give it back. This applies to Open Access publications and free copies.
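To give an idea of how a client might use such a service, here is a rough Perl sketch that queries a hypothetical DAIA server and prints the services under which a document is currently available. The base URL, the document identifier, and the JSON field names are assumptions, not a definitive description of the format:

```perl
# Sketch of a DAIA client: ask a (hypothetical) DAIA server whether a document
# is currently available, and for which services. URL, parameters, and JSON
# field names are assumptions and may differ from a concrete implementation.
use strict;
use warnings;
use LWP::Simple qw(get);
use JSON qw(decode_json);
use URI;

my $uri = URI->new('http://example.org/daia');                    # hypothetical DAIA endpoint
$uri->query_form( id => 'http://example.org/doc/12345', format => 'json' );

my $daia = decode_json( get($uri) );

# Assumed response shape: documents contain items, items contain availability
# statements with a service name such as 'loan' or 'presentation'.
foreach my $doc ( @{ $daia->{document} || [] } ) {
    foreach my $item ( @{ $doc->{item} || [] } ) {
        print "available for $_->{service}\n" for @{ $item->{available} || [] };
    }
}
```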
Class or Property? Objectification in RDF and data modeling
14 August 2009 at 00:23 (14 comments)

A short Twitter statement, in which Ross Singer asked about encoding MARC relator codes in RDF, reminded me of a basic data modeling question that I have been thinking about for a while: when should you model something as a class and when should you model it as a property? Is there a need to distinguish at all? The question is not limited to RDF but is fundamental in data/information modeling. In Entity-Relationship modeling (Chen 1976) the question is whether to use an entity or a relation. Let me give an example with two subject-predicate-object statements in RDF Notation3:
:Work dc:creator :Agent .
:Agent rdf:type :Creator .
The first statement says that a specific agent (:Agent) has created (dc:creator) a specific work (:Work). The second statement says that :Agent is a creator (:Creator). In the first statement dc:creator is a property, while in the second :Creator is a class. You could define that the one implies the other, but you still need two different concepts, because classes and properties are disjoint (at least in OWL – I am not sure about plain RDF). In Notation3 the implications may be written as:
{ ?work dc:creator ?agent } => { ?agent a :Creator } .
{ ?agent a :Creator } => { [] dc:creator ?agent } .
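Instead of relying on a reasoner, the first implication can also be materialized programmatically. Here is a minimal sketch with RDF::Trine, assuming a made-up :Creator class URI: for every triple with predicate dc:creator, an rdf:type statement is added for the object.

```perl
# Sketch: materialize "everything that is the object of dc:creator is a :Creator".
# The :Creator class URI is made up for illustration.
use strict;
use warnings;
use RDF::Trine;

my $model = RDF::Trine::Model->temporary_model;   # assume data has been parsed into $model

my $dc_creator = RDF::Trine::Node::Resource->new('http://purl.org/dc/elements/1.1/creator');
my $rdf_type   = RDF::Trine::Node::Resource->new('http://www.w3.org/1999/02/22-rdf-syntax-ns#type');
my $creator    = RDF::Trine::Node::Resource->new('http://example.org/Creator');   # hypothetical class

my $iter = $model->get_statements( undef, $dc_creator, undef );
while ( my $st = $iter->next ) {
    # the object of dc:creator is the agent; assert that it is a Creator
    $model->add_statement( RDF::Trine::Statement->new( $st->object, $rdf_type, $creator ) );
}
```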
If you define two URIs for class and property of the same concept (the concept of a creator and creating something) then the two things are tightly bound together: Everyone who ever created something is a creator, and to be a creator you must have created something. This logic rule sounds rather rude if you apply it to other concepts like to lie and to be a liar or to sing and to be a singer. Think about it!
Besides the lack of fuzzy logic on the Semantic Web, I miss an easy way to do „reification“ (there is another concept called „reification“ in RDF, but I have never seen it in the wild) or „objectification“: you cannot easily convert between classes and properties. In a closed ontology this is less of a problem because you can just decide whether to use a class or a property. But the Semantic Web is about sharing and combining data! What if Ontology A has defined a „Singer“ class and Ontology B a „sings“ property, both of which refer to the same real-world concept?
Other data modeling languages support objectification (more or less). Terry Halpin, the creator and evangelist of Object-Role Modeling (ORM), wrote a detailed paper about objectification in ORM that does not fail to mention the underlying philosophical questions. My (doubtful) philosophical intuition makes me think that properties are more problematic than classes, because the latter can easily be modeled as sets. I think the need for objectification, and for bringing together classes and properties with similar meaning, will increase the more „semantic“ data we work with. In many natural languages you can use a verb or adjective as a noun by nominalization. The meaning may change slightly, but it is still very useful for communication. Maybe we should rely more on natural language instead of dreaming of defining without ambiguity?