OAI – Jakoblog

Working group on digital library APIs and possible outcomes

jakob — Sun, 13 Apr 2008 12:48:50 +0000

Last year the Digital Library Federation (DLF) formed the „ILS Discovery Interface Task Force“, a working group on APIs for digital libraries. See their agenda and the current draft recommendation (February 15th) for details [via Panlibus]. I’d like to comment briefly on the essential functions they agreed on at a meeting with major library system (ILS) vendors. Peter Murray summarized the functions as „automated interfaces for offloading records from the ILS, a mechanism for determining the availability of an item, and a scheme for creating persistent links to records.“

On the one hand I welcome it when vendors try to agree on (open) standards and service-oriented architecture. On the other hand the working group is yet another top-down effort to discuss things that just have to be implemented based on existing Internet standards.

1. Harvesting: In the library world this is mainly done via OAI-PMH. I’d also consider RSS and Atom. To fetch single records, there is unAPI – which the DLF group does not mention. There is no need for any other harvesting API – missing features (if any) should be integrated into extensions and/or next versions of OAI-PMH and ATOM instead of inventing something new. P.S: Google Wave shows what to expect in the next years.

2. Search: There is still good old overblown Z39.50. The near future is (slightly overblown) SRU/SRW and (simple) OpenSearch. There is no need for discussion but for open implementations of SRU (I am still waiting for a full client implementation in Perl). I suppose that next generation search interfaces will be based on SPARQL or other RDF-stuff.

3. Availability: The announcement says: „This functionality will be implemented through a simple REST interface to be specified by the ILS-DI task group“. Yes, there is definitely a need (in December I wrote about such an API in German). However the main point is not the API but to define what „availability“ means. Please focus on this. P.S: DAIA is now available.

4. Linking: For „Linking in a stable manner to any item in an OPAC in a way that allows services to be invoked on it“ (announcement) there is no need to create new APIs. Add and propagate clean URIs for your items and point to your APIs via autodiscovery (HTML link element). That’s all. Really. To query and distribute general links for a given identifier, I created the SeeAlso API which is used more and more in our libraries.
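To illustrate the autodiscovery idea from the last point, here is a minimal sketch of how a client could pick up API pointers from the HTML link elements of a record page. The page, its URLs and the record number are made up for the example; only the `unapi-server` link relation is taken from unAPI, and a real client would of course fetch the page over HTTP first.

```python
from html.parser import HTMLParser


class LinkDiscovery(HTMLParser):
    """Collect <link> elements that advertise machine-readable services."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        if "href" in attributes and "rel" in attributes:
            self.links.append(
                (attributes["rel"], attributes.get("type"), attributes["href"])
            )


def discover(html):
    """Return (rel, type, href) tuples for every <link> element in the page."""
    parser = LinkDiscovery()
    parser.feed(html)
    return parser.links


# Hypothetical record page; host name and record identifier are invented.
SAMPLE_PAGE = """
<html><head>
  <title>Record 12345</title>
  <link rel="unapi-server" type="application/xml"
        title="unAPI" href="http://opac.example.org/unapi">
  <link rel="alternate" type="application/atom+xml"
        href="http://opac.example.org/record/12345.atom">
</head><body></body></html>
"""

for rel, media_type, href in discover(SAMPLE_PAGE):
    print(rel, media_type, href)
```

The point of the sketch is that nothing library-specific is needed on the client side: a clean URI for the item plus a few link elements is enough to find the services.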

Furthermore the draft contains a section on „Patron functionality“ which is going to be based on NCIP and SIP2. Both are dead ends from my point of view. You would do better to look at projects outside the library world and try to define schemas/ontologies for patrons and patron data (hint: patrons are also called „customer“ and „user“). Again: it is not the API that is underdefined – it’s the data which we need to agree on.

Wikisource in the DFG-Viewer thanks to open interfaces

jakob — Mon, 31 Mar 2008 13:52:14 +0000

The DFG-Viewer is a relatively new web application for displaying digitized works. The project, funded by the Deutsche Forschungsgemeinschaft (DFG), is meant to help establish standards for digitization projects, and thanks to web services and open standards it already does this quite well.

Prompted by a pointer to the Ponickau collection at the ULB Sachsen-Anhalt and a subsequent discussion about the ongoing confusion around URI, URN, URL, identifiers and locators, I took a closer look at the DFG-Viewer. The display does not look quite as cool as The Open Library, but in return there are open interfaces. Digitized works can be passed to the viewer via OAI or as a direct URL in METS/MODS format. You can then page through the individual pages of a digitized book and its internal structure (table of contents). Full-text search is apparently not implemented yet and there is no zoom function of its own; so far it is only possible to switch between different resolutions, provided the repository delivers them.

A copy of the book mentioned as an example on INETBIB, with the VD17 number 32:623995L, is available in digitized form in Halle. The metadata of the digitized copy can be retrieved via OAI in METS/MODS. If you pass that URL to the DFG-Viewer, the digitized copy can be viewed there. At the moment one manual step is still required, because the DFG-Viewer has a wrong (?) OAI server registered for Halle, but in principle the mashup works.

Instead of assembling a METS file with porn pictures just for fun to have it displayed in the DFG-Viewer, I picked a random digitized work from Wikisource. Wikisource has an index page for every digitized work, listing some metadata and the pages of the digitized original. A METS/MODS file can be generated from this page and sent to the DFG-Viewer. Two to three hours later I had a simple Perl script that creates a METS file from a Wikisource index page, and the result can be viewed in the DFG-Viewer.

The whole thing is only a quickly hacked proof of concept. A stable use of Wikisource metadata should consist of an OAI interface that delivers METS/MODS (and MABXML for ZVDD). If anyone is interested (bachelor’s or diploma thesis, own project, etc.), I am happy to offer my support, but someone else will have to implement it for now, because I cannot keep starting new projects all the time.
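The Perl script itself is not reproduced here. As a rough idea of what such a conversion involves, the following Python sketch builds a bare-bones METS document from a title and a list of page image URLs, which is roughly the information an index page provides. It is an illustration under several assumptions, not the original script: the function name, the `FILE_0001` style identifiers, the `USE="DEFAULT"` file group and the JPEG media type are chosen for the example, the MODS metadata section is left out entirely, and the output is not claimed to satisfy everything the DFG-Viewer expects.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)


def build_mets(title, page_urls):
    """Build a minimal METS tree with one image file and one page div per URL."""
    root = ET.Element(f"{{{METS}}}mets", {"LABEL": title})

    # fileSec: where the page images live.
    file_sec = ET.SubElement(root, f"{{{METS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", {"USE": "DEFAULT"})

    # structMap: the physical sequence of pages, each pointing to its file.
    struct_map = ET.SubElement(root, f"{{{METS}}}structMap", {"TYPE": "PHYSICAL"})
    sequence = ET.SubElement(struct_map, f"{{{METS}}}div", {"TYPE": "physSequence"})

    for number, url in enumerate(page_urls, start=1):
        file_id = f"FILE_{number:04d}"
        file_el = ET.SubElement(
            file_grp, f"{{{METS}}}file", {"ID": file_id, "MIMETYPE": "image/jpeg"}
        )
        ET.SubElement(
            file_el, f"{{{METS}}}FLocat", {"LOCTYPE": "URL", f"{{{XLINK}}}href": url}
        )
        page = ET.SubElement(
            sequence, f"{{{METS}}}div", {"TYPE": "page", "ORDER": str(number)}
        )
        ET.SubElement(page, f"{{{METS}}}fptr", {"FILEID": file_id})
    return root


# Made-up page URLs standing in for the scans listed on an index page.
mets = build_mets(
    "Example work",
    ["http://example.org/scans/page1.jpg", "http://example.org/scans/page2.jpg"],
)
print(ET.tostring(mets, encoding="unicode"))
```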

First draft of OAI-ORE

jakob — Sun, 30 Dec 2007 17:06:09 +0000

„Web 3.0“ (or „Semantic Web“ – use the buzzword of your choice) is slowly on the rise. Two weeks ago the first public draft of OAI-ORE was published and Mike Giarlo published an OAI-ORE plugin for WordPress. I have not actually tried it, but as far as I understand one could add RFC 5005 to OAI-ORE to support large resource sets. Or is OAI-PMH enough? Well, in the end it depends on the availability of software libraries and clients and the ease of connecting them with other services. For my taste there are still too many generalized data models, whereas we need concrete implementations – it was not RDF and OWL but Microformats that got the Web of data started (yes, we’re in it: the next hype after „Web 2.0“). For 2008 I wish for less abstract meta-meta-meta-stuff and more small, usable applications and services that can be combined.

Relevant APIs for (digital) libraries

jakob — Fri, 30 Nov 2007 13:50:11 +0000

My current impression is that the OCLC/WorldCat Service Grid is still far too abstract – instead of creating a framework, we (libraries and library associations) should agree upon some open protocols and (metadata) formats. To start with, here is a list of relevant, existing open standard APIs from my point of view:

Search: SRU/SRW (including CQL), OpenSearch, Z39.50

Harvest/Syndicate: OAI-PMH, RSS, Atom Syndication (also with ATOM Extensions)

Copy/Provide: unAPI, COinS, Microformats (not a real API but a way to provide data)

Upload/Edit: SRU Update, Atom Publishing Protocol

Identity Management: Shibboleth (and other SAML-based protocols), OpenID (see also OSIS)

For more complex applications, additional (REST)-APIs and common metadata standards need to be found (or defined) – but only if the application is not just another kind of search, harvest/syndicate, copy/provide, upload/edit, or Identity Management.

P.S: I forgot NCIP, a „standard for the exchange of circulation data“. Frankly I don’t fully understand the meaning and importance of „circulation data“ and the standard looks more complex than needed. More on APIs for libraries can be found in the WorldCat Developer Network, in the Jangle project and in a DLF working group on digital library APIs. For staying in the limited world of libraries this may suffice, but on the web simplicity and availability of implementations matter – that’s why I am working on the SeeAlso linkserver protocol and now on a simple API to query availability information (more in August/September 2008).

P.P.S: A more detailed list of concrete library-related APIs was published by Roy Tennant based on a list by Owen Stephens.

P.P.P.S: And another list by Stephen Abram (SirsiDynix) from September 1st, 2009.

Archiving Weblogs with ATOM and RFC 5005: An alternative to OAI-PMH

jakob — Fri, 19 Oct 2007 09:34:45 +0000

Following up on my recent post (in German) I had a conversation with my colleague about harvesting and archiving blogs and ATOM vs OAI-PMH. In my opinion, with the recent RFC 5005 on Feed Paging and Archiving and its proposed extension of Archived Feeds, ATOM can be an alternative to OAI-PMH. Instead of arguing which is better, digital libraries should support both for harvesting and providing archived publications such as preprints and weblog entries (scientific communication and publication already takes place in both).

Instead of having every project implement both protocols, you could create a wrapper from ATOM with archived feeds to OAI-PMH and vice versa. The mapping from OAI-PMH to ATOM is probably the easier part: you partition the repository into chunks as defined in RFC 5005 using the from and until arguments of OAI-PMH. The mapping from ATOM to OAI-PMH is more complicated because you cannot select by timestamp in ATOM. If only a from argument is specified, the corresponding ATOM feed can be harvested going backwards in time, but if there is an until argument you must harvest the whole archive just to get the first entries and throw away the rest. Luckily the most frequent use case is to get the newest entries only. Anyway: both protocols have their pros and cons and a two-way wrapper could help both. Of course it should be implemented as open source so anyone can use it (by the way, there seems to be no OAI crawler in Perl yet: sure, there is OAI-Harvester, but for real-world applications you have to deal with unavailable servers, corrupt feeds, duplicated or deleted entries, and a way to save the harvested records, so a whole layer above the harvester is missing).

P.S.: At code4lib Ed Summers pointed me to Stuart Weibel, who asked the same question about blog archiving, and to a discussion in Jon Udell’s blog that includes blog archiving (he also mentions BlogML as a possible part of a solution – unfortunately BlogML looks very dirty to me, the spec is here). And Daniel Chudnov drafted a blog mirroring architecture.
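The easier direction, cutting an OAI-PMH repository into chunks that could each back one archive document of an RFC 5005 feed, might look roughly like the sketch below. The monthly chunk size, the function names and the repository URL are assumptions made for the example; `verb`, `metadataPrefix`, `from` and `until` are the regular OAI-PMH request arguments, used here with day granularity.

```python
from datetime import date
from urllib.parse import urlencode


def month_windows(start, end):
    """Yield (first_day, last_day) for every calendar month touching start..end."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        first = date(year, month, 1)
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        # The last day of this month is the day before the first of the next one.
        last = date.fromordinal(date(next_year, next_month, 1).toordinal() - 1)
        yield first, last
        year, month = next_year, next_month


def list_records_url(base_url, first, last, metadata_prefix="oai_dc"):
    """Build the OAI-PMH ListRecords request that selects one chunk by datestamp."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": first.isoformat(),
        "until": last.isoformat(),
    })
    return f"{base_url}?{query}"


# Hypothetical repository; each printed URL would feed one archive document.
for first, last in month_windows(date(2007, 8, 20), date(2007, 10, 19)):
    print(list_records_url("http://repository.example.org/oai", first, last))
```

A wrapper would still have to follow resumption tokens within each chunk and turn the records into ATOM entries; the sketch only shows the partitioning step.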

Collecting, indexing, providing and archiving weblogs

jakob — Fri, 19 Oct 2007 01:03:51 +0000

For quite some time I have been annoyed that practically no libraries collect and archive weblogs, although this type of medium is already partly taking over the function of scholarly journals. Meanwhile I do notice a growing interest in blogs among colleagues (the next workshop was fully booked within a short time), but the majority has not really grasped yet that an evolution comparable to the introduction of printing or the invention of journals is under way. Otherwise many more libraries would surely begin to collect, index, provide and archive weblogs.

Instead of first discussing which special MAB fields the data should go into and which type of medium weblogs count as, somebody would just have to extend one of the existing open source feed readers so that it runs at large scale on one or more servers and collects at least those feeds that some librarian has at some point classified as worth collecting. Anything that is well-formed XML and comes with a minimum set of mandatory elements (author [string], title [string], date [ISO 8601], content [string]) should at least be archivable in such a way that the essential part can be reconstructed. Special features such as HTML content, categories and comments can be added later, once the infrastructure is in place (a harvester for collecting, storage for archiving, an index for indexing and some means of reading for providing access).

For the millions of blog articles that have been lost so far (apart from blog search engines such as Bloglines, Technorati, Google Blogsearch, Blogdigger etc., which are not available for archiving) there is at least some hope:

In September, RFC 5005: Feed Paging and Archiving was published. It defines an extension of the ATOM format (also usable in RSS) in which the feed of the latest entries links to the preceding entries and/or an archive. In principle this has been possible for some time and is described here with an example, but now it has been specified more precisely. This makes ATOM a real alternative to OAI-PMH, which is somewhat closer to the library world but unfortunately is also still rather neglected.

Be that as it may: so far blogs are not being collected systematically and permanently for posterity, and if libraries have a future at all, they are the only institutions that really qualify for the job. For that, however, the „acquisition“ of a blog for the library’s holdings should in the coming years become as familiar as the purchase of a book or a journal. As far as I am concerned, DFG grant applications for the „collection and archiving of the cultural heritage available in the form of weblogs“ may be submitted too, although I am rather sceptical about this project culture: the continuous further development of applications as open source achieves more, and the wheel is reinvented less often.

P.S.: The DNB’s information page on collecting online publications does not mention weblogs yet, so it is up to each individual library to think about collecting the weblogs relevant to it.
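The minimum record described above (author, title, ISO 8601 date, content, all taken from well-formed XML) is small enough to sketch in a few lines. The class and function names are made up for the example, the input is assumed to be a single Atom entry with plain-text content, and anything beyond the four fields is deliberately ignored, in line with the idea that HTML content, categories and comments can come later.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


@dataclass
class ArchivedEntry:
    author: str
    title: str
    date: datetime
    content: str


def parse_entry(entry_xml):
    """Extract the four mandatory fields from one Atom entry.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed
    and ValueError if a mandatory element is missing or empty.
    """
    entry = ET.fromstring(entry_xml)
    fields = {
        "author": entry.findtext("atom:author/atom:name", namespaces=ATOM_NS),
        "title": entry.findtext("atom:title", namespaces=ATOM_NS),
        "date": entry.findtext("atom:updated", namespaces=ATOM_NS),
        "content": entry.findtext("atom:content", namespaces=ATOM_NS),
    }
    missing = [name for name, value in fields.items() if not (value and value.strip())]
    if missing:
        raise ValueError("missing mandatory elements: " + ", ".join(missing))
    # Replacing a trailing "Z" keeps datetime.fromisoformat() working on
    # Python versions that do not accept it directly.
    fields["date"] = datetime.fromisoformat(fields["date"].strip().replace("Z", "+00:00"))
    return ArchivedEntry(**fields)


# Invented sample entry for illustration.
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>An example post</title>
  <author><name>jakob</name></author>
  <updated>2007-10-19T01:03:51Z</updated>
  <content type="text">Example content</content>
</entry>"""

print(parse_entry(SAMPLE_ENTRY))
```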
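How a harvester could walk from the current feed back through the archive can be sketched as follows. This is only an illustration: it assumes the feed uses the `prev-archive` link relation from RFC 5005, takes the download function as a parameter so that no particular HTTP library is implied, and leaves out everything a real crawler needs (error handling, storage, duplicate detection). The function name and the document limit are made up.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

ATOM = "{http://www.w3.org/2005/Atom}"


def harvest_backwards(start_url, fetch, max_documents=1000):
    """Yield (document_url, entry_element) pairs, newest feed document first.

    `fetch` is any callable that maps a URL to the XML text of a feed document.
    """
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_documents:
        seen.add(url)
        feed = ET.fromstring(fetch(url))
        for entry in feed.findall(ATOM + "entry"):
            yield url, entry
        # Each document links to its predecessor; the oldest one has no such link.
        previous = next(
            (link.get("href") for link in feed.findall(ATOM + "link")
             if link.get("rel") == "prev-archive"),
            None,
        )
        url = urljoin(url, previous) if previous else None
```

Passing the fetch function in makes the loop easy to try out against a handful of feed documents held in a dictionary before pointing it at a live blog.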

Second day at MTSR

jakob — Thu, 18 Oct 2007 16:46:04 +0000

It was already a week ago (conference blogging should be published immediately), so I had better summarize my final notes of the MTSR conference 2007: The second day started with the keynote speech Using semantics to enhance b2b integration by Jorge Cardoso. He reported on a survey about the creation and usage of ontologies – I hope he soon publishes the slides or a paper because there were too many details to remember. The second part of his speech was about Service Oriented Architecture (SOA) and Webservices – all highly relevant but less new to me.

After the coffee break Haibo Jia presented A new Formal Concept Analysis based learning approach to Ontology building. He extracted 2201 different author keywords and 107 different classification terms from 900 documents of the ACM, added stemming and used the result to create an ontology for query expansion. I hope to follow his work on extracting and analyzing author keywords and classification terms too.

I could not follow the next presentation, Formalizing Dublin Core Application Profiles Description Set Profiles and Graph Constraints by Fredrik Enoksson, although I would have liked to get to know Alistair’s work. Dublin Core Application Profiles will surely become more important in the future but this presentation did not help me a lot – maybe I was just inattentive because of my own presentation.

So I switched to another session and saw one of my highlights: Panayiota Polydoratou presented the project Repository Interface for Overlaid Journal Archives (RIOJA). To prepare the establishment of an „Overlay Journal“ a detailed survey was held among researchers in the astrophysics and cosmology community. The overlay journal concept was already coined by Ginsparg in 1996: a peer-reviewed journal is organized on top of a repository like arXiv.org. The survey showed that most scientists are not interested in a printed journal but in quality control and copyediting. Everything else can already be done with the repository (which the scientists surveyed already use a lot).

By the way: RIOJA is not the only project to prepare and create an overlay journal. I found the JISC-funded Overlay journal infrastructure for Meteorological Sciences (OJIMS) and Logical Methods in Computer Science (LMCS), which has existed since 2005. Peter Murray pointed out that the blog TotallySynthetic.com and the digital library Crystaleye, for instance, can be called overlay journals. Looks like the concept of a journal is evolving rapidly (Wake up librarians! When will you finally start to collect weblogs? The Times They Are A-Changin‘!). Ok, back to the conference, I will later think more about Overlay Journals, OpenAccess, Science 2.0, OAI-ORE…

In the next session I saw Fredrik Enoksson with An RDF Modification Protocol, based on the Needs of Editing Tools. He explained that RDF is usually stored in triple stores and that there is a need for editing RDF remotely. In the LUISA project a remote editing protocol for RDF is being developed, and the existing SPARQL update language needs to be extended to fulfill the needs of editing applications.

Apart from the low number of participants my own presentation went very well – I got two good questions that helped me to think more about a SKOS encoding of country codes. The slides are at slideshare and a paper will follow.

Finally there were also some posters – but no real poster session in which to find the authors of a particular poster. A poster titled Combining Collaborative Tagging and Ontologies in Image Retrieval Systems attracted my attention because of the topic, but it looked very much like a draft and the explanation contained several gaps, so I cannot say whether it contained some valuable research or not. I think the basic idea was to enrich a query with additional synonyms and query a combined index of automatically generated index terms, controlled terms assigned by the creator, and uncontrolled tags.

Video recordings of the presentations will be published in the coming weeks and post-proceedings will be published in Springer’s Lecture Notes in Computer Science (LNCS) series.

Syndication and Harvesting with RSS, ATOM, OAI-PMH and Sitemaps

jakob — Fri, 28 Sep 2007 10:32:16 +0000

On my quest for metadata formats and APIs I found that ATOM is not just another RSS but more like a simple database language. Google’s Data API GData strongly pushes ATOM forward (but may also introduce some problems). Jim Downing wrote about ATOM, OAI-PMH, and Sitemaps – three different ways to provide a list of all the resources in a collection, and to incrementally discover changes. OAI-PMH is much less prominent, but why?
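Of the three approaches, Sitemaps is the one with the least machinery for incremental discovery: a harvester simply compares the optional lastmod dates itself. A minimal sketch of that, with an invented function name and made-up URLs, could look like this (OAI-PMH would do the same selection on the server side via its from argument):

```python
import xml.etree.ElementTree as ET

SITEMAP = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def changed_since(sitemap_xml, since):
    """Return the <loc> URLs whose <lastmod> date is on or after `since`.

    `since` is a YYYY-MM-DD string. Dates are compared as strings truncated
    to YYYY-MM-DD, which sort chronologically; entries without <lastmod>
    are always included because nothing is known about them.
    """
    urls = []
    for url in ET.fromstring(sitemap_xml).findall("sm:url", SITEMAP):
        loc = url.findtext("sm:loc", namespaces=SITEMAP)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP)
        if loc and (lastmod is None or lastmod.strip()[:10] >= since):
            urls.append(loc.strip())
    return urls


# Hypothetical sitemap of a small repository.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://repo.example.org/1</loc><lastmod>2007-09-01</lastmod></url>
  <url><loc>http://repo.example.org/2</loc><lastmod>2007-09-28T10:32:16+00:00</lastmod></url>
  <url><loc>http://repo.example.org/3</loc></url>
</urlset>"""

print(changed_since(SAMPLE_SITEMAP, "2007-09-15"))
```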

Andy Powell started a very enlightening discussion with his talk at the JISC Digital repositories conference 2007. He complains that repositories are partly missing the web – popular we-could-also-call-them-repositories like Flickr, Slideshare, YouTube, Scribd etc. don’t use OAI-PMH, nor does Google support it. Following the discussion I ask myself what the differences are between scholarly communication and people uploading and mixing any popular content. And do the differences justify different methods of syndication and harvesting? Have a look at the comments by Herbert van de Sompel and Erik Hetzner!

OAI Object Re-Use and Exchange (OAI-ORE)

jakob — Fri, 27 Apr 2007 13:52:28 +0000

In the netbib weblog, Lambert points to the project OAI Object Re-Use and Exchange (OAI-ORE) of the Open Archives Initiative (OAI), which already gave us the wonderfully simple OAI Protocol for Metadata Harvesting (OAI-PMH) six years ago (sic!). Unfortunately OAI is still not integrated into library practice as smoothly as it could be. Data providers fail every now and then or deliver broken data (which amounts to the same thing), harvesting processes are not embedded in automated workflows, and the quality of the metadata is … well, what can you expect when no automatic validation routines are installed. Lambert points to the event held last week at CERN (where, incidentally, the WWW was invented), Agenda for the 5th Workshop on Innovations in Scholarly Communication, at which among others Herbert Van De Sompel presents OAI-ORE (video unfortunately only in a lousy format). On this occasion I added OAI-ORE to the still rather brief Wikipedia entry on OAI; additions and corrections are of course very welcome. The reference to social software services in connection with ORE is a bit too general for me, though. The technology is more likely to be relevant for projects such as TextGrid and plans within the EU’s FP7, but OAI-ORE is probably still too new and innovative for that, and as long as Germany does not manage to take part internationally (see the number of German members in the OAI-ORE community) I do not expect anything to happen here very soon. Some thoughts by Pete Johnston on ORE and Web Architecture can be found in this post from January.