Abbreviated URIs with rdfns

9. September 2014 um 11:26 Keine Kommentare

Working with RDF and URIs can be annoying because URIs such as „http://purl.org/dc/elements/1.1/title“ are long and difficult to remember and type. Most RDF serializations make use of namespace prefixes to abbreviate URIs, for instance „dc“ is frequently used to abbreviate „http://purl.org/dc/elements/1.1/“ so „http://purl.org/dc/elements/1.1/title“ can be written as qualified name „dc:title„. This simplifies working with URIs, but someone still has to remember mappings between prefixes and namespaces. Luckily there is a registry of common mappings at prefix.cc.

A few years ago I created the simple command line tool rdfns and a Perl library to look up URI namespace/prefix mappings. Meanwhile the program is also available as Debian and Ubuntu package librdf-ns-perl. The newest version (not included in Debian yet) also supports reverse lookup to abbreviate an URI to a qualified name. Features of rdfns include:

look up namespaces (as RDF/Turtle, RDF/XML, SPARQL…)

$ rdfns foaf.ttl foaf.xmlns dbpedia.sparql foaf.json

@prefix foaf:  .
xmlns:foaf="http://xmlns.com/foaf/0.1/"
PREFIX dbpedia: 
"foaf": "http://xmlns.com/foaf/0.1/"

expand a qualified name

$ rdfns dc:title

http://purl.org/dc/elements/1.1/title

lookup a preferred prefix

$ rdfns http://www.w3.org/2003/01/geo/wgs84_pos#

geo

create a short qualified name of an URL

$ rdfns http://purl.org/dc/elements/1.1/title

dc:title

I use RDF-NS for all RDF processing to improve readability and to avoid typing long URIs. For instance Catmandu::RDF can be used to parse RDF into a very concise data structure:

$ catmandu convert RDF --file rdfdata.ttl to YAML

Testing command line apps with App::Cmd

1. November 2013 um 10:49 1 Kommentar

This posting has also been published at blogs.perl.org.

Ricardo Signes‘ App::Cmd has been praised a lot so I gave it a try for my recent command line app. In summary, the module is great although I missed some minor features and documentation (reminder to all: if you miss some feature in a CPAN module, don’t create yet another module but try to improve the existing one!). One feature I like a lot is how App::Cmd facilitates writing tests for command line apps. After having written a short wrapper around App::Cmd::Tester my formerly ugly unit tests look very simple and clean. Have a look at this example:

use Test::More;
use App::PAIA::Tester;

new_paia_test;

paia qw(config);
is stdout, "{}\n";
is error, undef;

paia qw(config -c x.json --verbose);
is error, "failed to open config file x.json\n";
ok exit_code; 

paia qw(config --config x.json --verbose foo bar);
is output, "# saved config file x.json\n";

paia qw(config foo bar);
paia qw(config base http://example.org/);
is exit_code, 0;
is output, '';

paia qw(config);
is_deeply stdout_json, { 
    base => 'http://example.org/',
    foo => 'bar',
}, "get full config"

done_paia_test;

The application is called paia – that’s how it called at command line and that’s how it is simply called as function in the tests. The wrapper class (here: App::PAIA::Tester) creates a singleton App::Cmd::Tester::Result object and exports its methods (stdout, stderr, exit_code…). This alone makes the test much more readable. The wrapper further exports two methods to set up a testing environment (new_paia_test) and to finish testing (done_paia_test). In my case the setup creates an empty temporary directory, other applications might clean up environment variables etc. Depending on your application you might also add some handy functions like stdout_json to parse the app’s output in a form that can better be tested.

URI namespace lookup with prefix.cc and RDF::NS

3. November 2011 um 17:13 Keine Kommentare

Probably the best feature of RDF is that it forces you to use Uniform Resource Identifiers (URI) instead of private, local identifiers which only make sense in a some context. URIs are long and cumbersome to type, so popular URIs are abbreviated with namespaces prefixes. For instance foaf:Person is expanded to http://xmlns.com/foaf/0.1/Person, once you have defined prefix foaf for namespace http://xmlns.com/foaf/0.1/. In theory URI prefixes in RDF are arbitrary (in contrast to XML where prefixes can matter, in contrast to popular belief). In practice people prefer to agree to one or two known prefixes for common URI namespaces (unless they want to obfuscate human readers of RDF data).

So URI prefixes and namespaces and are vital for handling RDF. However, you still have to define them in almost every file and application. In the end people have copy & paste the same prefix definitions again and again. Fortunately Richard Cyganiak created a registry of popular URI namespaces, called prefix.cc (it’s open source), so people at least know where to copy & paste from. I had enough of copying the same URI prefixes from prefix.cc over and over again, so I created a Perl module that includes snapshots of the prefix.cc database. It includes a simple command line client, that is installed automatically:

  $ sudo cpanm RDF::NS
  $ rdfns rdf,foaf.ttl
  @prefix foaf: <http: //xmlns.com/foaf/0.1/> .
  @prefix rdf: <http: //www.w3.org/1999/02/22-rdf-syntax-ns#> .

In your Perl code, you can use it like this:

  use RDF::NS
  my $NS = RDF::NS->new('20111102');
  $NS->foaf_Person; # returns "http://xmlns.com/foaf/0.1/Person"

If you miss an URI prefix, just add it at http://prefix.cc, and will be included in the next release.

How to encode the availability of documents

23. Oktober 2009 um 12:50 1 Kommentar

Since almost a year I work on a simple encoding format and API to just get the current (!) availability status of documents in libraries. Together with Reh Uwe (hebis network) and Anne Christensen (beluga project) we created the Document Availability Information API (DAIA) which is defined as data model with encoding in XML and JSON (whichever you prefer).

This week I finished and published a reference implementation of the DAIA protocol as open source Perl-module at CPAN. The implementation includes a simple DAIA validator and converter. A public installation of this validator is also available. The next tasks include implementing server and client components for several ILS software. Every library has its own special rules and schemas – Jonathan Rochkind already wrote about the problems to implement DAIA because of ILS complexity. We cannot erase this complexity by magic (unless we refactor and clean the ILS), but at least we can try to map it to a common data model – which DAIA provides.

Core components of the DAIA data model

With the DAIA Perl package you can concentrate on writing wrappers from your library systems to DAIA and easily consume and evaluate DAIA-encoded information. Why should everyone write its own routines to grab for instance the HTML OPAC output and parse availability status? One mapping to DAIA should fit most needs, so others can build upon. DAIA can not only be helpful to connect different library systems, but also to create mashups and services like „Show me on a map, where a given book is currently hold and available“ or „Send me a tweet if a given books in my library is available again“ – If you have more cool ideas for client applications, just let me know!

In the context of ILS Discovery Interface Task Force and their official recommendation DAIA implements the GetAvailability method (section 6.3.1). There are numerous APIs for several tasks in library systems (SRU/SRW, Z39.50, OpenSearch, OAI-PMH, Atom, unAPI etc.) but there was no open, usable standard way just to query whether a copy of given publication – for instance book – is available in a library, in which department, whether you can loan it or only use it in the library, whether you can directly get it online, or how long it will probably take until it is available again (yes, I looked at alternatives like Z39.50, ISO 20775, NCIP, SLNP etc. but they were hardly defined, documented, implemented and usable freely on the Web). I hope that DAIA is easy enough so non-librarians can make use of it if libraries provide an API to their system with DAIA. Extensions to DAIA can be discussed for instance in Code4Lib Wiki but I’d prefer to start with this basic, predefined services:

  • presentation: an item can be used inside the institution (in their rooms, in their intranet etc.).
  • loan: an item can be used outside of the institution (by lending or online access).
  • interloan: an tem can be used mediated by another institution. That means you do not have to interact with the institution that was queried for this item. This include interlibrary loan as well as public online ressources that are not hosted or made available by the queried institution.
  • openaccess: an item can be used imediately without any restrictions by the institution, you don’t even have to give it back. This applies for Open Access publications and free copies.

Zwei Jahre PICA::Record

20. Juli 2009 um 17:06 2 Kommentare

Heute vor zwei Jahren habe ich die erste öffentliche Version von PICA::Record auf CPAN hochgeladen. Das Comprehensive Perl Archive Network (CPAN) ist ein umfassendes Repository von Open-Source-Modulen für die Programmiersprache Perl. Mit Perl habe ich erst relativ spät angefangen, die die Sprache nicht sauber definiert und für ihre mögliche Unleserlichkeit bekannt ist. Andererseits trifft zu, was Larry Wall, der Autor von Perl 1999 sagte:

The very fact that it’s possible to write messy programs in Perl is also what makes it possible to write programs that are cleaner in Perl than they could ever be in a language that attempts to enforce cleanliness.

Die Tatsache, dass Programme (und damit ist hier der Quellcode gemeint) als „schön“ bezeichnet werden können zeigt, dass Programmieren auch als eine Kunst angesehen werden kann – und die Bühne für Perl ist dabei CPAN 🙂 Übrigens habe ich bislang noch keine schöne kommerzielle Bibliothekssoftware gesehen – aber Bibliotheken geht es beim Erwerb von Software ja auch weniger darum, dass sie etwas schönes und sinnvolles mit der Software anfangen können, sondern darum dass sie die Verantwortung an einen Softwarehersteller abschieben können.

Das Modul PICA::Record hat wahrscheinlich nur einen ziemlich begrenzten Anwenderkreis, da das PICA+ Datenformat sogar bei vielen Bibliothekaren eher unbekannt ist. Inzwischen ist wahrscheinlich PICA::Record mit allen Beschreibungen, Tests und Beispielen selbst die umfangreichste Dokumentation zu PICA+. Seit dem Bibliothekstag 2009 gibt es auch eine Kurzbeschreibung als Faltblatt auf Deutsch („Verarbeiten von PICA+ Daten mit PICA::Record„). Die aktuelle Version enthält als neuestes die Möglichkeit, PICA-Daten in einer SQL-Datenbank (bislang: SQLite) zu speichern (PICA::SQLiteSTore) und über ein Wiki (PICA+Wiki) darauf zuzugreifen. Für kommende Versionen ist der Ausbau dieses „CMS-Light“, einer Erweiterung der SOAP-API zum Lesen und Schreiben von Datensätzen sowie eine bessere Unterstützung von Lokaldaten geplant.

Sicher gibt es schönere Programmiersprachen als Perl, aber wenn schon mehr Personen im Bibliotheksumfeld programmieren (oder zumindest skripten) lernen – was unbedingt notwendig ist – könnte Perl die richtige Wahl sein, da sich mit PICA::Record bereits nach kurzer Zeit praxistaugliche Ergebnisse erzielen lassen. Zum deutschsprachigen Austausch zwischen Entwicklern im Bibliotheksbereich gibt es übrigens die Mailingliste bibcode.

Ariadne article about SeeAlso linkserver protocol

13. November 2008 um 11:32 Keine Kommentare

The current issue of Ariadne which has just been published contains an article about the „SeeAlso“ linkserver protocol: Jakob Voß: „SeeAlso: A Simple Linkserver Protocol„, Ariadne Issue 57, 2008.

SeeAlso combines OpenSearch and unAPI to a simple API that delivers list of links. You can use it for dynamically embedding links to recommendations, reviews, current availability, reviews, search completion suggestions, etc. It’s no rocket science but I found a well defined API with reusable server and client better then having to hack a special format and lookup syntax for each single purpose.

The reference client is written in JavaScript and the reference server is written in Perl. Implementing it in any other language should not be complicated. I’d be happy to get some feedback either in form of code, applications, or criticism. 🙂 I noted that SeeAlso::Server is the only implementation of unAPI at CPAN so far – if someone is interested, we could extract parts into an independent unAPI package. The WWW::OpenSearch::Description package is also worth to consider for use in SeeAlso::Server.

Citation parsing

24. Januar 2008 um 19:09 6 Kommentare

Citation Analysis is used to rate authors (problematic) and to find interesting papers (good idea). Citations of papers at the famous arXiv.org preprint server are analysed by CiteBase which is very useful. Unluckily it is buggy and does not alway work. I really wonder why the full text of a paper is parsed instead of using the BibTeX source. The citation parser ParaCite has been developed in the Open Citation Project. Since then it seems to be more or less abandoned. But it’s open source so you can test you papers before uploading and one could take the suiting parts to build a better citation parser. I found out that this way you can extract citations out of a document in $file (for instance a pdf) with perl (the needed modules are available at CPAN):

my $parser = Biblio::Citation::Parser::Citebase->new;
my $content = Biblio::Document::Parser::Utils::get_content( $file );
my $doc_parser = Biblio::Document::Parser::Brody->new;
my @references = $doc_parser->parse($content);

for (my $i=0; $i < @references; $i++) {
    my $metadata = $parser->parse( $references[$i] );
    print '[' . ($i+1) . '] ' . Dumper( $metadata ) . "\n";
}

In the documented that I tested there are almost always parsing errors, but better then nothing. I wonder what CiteSeer uses to extract citations? There is more action in citation parsing in the Zotero project – even an IDE called Scaffold to create new „translators“ that extract bibliographic data out of webpages. Another playing ground is Wikipedia which contains a growing number of references. And of course there are the commericla citation indexes like SCI. I thought to use citation data for additional catalog enrichement (in addition to ISBN2Wikipedia) but quality of data seems to be too low and identifiers are missing.

P.S: Right after writing this, I found Alf Eaton’s experiment with collecting together the conversations around a paper from various academic, news, blog and other discussion channels – as soon as you have identifiers (ISBN, URL, DOI, PMID…) the world gets connected 🙂

P.P.S: ParsCit seems to be a good new reference string parsing package (open source, written in Perl).

P.P.S: Konstantin Baierer manages a bibliography on citation parsing for his parser Citation::Multi::Parser.

First Draft of SeeAlso Simple linkserver API

14. Januar 2008 um 23:45 2 Kommentare

I finally finished the first draft of the SeeAlso Simple Specification SeeAlso. The webservice API for simple linkservers is based on OpenSearch Suggestions. It will be completed by the „SeeAlso Full Specification“ which adds unAPI and OpenSearch Description documents. The service is already implemented and running at the Union Catalog of GBV, see my earlier posting or simply have a look at this example. The same service is installed for testing at the Wikimedia Toolserver. For Wikipedia I drafted a different client that can be tested via a Bookmarklet: Drag the Link „ISBN2W“ to you bookmark toolbar, visit an English or German Wikipedia article with ISBNs on it, click the Bookmarklet link in your toolbar and hover the the ISBNs (that will become yellow) with the mouse. German Wikipedia users can also enable the service by adding a line of JavaScript to their user profile.

An implementation (SeeAlso Simple and SeeAlso Full) in Perl is available at CPAN (there was neither unAPI nor OpenSearch Suggestions, so I implemented both). Please note that ISBN to Wikipedia is only one example how the API can be used. In my opinion the concept of a linkserver is highly undervalued but will become more important again (for instance as a lightweigth alternative to simple SPARQL queries). Feedback and usage is welcome!

GBV bietet COinS an

1. November 2007 um 16:12 3 Kommentare

Wie ich eben in INETBIB schrieb, beginnt der GBV jetzt damit, für seine Kataloge COinS anzubieten. Damit können bibliographische Metadaten einfach aus dem Katalog in eigene Anwendungen übernommen werden. Zu den populären Anwendungen zur Literaturverwaltung gehört das Firefox-Plugin Zotero, welches soeben ein Jahr alt georden und in der Version 1.0 herausgekommen ist. Wie Zotero zusammen mit anderen Social-Software-Werkzeugen und dem Firefox-Abkömmling Flock in eine hocheffizienten Arbeitsplatz integriert werden kann, zeigt dieses Video. Eine anderes Bibliographiewerkzeug, dass COinS unterstützen soll ist Citavi.

Wie Patrick bemerkte ist es bereits seit Anfang August möglich, mit Zotero Daten aus dem GBV zu übernehmen. Dazu hatte der Zotero-Entwickler Ramesh Srigiriraju einen „Scraper“ in JavaScript geschrieben, der das RIS-Format interpretieren kann, welches von der PSI-Katalogsoftware mit dem Parameter PRS=RIS exportiert wird – wer genauer sehen möchte, wie so etwas funktioniert, sollte im Quelltext nach „GSO“ suchen.

Das Angebot von Daten via COinS ist wesentlich leichter nutzbar – sucht im im GSO-Katalog einfach mal nach einem Titel und schaut dann im HTML-Quelltext nach der Zeichenkette „Z3988“. Damit das Angebot auch in lokalen OPACs nutzbar ist, sollten meiner Meinung nach die einzelnen GBV-Bibliotheken erstmal selber Zotero ausprobieren und überlegen, wie sie das Angebot ihren Benutzern am Besten bekannt machen können.

Für die ganz harten Bibliothekshacker hier ein Stück Perl, um COinS für eigene Anwendungen aus beliebigen Webseiten auszulesen:

use HTML::TreeBuilder::XPath;
use LWP::UserAgent;
use URI::OpenURL;
use URI;
my $url = URI->new(shift @ARGV) or die "Please specify an URL!";
my $ua = LWP::UserAgent->new();
my $res = $ua->get($url);
my $html = $res->decoded_content;
my $tree= HTML::TreeBuilder::XPath->new_from_content( $html );
my @coins = $tree->findnodes( '//span[@class="Z3988"]/@title');
@coins = map { URI::OpenURL->new("?" . $_->getValue() ); } (@coins);
foreach my $c (@coins) {
  print $c->canonical->dump() . "\n";
}

GBV-Verbunddaten weiterverarbeiten mit SRU-Schnittstelle und Perl

20. August 2007 um 14:58 2 Kommentare

Ende Juli habe ich im Rahmen meiner Arbeit bei der VZG mit PICA::Record eine Perl-API zur Verarbeitung von PICA+-Daten veröffentlicht. PICA+ ist das interne Katalogformat von PICA-Bibliothekssystemen, die neben dem GBV und den Verbünden HeBIS und SWB auch bei der Deutschen Nationalbibliothek und für Zentralsysteme in den Niederlanden, Australien, Frankreich und England eingesetzt werden. Inzwischen ist PICA übrigens eine vollständige OCLC-Tochterfirma. Mehr zum PICA+ Format findet sich in den jeweiligen Katalogisierungsrichtlinien, zum Beispiel beim GBV und in dieser kurzen Einführung.

PICA::Record ist sozusagen ein Pendant zu Mike Rylanders CPAN-Modul MARC::Record, das bereits seit einigen Jahren bei MARC-Anwendern genutzt und in der Mailingliste perl4lib diskutiert wird. Feedback in Form von Anwendungen, Ideen, Bugreports etc. ist sehr willkommen – zum Beispiel öffentlich bei der Dokumentation im GBV-Wiki. Neben der Erzeugung von Datensätzen in PICA+, um diese in Katalogsysteme einzuspielen, eignet sich PICA::Record auch für die umgekehrte Richtung. Dazu ist ein einfacher SRU-Client implementiert; die entsprechende SRU-Schnittstelle bietet der GBV seit einiger Zeit inoffiziell und nun auch öffentlich an. Für Bibliotheks-Mashups ist die SRU-Schnittstelle ein Baustein und die Perl-API ein mögliches Bindemittel. Natürlich kann der Webservice auch mit anderen Methoden als mit Perl abgefragt werden.

Beispiele und Anleitungen gibt es unter Anderem in der API-Dokumentation, im Quelltext oder hier.