en – Jakoblog

Data models age like parents

jakob — Thu, 15 Mar 2018 19:51:45 +0000

Denny Vrandečić, employed as an ontologist at Google, noticed that all six of six linked data applications linked to 8 years ago (IWB, Tabulator, Disko, Marbles, rdfbrowser2, and Zitgist) have disappeared or changed their calling syntax. This reminded me of a proverb about software and data:

software ages like fish, data ages like wine.

The original form of this saying seems to come from James Governor (@monkchips) who in 2007 derived it from an earlier phrase:

Hardware is like fish, operating systems are like wine.

The analogy of fishy applications and delightful data has been repeated and explained and criticized several times. I fully agree with the part about software rot but I doubt that data actually ages like wine (I’d prefer Whisky anyway). A more accurate simile may be „data ages like things you put into your crowded cellar and then forget about“.

Thinking a lot about data I found that data is less interesting than the structures and rules that shape and restrict data: data models, ontologies, schemas, forms etc. How do they age compared with software and data? I soon realized:

data models age like parents.

First they guide you, give good advice, and support you as best as they can. But at some point data begin to rebel against their models. Sooner or later parents become uncool, disconnected from current trends, outdated or even embarrassing. Eventually you have to accept their quaint peculiarities and live your own life. That’s how standards proliferate. Both ontologies and parents ultimately become weaker and need support. And in the end you have to let them go, sadly looking back.

(The analogy could be extended further, for instance data models might be frustrated when confronted with how actual data compares to their ideals, but that’s another story)

Wikidata documentation on the 2017 Hackathon in Vienna

jakob — Sun, 21 May 2017 13:21:39 +0000

At Wikimedia Hackathon 2017, a couple of volunteers sat together to work on the help pages of Wikidata. As part of that Wikidata documentation sprint, Ziko and I took a look at the Wikidata glossary. We identified several shortcomings and made a list of rules for how the glossary should look. The result is the glossary guidelines. Where the old glossary partly replicated Wikidata:Introduction, the new version aims to allow quick lookup of concepts. We already rewrote some entries of the glossary according to these guidelines but several entries are outdated and still need to be improved. We changed the structure of the glossary into a sortable table so it can be displayed as an alphabetical list in all languages. The entries can still be translated with the translation system (it took some time to get familiar with this feature).

We also created some missing help pages such as Help:Wikimedia and Help:Wikibase to explain general concepts with regard to Wikidata. Some of these concepts are already explained elsewhere but Wikidata needs at least short introductions especially written for Wikidata users.

Image taken by Andrew Lih (CC-BY-SA)

Introduction to Phabricator at Wikimedia Hackathon

jakob — Sat, 20 May 2017 07:44:30 +0000

This weekend I am participating in the Wikimedia Hackathon in Vienna. I mostly contribute to Wikidata related events and practice the phrase "long time no see", but I also look into some introductory talks.

In the late afternoon of day one I attended an introduction to the Phabricator project management tool given by André Klapper. Phabricator was introduced at the Wikimedia Foundation about three years ago to replace and unify Bugzilla and several other management tools.

Phabricator is much more than an issue tracker for software projects (although it is mainly used for this purpose by Wikimedia developers). In summary there are tasks, projects, and teams. Tasks can be tagged, assigned, followed, discussed, and organized with milestones and workboards. The latter are Kanban boards like those I know from Trello, waffle, and GitHub project boards.

Phabricator is Open Source so you can self-host it and add your own user management without having to pay for each new user and feature (I am looking at you, JIRA). Internally I would like to use Phabricator but for fully open projects I don’t see enough benefit compared to using GitHub.

P.S.: Wikimedia Hackathon is also organized with Phabricator. There is also a task for blogging about the event.

Some thoughts on IIIF and Metadata

jakob — Fri, 05 May 2017 20:40:59 +0000

Yesterday at the DINI AG KIM Workshop 2017, Martin Baumgartner and Stefanie Rühle gave an introduction to the International Image Interoperability Framework (IIIF) with a focus on metadata. I already knew that IIIF is a great technology for providing access to (especially large) images but I had not had a detailed look yet. The main part of IIIF is its Image API and I hope that all major media repositories (I am looking at you, Wikimedia Commons) will implement it. In addition the IIIF community has defined a „Presentation API“, a „Search API“, and an „Authentication API“. I understand the need for such additional APIs within the IIIF community, but I doubt that solving the underlying problems with their own standards (instead of reusing existing standards) is the right way to go. Standards should better „Do One Thing and Do It Well“ (Unix philosophy). If images are the „One Thing“ of IIIF, then Search and Authentication are a different matter.

In the workshop we only looked at parts of the Presentation API to see where metadata (creator, dates, places, provenance etc. and structural metadata such as lists and hierarchies) could be integrated into IIIF. Such metadata is already expressed in many other formats such as METS/MODS and TEI so the question is not whether to use IIIF or other metadata standards but how to connect IIIF with existing metadata standards. Taking a quick look at the Presentation API, I was surprised to find that the metadata element is explicitly not intended for additional metadata but only „to be displayed to the user“. The element contains an ordered list of key-value pairs that „might be used to convey the author of the work, information about its creation, a brief physical description, or ownership information, amongst other use cases“. At the same time the standard emphasizes that „there are no semantics conveyed by this information“. Hello, McFly? Without semantics conveyed it isn’t information! In particular there is no such thing as structured data (e.g. a list of key-value pairs) without semantics.

I think the design of the field metadata in IIIF is based on a common misconception about the nature of (meta)data, which I already wrote about elsewhere (sorry, German article; some background can be found in my PhD thesis and in the work of Ballsun-Stanton).

In a short discussion on Twitter Rob Sanderson (Getty) pointed out that the data format of the IIIF Presentation API to describe intellectual works (called a manifest) is expressed in JSON-LD, so it can be extended by other RDF statements. For instance the field „license“ is already defined with dcterms:rights. Adding a field „author“ for dcterms:creator only requires defining this field in the JSON-LD @context of a manifest. After some experimenting I found a possible way to connect the „meaningless“ metadata field with JSON-LD fields:

{
  "@context": [
    "http://iiif.io/api/presentation/2/context.json",
    {
      "author": "http://purl.org/dc/terms/creator",
      "bibo": "http://purl.org/ontology/bibo/"
    }
  ],
  "@id": "http://example.org/iiif/book1/manifest",
  "@type": ["sc:Manifest", "bibo:book"],
  "metadata": [
    {
      "label": "Author",
      "property": "http://purl.org/dc/terms/creator",
      "value": "Allen Smithee"
    },
    {
      "label": "License",
      "property": "http://purl.org/dc/terms/license",
      "value": "CC-BY 4.0"
    }
  ],
  "license": "http://creativecommons.org/licenses/by/4.0/",
  "author": {
    "@id": "http://www.wikidata.org/entity/Q734916",
    "label": "Allen Smithee"
  }
}

This solution requires an additional element property in the IIIF specification to connect a metadata field with its meaning. IIIF applications could then enrich the display of metadata fields for instance with links or additional translations. In JSON-LD some names such as „CC-BY 4.0“ and „Allen Smithee“ need to be given twice, but this is ok because normal names (in contrast to field names such as „Author“ and „License“) don’t have semantics.
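How an application might use such a property element can be sketched in a few lines of Python. This is only an illustration under assumptions: the key „property“ is the extension proposed here and not part of IIIF, the function names are made up, and only terms defined inline in the @context are resolved (the remote IIIF context is not fetched, which a real JSON-LD processor would do).

```python
# Sketch only: "property" is the proposed extra key, not part of the IIIF spec.
manifest = {
    "@context": [
        "http://iiif.io/api/presentation/2/context.json",
        {"author": "http://purl.org/dc/terms/creator"},
    ],
    "metadata": [
        {"label": "Author",
         "property": "http://purl.org/dc/terms/creator",
         "value": "Allen Smithee"},
        {"label": "License",
         "property": "http://purl.org/dc/terms/license",
         "value": "CC-BY 4.0"},
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "author": {"@id": "http://www.wikidata.org/entity/Q734916",
               "label": "Allen Smithee"},
}


def inline_terms(context):
    """Collect term -> IRI mappings from inline context objects only."""
    parts = context if isinstance(context, list) else [context]
    terms = {}
    for part in parts:
        if isinstance(part, dict):
            for term, iri in part.items():
                if isinstance(iri, str):
                    terms[term] = iri
    return terms


def link_metadata(manifest):
    """Pair each display entry with the JSON-LD field sharing its property IRI."""
    by_iri = {iri: term for term, iri in inline_terms(manifest.get("@context", [])).items()}
    linked = []
    for entry in manifest.get("metadata", []):
        term = by_iri.get(entry.get("property"))
        linked.append((entry["label"], entry["value"],
                       manifest.get(term) if term else None))
    return linked


linked = link_metadata(manifest)
for label, value, target in linked:
    print(label, value, target)
```

In this sketch the „Author“ entry is connected to the author field with its Wikidata URI, while „License“ stays unlinked because its property is not defined in the inline context.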

Abbreviated URIs with rdfns

jakob — Tue, 09 Sep 2014 09:26:13 +0000

Working with RDF and URIs can be annoying because URIs such as „http://purl.org/dc/elements/1.1/title“ are long and difficult to remember and type. Most RDF serializations make use of namespace prefixes to abbreviate URIs, for instance „dc“ is frequently used to abbreviate „http://purl.org/dc/elements/1.1/“ so „http://purl.org/dc/elements/1.1/title“ can be written as the qualified name „dc:title“. This simplifies working with URIs, but someone still has to remember mappings between prefixes and namespaces. Luckily there is a registry of common mappings at prefix.cc.

A few years ago I created the simple command line tool rdfns and a Perl library to look up URI namespace/prefix mappings. Meanwhile the program is also available as the Debian and Ubuntu package librdf-ns-perl. The newest version (not included in Debian yet) also supports reverse lookup to abbreviate a URI to a qualified name. Features of rdfns include:

look up namespaces (as RDF/Turtle, RDF/XML, SPARQL…)

$ rdfns foaf.ttl foaf.xmlns dbpedia.sparql foaf.json

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
xmlns:foaf="http://xmlns.com/foaf/0.1/"
PREFIX dbpedia: <http://dbpedia.org/resource/>
"foaf": "http://xmlns.com/foaf/0.1/"

expand a qualified name

$ rdfns dc:title

http://purl.org/dc/elements/1.1/title

look up a preferred prefix

$ rdfns http://www.w3.org/2003/01/geo/wgs84_pos#

geo

create a short qualified name of a URL

$ rdfns http://purl.org/dc/elements/1.1/title

dc:title
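The lookups above boil down to a table of prefixes and namespaces. The following Python sketch illustrates the idea with a tiny hand-written table; it is not how rdfns is implemented, the function names are made up, and the real registry at prefix.cc is much larger.

```python
# Tiny hand-written excerpt of a prefix table (assumption: illustration only).
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
}


def expand(qname):
    """Expand a qualified name such as 'dc:title' to a full URI."""
    prefix, _, local = qname.partition(":")
    return PREFIXES[prefix] + local


def abbreviate(uri):
    """Return a prefix or qualified name for a URI, or None if unknown."""
    matches = [p for p, ns in PREFIXES.items() if uri.startswith(ns)]
    if not matches:
        return None
    # Prefer the longest matching namespace.
    best = max(matches, key=lambda p: len(PREFIXES[p]))
    local = uri[len(PREFIXES[best]):]
    return f"{best}:{local}" if local else best


print(expand("dc:title"))
print(abbreviate("http://www.w3.org/2003/01/geo/wgs84_pos#"))
print(abbreviate("http://purl.org/dc/elements/1.1/title"))
```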

I use RDF-NS for all RDF processing to improve readability and to avoid typing long URIs. For instance Catmandu::RDF can be used to parse RDF into a very concise data structure:

$ catmandu convert RDF --file rdfdata.ttl to YAML

Testing command line apps with App::Cmd

jakob — Fri, 01 Nov 2013 08:49:19 +0000

This posting has also been published at blogs.perl.org.

Ricardo Signes’ App::Cmd has been praised a lot so I gave it a try for my recent command line app. In summary, the module is great although I missed some minor features and documentation (reminder to all: if you miss some feature in a CPAN module, don’t create yet another module but try to improve the existing one!). One feature I like a lot is how App::Cmd facilitates writing tests for command line apps. After having written a short wrapper around App::Cmd::Tester, my formerly ugly unit tests look very simple and clean. Have a look at this example:

use Test::More;
use App::PAIA::Tester;

new_paia_test;

paia qw(config);
is stdout, "{}\n";
is error, undef;

paia qw(config -c x.json --verbose);
is error, "failed to open config file x.json\n";
ok exit_code; 

paia qw(config --config x.json --verbose foo bar);
is output, "# saved config file x.json\n";

paia qw(config foo bar);
paia qw(config base http://example.org/);
is exit_code, 0;
is output, '';

paia qw(config);
is_deeply stdout_json, { 
    base => 'http://example.org/',
    foo => 'bar',
}, "get full config";

done_paia_test;

The application is called paia: that’s how it is called at the command line and that’s how it is simply called as a function in the tests. The wrapper class (here: App::PAIA::Tester) creates a singleton App::Cmd::Tester::Result object and exports its methods (stdout, stderr, exit_code…). This alone makes the test much more readable. The wrapper further exports two methods to set up a testing environment (new_paia_test) and to finish testing (done_paia_test). In my case the setup creates an empty temporary directory; other applications might clean up environment variables etc. Depending on your application you might also add some handy functions like stdout_json to parse the app’s output into a form that can be tested more easily.
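The same wrapper pattern is not specific to Perl. As a rough illustration, here is a Python sketch of the idea: a module-level singleton stores the result of the most recent run and plain functions expose its parts. The application toy_app and all function names are hypothetical and have nothing to do with App::Cmd or paia.

```python
import contextlib
import io
import json


def toy_app(argv):
    """Hypothetical stand-in for a command line app: echoes its arguments as JSON."""
    if not argv:
        raise SystemExit("missing command")
    print(json.dumps({"command": argv[0], "args": argv[1:]}))


_last = {}  # singleton holding the result of the most recent run


def run(*argv):
    """Run the app with captured output and remember the result."""
    out, err = io.StringIO(), io.StringIO()
    code = 0
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
        try:
            toy_app(list(argv))
        except SystemExit as exc:
            if exc.code is None or isinstance(exc.code, int):
                code = exc.code or 0
            else:
                # A string exit code is treated as an error message.
                err.write(f"{exc.code}\n")
                code = 1
    _last.update(stdout=out.getvalue(), stderr=err.getvalue(), exit_code=code)


def stdout():
    return _last["stdout"]


def stderr():
    return _last["stderr"]


def exit_code():
    return _last["exit_code"]


def stdout_json():
    """Parse the captured standard output as JSON."""
    return json.loads(_last["stdout"])


run("config", "foo", "bar")
assert exit_code() == 0
assert stdout_json() == {"command": "config", "args": ["foo", "bar"]}
```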

My PhD thesis about data

jakob — Mon, 23 Sep 2013 07:03:55 +0000

I have finally received paper copies of my PhD thesis „Describing Data Patterns“, published and printed via CreateSpace. The full PDF has already been archived as CC-BY-SA, but a paper print may still be nice and handier (it’s printed as a small paperback instead of the large A4 PDF). You can get a copy for 12.80€ or 12.24€ via Amazon (ISBN 1-4909-3186-4).

I also set up a little website at aboutdata.org. The site contains an HTML view of the pattern language that I developed as one result of the thesis.

I am sorry for not having written the thesis in Pandoc Markdown but in LaTeX (source code available at GitHub), so there is no EPUB/HTML version.

On the way to a library ontology

jakob — Thu, 11 Apr 2013 13:02:50 +0000

I have been working for some years on the specification and implementation of several APIs and exchange formats for data used in and provided by libraries. Unfortunately most existing library standards are either fuzzy, complex, and misused (such as MARC21), or limited to bibliographic data or authority data, or both. Libraries, however, are much more than bibliographic data – they involve library patrons, library buildings, library services, library holdings, library databases etc.

During the work on formats and APIs for these parts of the library world, Patrons Account Information API (PAIA) being the newest piece, I found myself more and more on the way to a whole library ontology. The idea of a library ontology started in 2009 (now moved to this location) but designing such a broad data model from the bottom up would surely have led to yet another complex, impractical and unused library standard. Meanwhile there are several smaller ontologies for parts of the library world, to be combined and used as Linked Open Data.

In my opinion, ontologies, RDF, Semantic Web, Linked Data and all the buzz are overrated, but they include some opportunities for clean data modeling and data integration, which one rarely finds in library data. For this reason I try to design all APIs and formats to be at least compatible with RDF. For instance the Document Availability Information API (DAIA), created in 2008 (and now being slightly redesigned for version 1.0) can be accessed in XML and in JSON format, and both can fully be mapped to RDF. Other micro-ontologies include:

Document Service Ontology (DSO) defines typical document-related services such as loan, presentation, and digitization
Simple Service Status Ontology (SSSO) defines a service instance as a kind of event that connects a service provider (e.g. a library) with a service consumer (e.g. a library patron). SSSO further defines typical service status (e.g. reserved, prepared, executed…) and limitations of a service (e.g. a waiting queue or a delay).
Patrons Account Information API (PAIA) will include a mapping to RDF to express basic patron information, fees, and a list of current services in a patron account, based on SSSO and DSO.
Document Availability Information API (DAIA) includes a mapping to RDF to express the current availability of library holdings for selected services. See here for the current draft.
A holdings ontology should define properties to relate holdings (or parts of holdings) to abstract documents and editions and to holding institutions.
GBV Ontology contains several concepts and relations used in GBV library network that do not fit into other ontologies (yet).
One might further create a database ontology to describe library databases with their provider, extent, APIs etc. – right now we use the GBV ontology for this purpose. Is there anything to reuse instead of creating yet another ontology?!
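To make the SSSO idea of a service instance a bit more concrete, here is a loose Python sketch. It is not taken from the ontology: the class and attribute names are made up, only the three status values named above are included, and the example values are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Only the status values named in the text; SSSO defines more.
    RESERVED = "reserved"
    PREPARED = "prepared"
    EXECUTED = "executed"


@dataclass
class ServiceEvent:
    """A service instance seen as an event between provider and consumer."""
    provider: str                      # e.g. a library
    consumer: str                      # e.g. a library patron
    service: str                       # e.g. "loan", a DSO-style service type
    status: Status = Status.RESERVED
    limitations: list = field(default_factory=list)  # e.g. "waiting queue"


# Hypothetical example: a reserved loan that has been prepared for pickup.
event = ServiceEvent("Example Library", "patron 123", "loan",
                     limitations=["waiting queue"])
event.status = Status.PREPARED
print(event)
```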

The next step will probably be the creation of a small holdings ontology that nicely fits with the other micro-ontologies. This ontology should be aligned or compatible with the BIBFRAME initiative, other ontologies such as Schema.org, and existing holding formats, without becoming too complex. The German initiative DINI-KIM has just launched a working group to define such a holding format or ontology.

Dead End Electronic Resource Citation (ERC)

jakob — Fri, 29 Mar 2013 09:51:26 +0000

Tidying up my PhD notes, I found this short rant about „Electronic Resource Citation“. I have not used it anywhere, so I publish it here, licensed under CC-BY-SA.

Electronic Resource Citation (ERC) was introduced by John Kunze with a presentation at the International Conference on Dublin Core and Metadata Applications 2001 and with a paper in the Journal of Digital Information, Vol. 2, No 2 (2002). Kunze cited his paper in a call for an ERC Interest Group within the Dublin Core Metadata Initiative (DCMI) on the PERL4LIB mailing list, giving the following example of an ERC:

erc:  Kunze, John A. | A Metadata Kernel for Electronic Permanence
      | 20011106 | http://jodi.ecs.soton.ac.uk/Articles/v02/i02/Kunze/

An ERC is a minimal „kernel“ metadata record that consists of four elements: who, what, when and where. In the given example they are:

who:   Kunze, John A.
what:  A Metadata Kernel for Electronic Permanence
when:  20011106
where: http://jodi.ecs.soton.ac.uk/Articles/v02/i02/Kunze/
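The mapping from the short form to the four elements can be sketched in a few lines of Python. This is only an illustration of the example above, assuming that values are separated by „|“ and given in the order who, what, when, where; the function name is made up and the sketch does not implement the actual ERC specification.

```python
def parse_erc(record):
    """Split a short-form ERC record into its four kernel elements."""
    label, _, body = record.partition(":")
    if label.strip() != "erc":
        raise ValueError("not an ERC record")
    # Normalize whitespace, including line breaks inside the record.
    values = [" ".join(part.split()) for part in body.split("|")]
    return dict(zip(("who", "what", "when", "where"), values))


record = """erc:  Kunze, John A. | A Metadata Kernel for Electronic Permanence
      | 20011106 | http://jodi.ecs.soton.ac.uk/Articles/v02/i02/Kunze/"""

erc = parse_erc(record)
for key, value in erc.items():
    print(f"{key}: {value}")
```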

Ironically the given URL is obsolete; the host ‚jodi.ecs.soton.ac.uk‘ does not even exist anymore. The ERC is pretty useless if it just uses a fragile URL to cite a resource. How about some value that does not change over time, e.g.:

where: Journal of Digital Information, Volume 2 Issue 2

As the ‚where‘ element is defined as „a location or machine-oriented identifier“, one could also use stable identifiers:

where: ISSN 1368-7506, Article No. 81

Both the ISSN and the article number 81 are much more identifiers than URLs. Citing a URL is more like

where: at the desk in the little reading room of my library

By the way the current location is http://www.rice.edu/perl4lib/archives/2002-09/msg00017.html – but who knows whether Texas A&M University will still host the journal at this URL in 20 years?

There are some interesting ideas in the original ERC proposal (different kinds of missing values, TEMPER date values, the four questions etc.), but its specification and implementation are just ridiculous and lack references to current technology (you know that you are doing something wrong in a specification if you start to define your own encodings for characters, dates etc. instead of concentrating on your core subject and referring to existing specifications for the rest). The current draft (2010) is a typical example of badly mixing modeling and encoding issues and of losing touch with existing, established data standards.

In addition to problems at the „low level“ of encoding, the „high level“ of conceptual modeling lacks appropriate references. What about the relation of ERC concepts to models such as FRBR and CIDOC-CRM? Why are ‚who‘, ‚when‘, ‚where‘, ‚what‘ the important metadata fields (in many cases the most interesting question is ‚why‘)? How about Ranganathan’s colon classification with personality, matter, energy, space, and time?

In summary the motivation behind ERC contains some good ideas, but its form is misdirected.

Access to library accounts for better user experience

jakob — Fri, 08 Feb 2013 09:10:03 +0000

I just stumbled upon ReadersFirst, a coalition of (public) libraries that call for a better user experience for library patrons, especially to access e-books. The libraries regret that

the products currently offered by e-content distributors, the middlemen from whom libraries buy e-books, create a fragmented, disjointed and cumbersome user experience.

One of the explicit goals of ReadersFirst is to urge providers of e-content and integrated library systems to offer systems that allow users to

Place holds, check-out items, view availability, manage fines and receive communications within individual library catalogs or in the venue the library believes will serve them best, without having to visit separate websites.

In a summary of the first ReadersFirst meeting on January 28, the president of Queens Library (NY) is quoted with the following request:

The reader should be able to look at their library account and see what they have borrowed regardless of the vendor that supplied the ebook.

This goal matches well with my activity at GBV: as part of a project to implement a mobile library app, I designed an API to access library accounts. The Patrons Account Information API (PAIA) is currently being implemented and tested by two independent developers. It will also be used to provide a better user experience in VuFind discovery interfaces.

During the research for PAIA I was surprised by the lack of existing methods to access library patron accounts. Some library systems do not even provide an internal API to connect to the loan system, not to speak of a public API that could directly be used by patrons and third parties. The only example I could find was York University Libraries with a simple, XML-based, read-only API. This lack of public APIs to library patron accounts is disappointing, given that it’s almost ten years after the buzz around Web 2.0, service oriented architecture, and mashups. All major providers of web applications (Google, Twitter, Facebook, StackExchange, GitHub etc.) support access to user accounts via APIs.

The Patrons Account Information API will hopefully fill this gap with defined methods to place holds and to view checked out items and fines. PAIA is agnostic to specific library systems, aligned with similar APIs such as those listed above, and designed with RDF in mind (without any need to bother with RDF, apart from the requirement to use URIs as identifiers). Feedback and implementations are very welcome!