Embedded diagrams and pandoc

24. Januar 2012 um 13:02 Keine Kommentare

If you don’t know John MacFarlane’s Pandoc, the “Swiss army knive of document formats”, you should definitely give it a try! Pandoc’s abstract document model and its serialization in an extended variant of Markdown markup let you focus on the structure and content of a text instead of dealing with formats and user interfaces. In my opinion pandoc is the best tool for document creation invented since (La)TeX (moreover pandoc is a good argument to finally learn programming in Haskell) Images in pandoc markdown documents, however, are only referenced by their file. This requires some preprocessing if you want to create different files for different document formats, especially bitmap images and vector images. So I hacked a little preprocessing script that let’s you embed images in pandoc’s markup language. For instance you write

~~~~ {.dot .Grankdir:LR}
digraph {
A -> B -> C;
A -> C;
}
~~~~

and you get

or based on rdfdot you write

~~~~ {.rdfdot}
@prefix foaf: .
@base .
foaf:name “Alice” ;
foaf:knows [ foaf:name "Bob" ] .
~~~~

and you get

A detailed description is included in the manual which has been transformed automatically to HTML and to PDF. Compare both documents to see that HTML includes PNG images and PDF contains vector images!

Feel free to reuse and modify the script, for instance by adding more diagram types! For instance how about ASCII tabs and ABC notation if you write about music?

Linked local library data simplified

10. Januar 2012 um 14:53 1 Kommentar

A few days ago Lukas Koster wrote an article about local library linked data. He argues that bibliographic data from libraries data as linked data is not “the most interesting type of data that libraries can provide”. Instead “library data that is really unique and interesting is administrative information about holdings and circulation”. So libraries “should focus on holdings and circulation data, and for the rest link to available bibliographic metadata as much as possible.” I fully agree with this statements but not with the exact method how do accomplish the publication of local library data.

Among other project, Koster points to LibraryCloud to aggregate and deliver library metadata, but it looks like they reinvent yet more wheels in form of their own APIS and formats for search and for bibliographic description. Maybe I am wrong about this project, as they just started to collect holding and circulation data.

At the recent Semantic Web in Bibliotheken conference, Magnus Pfeffer gave a presentation about “Publishing and consuming library loan information as linked open data” (see slides) and I talked about a Simplified Ontology for Bibliographic Resources (SOBR) which is mainly based on the DAIA data model. We are going to align both data models and I hope that the next libraries will first look at these existing solutions instead of inventing yet another data format or ontology. Koster’s proposal, however, looks like such another solution: he argues that “we need an extra explicit level to link physical Items owned by the library or online subscriptions of the library to the appropriate shared network level” and suggests to introduce a “holding” level. So there would be five levels of description:

  • Work
  • Expression
  • Manifestation
  • Holding
  • Item

Apart from the fact that at least one of Work, Expression, Manifestation is dispensable, I disagree with a Holding level above the Item. My current model consists of at most three levels of documents:

  • document as abstract work (frbr:Work, schema:CreativeWork…)
  • bibliographic document (frbr:Manifestation, sobr:Edition…)
  • item as concrete single copy (frbr:Item…)

The term “level” is misleading because these classes are not disjoint. I depicted their relationship in a simple Venn diagram:

For local library data, we are interested in single items, which are copies of general documents or editions. Where do Koster’s “holding” entities fit into this model? He writes “a specific Holding in this way would indicate that a specific library has one or more copies (Items) of a specific edition of a work (Manifestation), or offers access to an online digital article by way of a subscription.” The core concepts as I read them are:

  • “one or more copies (Items)” = frbr:Item
  • “specific edition of a work (Manifestation)” = sobr:Edition or frbr:Manifestation
  • “has one [...] or offer access to” = ???

Instead of creating another entity for holdings, you can express the ability “to have one or offer access to” by DAIA Services. The class daia:Service can be used for an unspecified service and more specific subclasses such as loan, presentation, and openaccess can be used if more is known. Here is a real example with all “levels”:


<http://dbpedia.org/resource/Tractatus_Logico-Philosophicus>
    a bibo:Book ;
    daia:edition <urn:nbn:de:eki:GBV128382600> ;
    daia:exemplar
        <http://uri.gbv.de/document/opac-de-23:epn:266449999> .

<urn:nbn:de:eki:GBV128382600> a bibo:Book ;
    daia:exemplar
        <http://uri.gbv.de/document/opac-de-23:epn:266449999> .

<http://uri.gbv.de/document/opac-de-23:epn:266449999>
    a bibo:Book, daia:Item ;
    daia:heldBy <http://uri.gbv.de/organization/isil/DE-23> ;
        daia:availableFor [
            a daia:Service ;
            daia:providedBy <http://uri.gbv.de/organization/isil/DE-23>
        ] .

I have only made up the RDF property daia:edition from the SOBR proposal because FRBR relations are too strict. If you know a better relation to directly relate an abstract work to a concrete edition, please let me know.



image created with rdfdot

Request for comments: final specification of DAIA

6. Januar 2012 um 12:13 2 Kommentare

When I started to create an API for availability lookup of document in libraries in 2008, I was suprised that such a basic service was so poorly defined. The best I could find was the just-published recommendation of the Digital Library Federation (DLF-ILS). Even there availability status was basically a plain text message (section 6.3.1 and appendix 4 and 5). Other parts of the DLF-ILS GetAvailability response were more helpful, so they are all part of the Document Availability Information API (DAIA). Here is a simple mapping from DLF-ILS to DAIA:

  • bibliographicIdentifer (string) → document (URI)
  • itemIdentifier (string) → item (URI)
  • dateAvailable (dateTime) → expected (xs:dateTime or xs:date or “unknown”) or delay (xs:duration or “unknown”)
  • location (string) → storage (URI and/or string, plus optional URL)
  • call number (string) → label (string)
  • holdQueueLength (int) → queue (xs:nonNegativeInteger)
  • status (string) and circulating (boolean) → available/unavailable (with service type and additional information)

So you could say that DAIA implements the abstract GetAvailability function from DLF-ILS. I like abstract, language independent specifications, but they must be precise and testable (see Meek’s forgotten paper The seven golden rules for producing language-independent standards). DAIA is more than an implementation: it provides both, an abstract standard and bindings to several data languages (XML, JSON, and RDF). The conceptual DAIA data model defines some basic concepts and relationships (document, items, organisations, locations, services, availabilities, limitations…) independent from whether they are expressed in XML elements, attributes, RDF properties, classes, or any other data structuring method. The only reference to specific formats is the requirement that all unique identifiers must be URIs. Right now there is an XML Schema if you want to express DAIA in XML and an OWL ontology for RDF.

In its fourth year of development (see my previous posts from 2009) DAIA seems to have enough momentum to finally get accepted in practice. We use it in GBV library union (public server at http://daia.gbv.de/), there are independent implementations such as in Doctor-Doc, there is client-support in VuFind and I heard rumors that DAIA capabilities will be build into EBSCO and Summon Discovery Services. Native support in Integrated Library Systems, however, is still lacking – I already have given up hope and prefer a clean DAIA wrapper over a broken DAIA-implementation anyway. If you are interested in creating your own DAIA server/wrapper or client, have a look at my reference implementation DAIA and Plack::App::DAIA at CPAN and Oliver Goldschmidt’s PHP implementation in our common github repository. A conceptual overview as tree (DAIA/JSON, DAIA/XML) and as graph (DAIA/RDF) can be found here.

Still there are some details to be defined and I’d like to solve these issues to come to a version DAIA 1.0. These are

  • How to deal with partial publications (you requested an article but only get the full book or you requested a series but only get a single volume).
  • How to deal with digital publications (especially its possible service types: is “download” a service distinct to “loan” or is “presentation” similar to online access restricted to the library’s intranet?).
  • Final agreement on service types (now there are presentation: item can be used in the institution, loan: item can be used outside of the institution for a limited time, interloan: item can be send to another institution, openaccess: item can be access unrestricted, just get a free copy). Some extensions have been proposed.
  • A set of common limitation types (for instance IP-based access restriction, permission-based access etc.).

I’d be happy to get some more feedback on these issues, especially concrete use cases. We are already discussing on the daia-devel mailing list but you can also comment in your own blog, at public-lld, code4lib, ils-di etc.).

Can SOBR help publishing library holdings?

2. Dezember 2011 um 01:08 5 Kommentare

I just participated in the German conference Semantic Web in Bibliotheken which took place in Hamburg this week. This year there were two slots for lightning talks, but unfortunately participants did not catch on, so we only had four of them. Lightning talks are a good chance to present something unfinished that you need input for, so I presented the Simplified Ontology for Bibliographic Resources (SOBR) as “FRBR light”. You can find the current specification of SOBR at github, which means the specification is still evolving and I’d like to get more feedback.

SOBR was caused by a discussion on the Library Linked Data mailing list about the (disputed) disjointedness of FRBR classes. SOBR has a history in the Document Availability Information API (DAIA), which SOBR might be merged into. The use case of both is publishing information about holdings from library catalogs as Linked Open Data. The information most requested is probably connected to holdings: library users only ask “where is it?” and “how can I get it?”. In this questions, the little word “it” refers to a specific publication. In the answers, however, “it” usually refers to some holding or copy of this publication. Sometimes the holding contains more than the publication (for instance if you ask for an article in a book) and sometimes you get multiple holdings (for instance if you ask for a a large work that is split in multiple volumes). Sometimes there are multiple holdings with different content to choose from, because there are different editions, forms, translations etc. of the requested publication.

A long time ago, some librarians thought about similar questions and answers and came up with the Functional Requirements for Bibliographic Records (FRBR). I tried hard to accept FRBR (I even draw this ugly diagrams that people find when they look up FRBR in Wikipedia). But FRBR does not help me to publish existing library catalogs as Linked Open Data. In our catalog databases we have records that refer to editions, connected with records that refer to holdings (I’ll ignore the little exceptions and nasty special cases such as multiple holdings described by one look-like-a-holding-record). In addition there are some records that refer to series, works, and other types of abstract documents without direct holdings, which are connected to records that refer to editions.

Maybe we can simplify this to two entities: general documents (bibo:Document) and items (with frbr:Item) as special kind of documents. The current design of SOBR also contains a class for editions, but I am not sure whether this class is also needed. At least we need three properties to relate documents to items (daia:exemplar), to relate documents to editions (daia:edition?) and to relate documents to its parts (dcterms:hasPart). To avoid the need of blank nodes, I’d also define properties that relate documents to partial items (daia:extract = dcterms:hasPart + daia:exemplar) and to relate documents to partial editions (daia:editionPart?)

Feedback on SOBR is welcome, especially if you provide examples with existing URIs (or at least local identifiers to already existing data) instead of theoretical FRBR-like-made-up examples. The best way to find a good ontology for publishing library holdings is to actually publish data that describes library holdings! The following image is based on an example that connects a work from LibraryThing and from DBPedia with a partial edition from Worldcat, a full edition from German National Library, and a holding from Hamburg University:

@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix daia: <http://purl.org/ontology/daia/> .
@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://www.librarything.com/work/70394> a bibo:Document ;
  owl:sameAs <http://dbpedia.org/resource/Living_My_Life> ;
  daia:edition <http://d-nb.info/1001703464> , [
    a bibo:Collection # , daia:Document
    ; dct:hasPart <http://www.worldcat.org/oclc/656754414>
  ]
    ;
  # daia:exemplar <http://d-nb.info/1001703464> ; ?
  daia:editionPart <http://www.worldcat.org/oclc/656754414> .

<http://d-nb.info/1001703464> a frbr:Item , bibo:Document ;
   daia:exemplar <http://uri.gbv.de/document/opac-de-18:epn:1220640794> .



image created with rdfdot

URI namespace lookup with prefix.cc and RDF::NS

3. November 2011 um 17:13 Keine Kommentare

Probably the best feature of RDF is that it forces you to use Uniform Resource Identifiers (URI) instead of private, local identifiers which only make sense in a some context. URIs are long and cumbersome to type, so popular URIs are abbreviated with namespaces prefixes. For instance foaf:Person is expanded to http://xmlns.com/foaf/0.1/Person, once you have defined prefix foaf for namespace http://xmlns.com/foaf/0.1/. In theory URI prefixes in RDF are arbitrary (in contrast to XML where prefixes can matter, in contrast to popular belief). In practice people prefer to agree to one or two known prefixes for common URI namespaces (unless they want to obfuscate human readers of RDF data).

So URI prefixes and namespaces and are vital for handling RDF. However, you still have to define them in almost every file and application. In the end people have copy & paste the same prefix definitions again and again. Fortunately Richard Cyganiak created a registry of popular URI namespaces, called prefix.cc (it’s open source), so people at least know where to copy & paste from. I had enough of copying the same URI prefixes from prefix.cc over and over again, so I created a Perl module that includes snapshots of the prefix.cc database. It includes a simple command line client, that is installed automatically:

  $ sudo cpanm RDF::NS
  $ rdfns rdf,foaf.ttl
  @prefix foaf:  .
  @prefix rdf:  .

In your Perl code, you can use it like this:

  use RDF::NS
  my $NS = RDF::NS->new('20111102');
  $NS->foaf_Person; # returns "http://xmlns.com/foaf/0.1/Person"

If you miss an URI prefix, just add it at http://prefix.cc, and will be included in the next release.

TPDL 2011 Doctoral Consortium – part 3

25. September 2011 um 17:36 Keine Kommentare

See also part 1 and part 2 of conference-blogging and #TPDL2011 on twitter.

My talk about general patterns in data was recieved well and I got some helpful input. I will write about it later. Steffen Hennicke, another PhD student of my supervisor Stefan Gradman, then talked about his work on modeling Archival Finding Aids, which are possibly expressed in EAD. The structure of EAD is often not suitable to answer user needs. For this reason Hennicke analyses EAD data and reference questions, to develope better structures that users can follow to find what they look for in archives. This is done in CIDOC-CRM as a high-level ontology and the main result will be an expanded EAD model in RDF. To me the problem of “semantic gaps” is interesting, and I think about using some of Hennicke data as example to explain data patterns in my work.

The last talk by Rita Strebe was about aesthetical user experience of websites. One aim of her work is to measure the significance of aesthetical perception. In particular her hypothesis to be evaluated by experiments are:

H1: On a high level, the viscerally perceived visual aesthetics of websites effects
approach behaviour.
H2: On a low level, the viscerally perceived visual aesthetics of websites effects
avoidance behaviour.

Methods and preliminary results look valid, but the relation to digital libraries seems low and so was the expertise of Strebe’s motivation and methods among the participants. I suppose her work better fits to Human-Computer Interaction.

After the official part of the program Vladimir Viro briefly presented his music search engine peachnote.com, that is based on scanned muscial scores. If I was working in or with musical libraries, I would not hesitate to contact Viro! I also though about a search for free musical scores in Wikimedia framework. The Doctoral Consortium ended with a general discussion about dissertation, science, libraries, users, and everything, as it should be :-)

TPDL 2011 Doctoral Consortium – part 2

25. September 2011 um 12:42 Keine Kommentare

The TPDL 2011 Doctoral Consortium, which I already blogged about in part 1, continued with 15 minutes of delay: Christopher Gibson also talked about eBooks – I wonder why his talk was not combined with Luca Colombo’s work in eBook reading experiences. Gibson’s specific topic is eBook lending services in UK public libraries. To quote the research questions from his paper:

Q1. How have public libraries addressed ebook service provision in the UK?
Q2. What challenges and opportunities exist in incorporating ebook lending into other reader services?
Q3. Is it feasible to lend ebook reading devices from public libraries?
Q4. How can the effectiveness of ebook lending services be measured?
Q5. How do library users view the provision of ebook lending services?
Q6. How can effective ebook lending services be developed?

To me an interesting aspect of his methodology was the use of targeted FOI (freedom of information) requests to gather data about eBook lending services. I cannot image this in this Germany where “Informationsfreiheit” is still in its infancy. One result from another survery done by Gibson: most eBooks are not included in library catalogs. I think this failure is found in German libraries too. In summary the PhD project looked very profound with some real practical values for libraries. On the other hand, the theoretical contribution, for instance the question what “lending” can mean in a digital library work, was only added in the discussion afterwards.

The next presenting PhD student was Adam Sofronjievic. I am sorry that I could not fully concentrate on his talk about a New Paradigm of Library Collaboration although it seemed very interesting. My talk is next :-)

TPDL 2011 Doctoral Consortium

25. September 2011 um 11:21 2 Kommentare

Today the International Conference on Theory and Practice of Digital Libraries 2011 started with tutorials and a Doctoral Consortium that I participate with a talk. The seven talks and discussions on ongoing PhD topics were rather diverse and interesting. I tried to briefly summarize at least some of them.

Luco Colombo started with his work on developing and evaluating eBook reading experience for children. Reading “traditional” books has been extensively investigated – this is not true for eBooks. Especially children are little involved in eBook studies. Colombo explained how the eBook reading experience is different because it directly involves searching, browsing, sharing, and recommending, among other arguments. A good reading experience results in a “flow state” where the reading gets positively lost in a book. Colombo’s method is a cooperative inquiry. It is not clear whether and by what eBooks are more engaging to children (age 9-11 in this study) than traditional books – maybe this PhD will show. The following discussion was dominated by the participating mentors Jose Borbinha, Milena Dobreva, Stefan Gradmann and Giuseppina Vullo.

In the second talk Krassimira Ivanova presented her dissertation on (content-based) image retrieval utilizing color models. Image retrieval on art images is difficult because it includes very different aspects (artistic styles, depicted objects etc.). Even aspects of color (contrasts, intensity, diversity, harmony etc.) are manifold – maybe this is why philosophy of color has a long history. Nevertheless Ivanova developed several machine learning methods for this color aspects that can be used for image retrieval. I am not sure whether the resulting APICAS system (“Art Painting Image Colour Aesthetics and Semantics”) has been evaluated with a user study. Similar to the first talk, the focus could be improved by more narrowing down and making clear the specific contribution. Finally we had some real discussion, but little time.

Modeling is difficult

21. September 2011 um 00:33 2 Kommentare

Yesterday Pete Johnston wrote a detailed blog article about difficulties of “the right” modeling with SKOS, and FOAF in general, and about the proposed RDF property foaf:focus in particular. As Dan Brickley wrote in a recent mail “foaf:focus describes a link from a skos:Concept to ‘the thing itself’. Not every SKOS concept (in a thesauri of classification scheme) will have such a direct “thing”, but many do, especially concepts for people and places.”

Several statements in this discussion made me laugh and smile. Don’t get me wrong – I honor Pete, Dan, and the whole Semantic Web community, but there is a regular lack of philosophy and information science. There is no such thing as ‘the thing itself’ and all SKOS concepts are equal. Even the distinction between an RDF ‘resource’ and an SKOS ‘concept’ is artificial. The problem origins not from wrong modeling, which could be solved by the right RDF properties, but from different paradigms and cultures. There will always be different ways to describe the same ideas with RDF, because neither RDF nor any other technology will ever fully catch our ideas. These technologies are not about things but only about data. As William Kent wrote in Data Reality (1978): “The map is not the territory” (by the way, last year Chris Rusbridge has quoted Kent in the context of linked data). As Erik Wilde and Robert J. Glushko wrote in a great article (2008):

RDF has succeeded beyond the wildest expectations as a convenient format for encoding information in an open and easily computable fashion. But it is just a format, and the difficult work of analysis and modeling information has not and will never go away.

Ok, they referred not to “RDF” but to “XML”, so the quotation is wrong. But the statement is right for both data structuring methods. No matter if you put your data in XML, in RDF, or carve it in stone – there will never be a final model, because there’s more than one way to describe something.

Query parts in Linked Data URIs considered harmful

29. Mai 2011 um 01:40 Keine Kommentare

The generic syntax of URI (and of IRI with slightly different definition of the elements) is:

<scheme name> : <hierarchical part> [ ? <query> ] [ # <fragment> ]

RDF is based on “URI references” which have a different syntax, but all practical RDF data should follow the generic URI syntax. This is also implied by the Linked Data Principles which encourage you to use HTTP URIs. Furthermore it is a good advice not to include fragments in your URIs if you care about coexistence of the Web and the Semantic Web. Yes, there is RDF data with fragment parts and the so called Hash URIs are even given as one design pattern, but you can avoid a lot of trouble if you use URIs without them. By the way, fragment identifiers are also used a lot in JavaScript-based web applications that spoil the whole concept of REST as described in this recent in-depth article on Hash URIs.

I would even go further and say that well-designed URIs for Linked Data should also forgo the query part. A clean URI to be used as identifier in Linked Data should stick to this limited generic syntax:

http: <hierarchical part>

I do not argue to disallow URIs with query part, but I wonder how much they are really used and needed for resources other than Web resources. Especially URIs for non-information resources, should not have a query part. Most web applications will not distinguish between these two:

http://example.org?foo=1&bar=2

http://example.org?bar=2&foo=1

These are different URIs but equivalent URLs. Choosing URIs that are not equivalent to other URIs in common non-RDF applications is a good advice, isn’t it? If you choose a cleaner URI like http://example.org/foobar you could use additional query parts to refer to different representations (information resources) and services around the resource that is referenced by the base URI.

Powered by WordPress with Theme based on Pool theme and Silk Icons.
Entries and comments feeds. Valid XHTML and CSS. ^Top^