O’Reilly steigt mit HTMLBook auf HTML5 um

22. September 2013 um 13:22 Keine Kommentare

O’Reilly Media, führender und einflussreicher Verlag im Bereich elektronisches Publizieren und technischer Handbücher, hat angekündigt von DocBook XML auf eine Teilmenge von HTML5 umzusteigen. Neben dem Blogartikel wird der Schritt in einem Fachartikel auf der Balisage Konferenz erläutert. HTML5 soll sowohl erweiterte Dokumentformen (aka „Beyond the PDF“) ermöglichen, als auch gleichzeitig als Autorenwerkzeug dienen. Die von O’Reilly entwickelte Teilmenge von HTML5 heisst „HTMLBook“ und ist ein einem öffentlichen Git-Repository einsehbar. Ich persönlich bin ja etwas skeptisch, da WYSIWYG dem Ziel widerspricht, Inhalte für verschiedene Ausgabeformate zu entwerfen und ziehe deshalb Pandoc vor.

Mapping bibliographic record subfields to JSON

13. April 2011 um 16:26 4 Kommentare

The current issue of Code4Lib journal contains an article about mapping a bibliographic record format to JSON by Luciano Ramalho. Luciano describes two approaches to express the CDS/ISIS format in a JSON structure to be used in CoudDB. The article already provoked some comments – that’s how an online journal should work!

The commentators mentioned Ross Singer’s proposal to serialize MARC in JSON and Bill Dueber’s MARC-HASH. There is also a MARC-JSON draft from Andrew Houghton, OCLC. The ISIS format reminded me at PICA format which is also based on fields and subfields. As noted by Luciano, you must preserves subfield ordering and allow for repeated subfields. The existing proposals use the following methods for subfields:

Luciano’s ISIS/JSON:

[ ["x","foo"],["a","bar"],["x","doz"] ]

Ross’s MARC/JSON:

"subfields": [ {"x":"foo"},{"a":"bar"},{"x":"doz"} ]

Bill’s MARC-HASH:

[ ["x","foo"],["a","bar"],["x","doz"] ]

Andrew’s MARC/JSON:

"subfield": [
  {"code":"x","data":"foo"},{"code":"a","data":"bar"},
  {"code":"x","data":"doz"} ]

In the end the specific encoding does not matter that much. Selecting the best form depends on what kind of actions and access are typical for your use case. However, I could not hesitate to throw my encoding used in luapica into the ring:

{ "foo", "bar", "doz", 
  ["codes"] = { 
    ["x"] = {1,3}
    ["a"] = {2}
}}

I think about further simplifying this to:

{ "foo", "bar", "doz", ["x"] = {1,3}, ["a"] = {2} }

If f is a field than you can access subfield values by position (f[1], f[2], f[3]) or by subfield code f[f.x[1]],f[f.a[1]],f[f.x[2]]. By overloading the table access method, and with additional functions, you can directly write f.x for f[f.x[1]] to get the first subfield value with code x and f:all("x") to get a list of all subfield values with that code. The same structure in JSON would be one of:

{ "values":["foo", "bar", "doz"], "x":[1,3], "a":[2] }
{ "values":["foo", "bar", "doz"], "codes":{"x":[1,3], "a":[2]} }

I think a good, compact mapping to JSON that includes an index could be:

[ ["x", "a", "x"], {"x":[1,3], "a":[2] },
  ["foo", "bar", "doz"], {"foo":[1], "bar":[2], "doz":[3] } ]

And, of course, the most compact form is:

["x","foo","a","bar","x","doz"]

Data is not meaning – but a web badge

6. Januar 2011 um 00:57 3 Kommentare

I’m am sure that Douglas Adams and John Lloyd had a word for it: you know exactly what you mean, but not how to call it. Recently I tried to find information about : A particular kind of „web banner“, „button“, or „badge“ with specific size, border, and two parts. I finally found out, that it is a 80×15 web badge as introduced by Antipixel in 2002. A helpful description of the format is given by ZwahlenDesign, who also points to two online badge creation tools: Brilliant Button Maker and Button Maker. Note that the tools use „Button“ instead of „Badge“ to refer to the same thing.

I created a web badge to promote a simple philosophical web standard: data is not meaning* Here is the data as 177 bytes hexdump:

89 50 4E 47 0D 0A 1A 0A 00 00 00 0D 49 48 44 52 00 00 00 50 00 00 00 0F 01 03 00 00 00 49 07 DA CC 00 00 00 01 73 52 47 42 00 AE CE 1C E9 00 00 00 06 50 4C 54 45 FF FF FF 00 00 00 55 C2 D3 7E 00 00 00 59 49 44 41 54 08 D7 63 F8 0F 07 0C 0D 0C 50 C0 C8 B0 FF FF FF 0F D8 99 0D 10 E6 8E CF 7D 05 2D 7E 86 42 2E 85 0C BB 73 EF 6E 7E 76 C2 73 52 4A 23 C3 EE C4 3B 06 AD 7E 95 41 21 1B C1 A2 4F 3C 3C 8D 7C 26 32 EC 78 7B 77 43 8B 9F A7 90 4B 22 B2 09 D8 AD 40 72 03 C2 65 00 CA 67 45 A7 86 69 B7 81 00 00 00 00 49 45 4E 44 AE 42 60 82

If data was meaning, that should be all to say. But data is just a stream of bits, bytes, numbers, characters, strings, nodes, triples, or sometimes even words. You have to make use of it in a meaningful way. For instance you could give the data above to a specific piece of software like your web browser. Here comes the data again:

This PNG image was the smallest I was able to create with optimized colors, LZW compression etc. There is another piece of data, that is only eight bits more (178 bytes) and looks the same as GIF image:

I could also express the monochrome badge with one bit per pixel. That makes 80×15 = 1200 bits = 150 bytes uncompressed. The meaning could be the same, but not when only looking at your browser (because this piece of software cannot handle my „ad-hoc monochrome 80×15 web badge format“).

I also created a version with color. Feel free to use and modify as you like. In this case the PNG with 198 bytes is slightly larger than the GIF with 196 bytes.

PNG:

GIF:

* I was suprised that there were only seven Google hits for this phrase, none of them with the same meaning (sic!) that I try to express by this article. The badge was inspired by this important warning sign.

P.S: Enough data philosophy, time for music. There is so much more than one and zero and one and zero and one!

Who identifies the identifiers?

10. Mai 2009 um 16:39 7 Kommentare

A few days ago, after a short discussion on Twitter, Ross Singer posted a couple of open questions about identifiers for data formats on code4lib and other related mailing lists. He outlined the problem that several APIs like Jangle, unAPI, SRU, OpenURL, and OAI-PMH use different identifiers to specify the format of data that is transported (MARC-XML, Dublin Core, MODS, BibTeX etc.). It is remarable that all these APIs are more or less relevant only in the libraries sector while the issue of data formats and its identifiers is also relevant in other areas – looks like the ivory tower of library standards is still beeing build on.

The problem Ross issued is that there is little coordination and each standard governs its own registry of data format identifiers. An inofficial registry for unAPI [archive] disappeared (that’s why I started the discussion), there is a registry for SRU, a registry for OpenURL, and a list for Jangle. In OAI-PMH and unAPI each service hosts its own list of formats, OAI-PMH includes a method to map local identifier to global identifiers.

On code4lib several arguments and suggestions where raised which almost provoced me to a rant on library standards in general (everyone want’s to define but noone likes to implement and reuse. Why do librarians ignore W3C and IETF?). Identifiers for data formats should neither be defined by creators of transport protocols nor do we need yet another über-registry. In my point of view the problem is less technical but more social. Like Douglas Campbell writes in Identifying the identifiers, one of the rare papers on identifier theory: it’s not a technology issue but a commitment issue.

First there is a misconception about registries of data format identifiers. You should distinguish descriptive registries that only list identifiers and formats that are defined elsewhere and authoritative registries that define identifiers and formats. Yes: and formats. It makes no sense to define an identifier and say that is stands for data format X if you don’t provide a specification of format X (either via a schema or via a pointer to a schema). This already implies that the best actor to define a format identifier is the creator of the format itself.

Second local identifiers that depend on context are always problematic. There is a well-established global identifier system called Uniform Resource Identifier (URI) and there is no excuse not to use URIs as identifiers but incapability, dullness, laziness, or ignorance. The same reasons apply if you create a new identifier for a data format that already has one. One good thing about URI is that you can always find out who was responsible for creating a given identifier: You start with the URI Scheme and drill down the namespaces and standards. I must admin that this process can be laborious but at least it makes registries of identifiers descriptive for all identifiers but the ones in their own namespace.

Third you must be clear on the definition of a format. For instance the local identifier „MARC“ does not refer to a format but to many variants (USMARC, UNIMARC, MARC21…) and encodings (MARCXML/MARC21). This is not unusual if you consider that many formats are specializations of other formats. For instance ATOM (defined by RFC4287 and RFC5023, identified either its Mime Type „application/atom+xml“ which can could expressed as URI http://www.iana.org/assignments/media-types/application/atom%2Bxml or by its XML Namespace „http://www.w3.org/2005/Atom“)* is extended from XML (specified in http://www.w3.org/TR/xml [XML 1.0] and http://www.w3.org/TR/xml11 [XML 1.1], identified by this URLs or by the Mime Type „application/xml“ which is URI http://www.iana.org/assignments/media-types/application/xml)*.

The problem of identifying the right identifiers for data formats can be reduced to two fundamental rules of thumb:

1. reuse: don’t create new identifiers for things that already have one.

2. document: if you have to create an identifier describe its referent as open, clear, and detailled as possible to make it reusable.

If there happen to exist multiple identifiers for one thing, choose the one that is documented and adopted best. There will always be multiple identifiers for the same thing – don’t make it worse.

*Footnote: The identification of Internet Media Types with URIs that start with http://www.iana.org/assignments/media-types/ is neither widely used nor documented well but it’s the most official URI form that I could find. If for a particular format there is a better identifier – like an XML or RDF namespace – then you should use that, but if there is nothing but a Mime Type then there is no reason to create a new URI on your own.