Query parts in Linked Data URIs considered harmful

29. Mai 2011 um 01:40 Keine Kommentare

The generic syntax of URI (and of IRI with slightly different definition of the elements) is:

<scheme name> : <hierarchical part> [ ? <query> ] [ # <fragment> ]

RDF is based on „URI references“ which have a different syntax, but all practical RDF data should follow the generic URI syntax. This is also implied by the Linked Data Principles which encourage you to use HTTP URIs. Furthermore it is a good advice not to include fragments in your URIs if you care about coexistence of the Web and the Semantic Web. Yes, there is RDF data with fragment parts and the so called Hash URIs are even given as one design pattern, but you can avoid a lot of trouble if you use URIs without them. By the way, fragment identifiers are also used a lot in JavaScript-based web applications that spoil the whole concept of REST as described in this recent in-depth article on Hash URIs.

I would even go further and say that well-designed URIs for Linked Data should also forgo the query part. A clean URI to be used as identifier in Linked Data should stick to this limited generic syntax:

http: <hierarchical part>

I do not argue to disallow URIs with query part, but I wonder how much they are really used and needed for resources other than Web resources. Especially URIs for non-information resources, should not have a query part. Most web applications will not distinguish between these two:

http://example.org?foo=1&bar=2
http://example.org?bar=2&foo=1

These are different URIs but equivalent URLs. Choosing URIs that are not equivalent to other URIs in common non-RDF applications is a good advice, isn’t it? If you choose a cleaner URI like http://example.org/foobar you could use additional query parts to refer to different representations (information resources) and services around the resource that is referenced by the base URI.

Hardware-Emulatoren im Browser

18. Mai 2011 um 00:48 1 Kommentar

Für die meisten Nerds war die Nachricht der letzten Tage wahrscheinlich, das Meister Fabrice Bellard einen x86-Emulator in Javascript geschrieben hat. Seine Demo bootet einen Linux-Kernel, inklusive C-Compiler, so dass er sich prinzipiell zum Bootstrapping von anderen Anwendungen verwendet werden kann. Vielleicht lässt sich auch mit MS-DOS und Turbo-C++ die Umgebung zum Laufen bekommen, mit ich mir Mitte der 1990er Programmieren beigebracht habe.

Der x86-Emulator ist jedoch nicht der einzige Emulator mit dem nicht einmal 20 Jahre alte Hardware in einem Browser-Tab simuliert werden kann. Hier einige Fundstücke:

Weitere verwandte Projekte sind unter Anderem JS/UIX (Unix in JavaScript), Emscripten (Crosscompiler zu JavaScript) und Emulatoren die als Java-Plugin im Browser laufen wie z.B. Virtual Apple ][. JavaScript hat jedoch den Vorteil, das nichts zusätzlich installiert werden muss und dass der Emulator in einer Sandbox läuft (wobei es sicher für jeden Browser JavaScript-Exploits gibt). Wirklich cool wäre ein Emulator für den NeXT auf dem Tim Berners-Lee den ersten Webbrowser entwickelte. Ich warte jetzt auf den ersten Emulator der in JavaScript eine virtuelle Maschine erzeugt, in der ein JavaScript-fähiger Browser läuft *g*

Research data, git hashes, and unbreakable links

5. Mai 2011 um 23:57 1 Kommentar

Yesterday I had a short conversation about libraries and research data. This topic seems to be trendy, for instance D-Lib magazine just had a special issue about it. I am not quite sure about the role of libraries for the management of research data. It looks like many existing projects at least aim at analyzing research data – this can get very complex because any data can be research data. Maybe librarians should better limit to what they can do best and stick to metadata. It is not the job of librarians to analyze traditional publications (there are scientists to do so, for instance in philology), so why should they start analyzing research data? It would be easier if we just treat research data as „blobs“ (plain sequences of bytes) to not get lost in the details of data formats. There will be still enough metadata to deal with (although this metadata might better be managed by the users).

One argument in our discussion was that libraries might just use the distributed revision control system git. Git is also trendy, but among software developers that must track many files of source code with revisions and dependencies. Although git is great for source code and lousy for raw binary data, we could learn something from its architecture (actually there is an extension to git to better handle large binary files). I already knew that git uses hash sums and hash trees and was curious how it actually stores data and metadata.

Management of data in git is basically based on the SHA1 hashing algorithm, but you could also use another hashing method. This answer told me how git calculates the SHA1 for a chunk of data. Note that the name of a file is not part of the calculation, as the filename is no data but metadata. You can move around and rename a file; its hash remains the same. More details of how git stores data and metadata about collections of data chunks (filenames and commits) can be found in the git community book, in the git book and in this blog article how git stores your data.

Hashes are also used in peer-to-peer networks to reference files of unknown location. With distributed hash tables you can even decentralize the lookup mechanism. Of course someone still needs to archive the data, but if the data is stored at least somewhere in the system, it cannot get lost by wrong names or broken links. Instead of pointing to locations of files, metadata about research data should contain an unbreakable link to the data in form of its hash. Libraries that want to deal with research data can then focus on metadata. Access to data could be provided via BitTorrent or any other method. The problem of archiving is another issue that should better be solved independently from description and access.

A third trendy topic is linked data and RDF. You can use clean URIs like this to refer to any chunk of data: urn:sha1:cd50d19784897085a8d0e3e413f8612b097c03f1

To make it even more trendy (you are welcome to reuse my idea in your next library research project proposal 😉 put the data objects into the cloud. No more file names, no more storage media – data is just a link in form of a hash value and a big cloud that you can look up data chunks by their hash.

P.S: A short explanation why you can really replace any piece of (research) data by its hash: There are 2160 different SHA1 hash values. According to rules of probability the expected number of hashes that can be generated before an accidental collision („birthday paradox'“) is 280. The sun will expand in around 5 billion years (less than 258 seconds from now, making life on earth impossible. That means until then we can still generate 2^44 (4 million) hashes per second and collisions are still unlikely. With cryptographic attacks the number can be smaller but it is still much larger than other sources of error.