Research data, git hashes, and unbreakable links

5 May 2011 at 23:57, 1 comment

Yesterday I had a short conversation about libraries and research data. The topic seems to be trendy; for instance, D-Lib Magazine just had a special issue about it. I am not quite sure about the role of libraries in the management of research data. It looks like many existing projects at least aim at analyzing research data, which can get very complex because any data can be research data. Maybe librarians should limit themselves to what they do best and stick to metadata. It is not the job of librarians to analyze traditional publications (there are scientists who do so, for instance in philology), so why should they start analyzing research data? It would be easier to just treat research data as “blobs” (plain sequences of bytes) so as not to get lost in the details of data formats. There will still be enough metadata to deal with (although this metadata might be better managed by the users).

One argument in our discussion was that libraries might just use the distributed revision control system git. Git is also trendy, but among software developers who must track many source code files with their revisions and dependencies. Although git is great for source code and lousy for raw binary data, we could learn something from its architecture (there is actually an extension to git that handles large binary files better). I already knew that git uses hash sums and hash trees and was curious how it actually stores data and metadata.

Data management in git is built on the SHA1 hashing algorithm, although in principle another hashing method could be used. This answer told me how git calculates the SHA1 for a chunk of data. Note that the name of a file is not part of the calculation, because the filename is not data but metadata. You can move a file around and rename it; its hash remains the same. More details of how git stores data and metadata about collections of data chunks (filenames and commits) can be found in the git community book, in the git book, and in this blog article on how git stores your data.
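As a minimal sketch of that calculation (in Python, which the post itself does not use): git prefixes the content with a short header consisting of the word "blob", a space, the byte length in decimal, and a NUL byte, and then takes the SHA1 of the result. The function name git_blob_sha1 is made up for this illustration.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Return the hash git assigns to `data` when stored as a blob object."""
    # Header: object type, a space, the content length in bytes, a NUL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# The empty blob has a well-known hash in every git repository.
print(git_blob_sha1(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# No filename enters the calculation, so identical bytes under two
# different names (hypothetical here) always yield the same hash.
files = {"results.csv": b"a;b\n1;2\n", "copy/renamed.csv": b"a;b\n1;2\n"}
hashes = {name: git_blob_sha1(content) for name, content in files.items()}
print(len(set(hashes.values())))  # 1
```

The same value should come out of git hash-object when it is run on a file with the same content.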

Hashes are also used in peer-to-peer networks to reference files of unknown location. With distributed hash tables you can even decentralize the lookup mechanism. Of course someone still needs to archive the data, but as long as the data is stored somewhere in the system, it cannot get lost through wrong names or broken links. Instead of pointing to the locations of files, metadata about research data should contain an unbreakable link to the data in the form of its hash. Libraries that want to deal with research data can then focus on metadata. Access to the data could be provided via BitTorrent or any other method. Archiving is a separate problem that is better solved independently of description and access.
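The separation between data and metadata can be sketched with a toy content-addressed store. This is only an illustration under simple assumptions: everything lives in one in-memory dictionary rather than a distributed hash table, and the class BlobStore and the metadata fields are invented for the example.

```python
import hashlib


class BlobStore:
    """Toy in-memory store that files every blob under its SHA1 hash."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Archive the bytes and return the hash that now identifies them."""
        key = hashlib.sha1(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        """Fetch bytes by hash and check that they still match it."""
        data = self._blobs[key]  # raises KeyError if nobody archived the blob
        if hashlib.sha1(data).hexdigest() != key:
            raise ValueError("stored bytes do not match their hash")
        return data


store = BlobStore()
key = store.put(b"temperature;21.5\n")

# The metadata record lives elsewhere and points at the data only by hash.
record = {"title": "Lab temperature log", "filename": "temp.csv", "data": key}
record["filename"] = "renamed.csv"  # renaming cannot break the link

print(store.get(record["data"]) == b"temperature;21.5\n")  # True
```

Because the key is derived from the content, whoever receives the bytes can verify them without trusting the source, which is what lets access (BitTorrent or anything else) vary independently of the description.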

A third trendy topic is linked data and RDF. You can use clean URIs like this to refer to any chunk of data: urn:sha1:cd50d19784897085a8d0e3e413f8612b097c03f1
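A small sketch of how such a URI could be produced and checked. It assumes the plain SHA1 of the raw bytes written in hexadecimal, as in the example above; the post does not say which data that example refers to, and other conventions for urn:sha1 (such as a Base32 encoding) exist. The helper names sha1_urn and matches are made up.

```python
import hashlib

PREFIX = "urn:sha1:"


def sha1_urn(data: bytes) -> str:
    """Build a urn:sha1 URI from the hex SHA1 digest of the raw bytes."""
    return PREFIX + hashlib.sha1(data).hexdigest()


def matches(urn: str, data: bytes) -> bool:
    """Check whether `data` is the chunk that `urn` refers to."""
    return urn.lower() == sha1_urn(data)


urn = sha1_urn(b"abc")
print(urn)                    # urn:sha1:a9993e364706816aba3e25717850c26c9cd0d89d
print(matches(urn, b"abc"))   # True
print(matches(urn, b"abd"))   # False
```

Note that this hashes the bytes directly, so the value differs from the hash git would assign to the same content as a blob.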

To make it even more trendy (you are welcome to reuse my idea in your next library research project proposal 😉), put the data objects into the cloud. No more file names, no more storage media: data is just a link in the form of a hash value and a big cloud in which you can look up data chunks by their hash.

P.S.: A short explanation of why you can really replace any piece of (research) data by its hash: There are 2^160 different SHA1 hash values. According to the rules of probability, the expected number of hashes that can be generated before an accidental collision (“birthday paradox”) is about 2^80. The sun will expand in around 5 billion years (less than 2^58 seconds from now), making life on earth impossible. That means until then we can still generate 2^22 (about 4 million) hashes per second and collisions are still unlikely. With cryptographic attacks the number can be smaller, but it is still much larger than other sources of error.
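The arithmetic can be checked in a few lines. This is a rough order-of-magnitude sketch: it takes the birthday bound as the square root of the number of hash values (the exact expectation carries a small constant factor) and assumes a year of 365.25 days.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600        # assumption: Julian year
seconds_left = 5e9 * SECONDS_PER_YEAR         # roughly 1.6e17 seconds

hash_space = 2 ** 160                         # number of SHA1 values
birthday_bound = math.isqrt(hash_space)       # square root of the space

# Hashes per second if the bound is spread over 2^58 seconds.
rate = birthday_bound // 2 ** 58

print(math.log2(seconds_left) < 58)   # True
print(birthday_bound == 2 ** 80)      # True
print(rate)                           # 4194304
```

So 2^80 hashes spread over fewer than 2^58 seconds gives 2^80 / 2^58 = 2^22, a little over 4 million hashes per second.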

One comment

  1. Librarians and their role with research data. That’s quite an interesting topic. Check out the recent statement of John Wood, Chair of the EC High-Level Group on Scientific Data:

    “We need not librarians, but we need data scientists, who know how to operate in the space. And that’s a new type of discipline. […] This person is based in the laboratory or in a library, but is on the side in the team as it works with researchers, not outside.” (http://vimeo.com/19758031)

    Comment by Christian, 16 May 2011
