Archiving Weblogs with ATOM and RFC 5005: An alternative to OAI-PMH

19. Oktober 2007 um 11:34 1 Kommentar

Following up to my recent post (in German) I had a conversation with my colleague about harvesting and archiving blogs and ATOM vs OAI-PMH. In my opinion with the recent RFC 5005 about Feed Paging and Archiving and its proposed extension of Archived Feeds ATOM can be an alternative to OAI-PMH. Instead of arguing which is better, digital libraries should support both for harvesting and providing archived publications such as preprints and weblog entries (scientific communication and publication already takes place in both).

Instead of having every project to implementing both protocols you could create a wrapper from ATOM with archived feeds to OAI-PMH and vice versa. The mapping from OAI-PMH to ATOM is probably the easier part: You partition the repository into chunks as defined in RFC 5005 with the from and until arguments of OAI-PMH. The mapping from OAI-PMH to ATOM is more complicated because you cannot select with timestamps. If you only specify a fromargument, the corresponding ATOM feed could be harvested going backwards in time but if there is an until argument you must harvest the whole archive just to get the first entries and and throw away the rest. Luckily the most frequent use case is to get the newest entries only. Anyway: Both protocols have their pros and cons and a two-way-wrapper could help both. Of course it should be implemented as open source so anyone can use it (by the way: There seems to be no OAI-crawler in Perl yet: Sure there is OAI-Harvester but for real-world applications you have to deal with unavailable servers, corrupt feeds, duplicated or deleted entries, and a way to save the harvested records, so a whole layer above the harvester is missing).

P.S.: At code4lib Ed Summers pointed me to Stuart Weibel who asked the same question about blog archiving, and to a discussion in John Udell’s blog that include blog archiving (he also mentions BlogML as a possible part of a solution – unluckily BlogML looks very dirty to me, the spec is here). And Daniel Chudnov drafted a blog mirroring architecture.

Powered by WordPress with Theme based on Pool theme and Silk Icons.
Entries and comments feeds. Valid XHTML and CSS. ^Top^