Archiving Weblogs with ATOM and RFC 5005: An alternative to OAI-PMH
19 October 2007 at 11:34 · 1 Comment

Following up on my recent post (in German), I had a conversation with my colleague about harvesting and archiving blogs, and about ATOM vs OAI-PMH. In my opinion, with the recent RFC 5005 about Feed Paging and Archiving, which defines Archived Feeds, ATOM can be an alternative to OAI-PMH. Instead of arguing over which is better, digital libraries should support both for harvesting and providing archived publications such as preprints and weblog entries (scientific communication and publication already take place in both).
Instead of having every project implement both protocols, one could create a wrapper from ATOM with archived feeds to OAI-PMH and vice versa. The mapping from OAI-PMH to ATOM is probably the easier part: you partition the repository into chunks as defined in RFC 5005, using the from and until arguments of OAI-PMH. The mapping from ATOM to OAI-PMH is more complicated because you cannot select by timestamps: if only a from argument is given, the corresponding ATOM feed can be harvested going backwards in time, but if there is an until argument you must harvest the whole archive just to get the first entries and throw away the rest. Luckily the most frequent use case is to get the newest entries only. Anyway: both protocols have their pros and cons, and a two-way wrapper could help both. Of course it should be implemented as open source so anyone can use it. (By the way: there seems to be no OAI crawler in Perl yet. Sure, there is OAI-Harvester, but for real-world applications you have to deal with unavailable servers, corrupt feeds, duplicated or deleted entries, and a way to save the harvested records, so a whole layer above the harvester is missing.)
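To make the asymmetry concrete, here is a minimal sketch in Python of the from-only case: follow the rel="prev-archive" links of RFC 5005 backwards in time until the entries leave the requested window. It assumes the third-party feedparser library, uses a made-up feed URL, and ignores all the real-world trouble (unavailable servers, corrupt feeds, duplicates) just mentioned.

```python
import time
import feedparser  # third-party: pip install feedparser

def harvest_since(subscription_url, from_time):
    """Collect all entries updated at or after from_time by walking an
    RFC 5005 archived feed backwards via rel="prev-archive" links."""
    entries, url = [], subscription_url
    while url:
        feed = feedparser.parse(url)
        fresh = [e for e in feed.entries if e.updated_parsed >= from_time]
        entries.extend(fresh)
        if len(fresh) < len(feed.entries):
            break  # this page already reaches beyond the time window
        # otherwise follow the archive chain further back in time
        url = next((l.href for l in feed.feed.get("links", [])
                    if l.get("rel") == "prev-archive"), None)
    return entries

# e.g. everything updated since October 1st, 2007:
# posts = harvest_since("http://example.org/feed.atom",
#                       time.strptime("2007-10-01", "%Y-%m-%d"))
```

An until argument, in contrast, would force this loop to walk the whole archive chain before anything could be returned.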
P.S.: At code4lib, Ed Summers pointed me to Stuart Weibel, who asked the same question about blog archiving, and to a discussion in Jon Udell’s blog that includes blog archiving (he also mentions BlogML as a possible part of a solution; unfortunately BlogML looks very dirty to me, the spec is here). And Daniel Chudnov drafted a blog mirroring architecture.
Second day at MTSR
18 October 2007 at 18:46 · No Comments

It is already a week ago (conference blogging should be published immediately), so I had better summarize my final notes on the MTSR 2007 conference: Continue reading Second day at MTSR…
Introducing the Open Research Society
12 October 2007 at 09:52 · No Comments

After a short break at MTSR 2007, where I got to know Panayiota Polydoratou yesterday (greetings to Traugott Koch!), Miltiadis Lytras introduced the Open Research Society (ORS) and raised some important general questions: Why do we do research? Who can benefit from our research? Which alternatives to the current system of publication and review exist? How can we overcome the digital divide? The Open Research Society will also participate in the Open Knowledge Summit in Athens (24-26 September 2008), and it is going to publish a couple of new Open Access journals. Have a look at their website and welcome this new organization in the area of Open Access and Open Content!
Miguel-Angel Sicilia explained the ORS plans in more detail in his presentation From open access to open research and information sustainability. The proposed ORS journals (to which ORS should not be limited) are going to be fully open access without author fees, and all research data must be provided. Peer review is planned to be double-blind, but there will be additional experiments with other review methods to find out how peer review could be changed. Sicilia also talked about Open Access and information sustainability, which is a hard challenge given the explosion of publications.
My first impression of the Open Research Society is very promising – we should collaborate with Science Commons, Wikimedia and similar projects!
More MTSR 2007 presentations
11 October 2007 at 15:54 · No Comments

The after-lunch session of MTSR 2007 contained five presentations:
Spyros Voulgaris presented A Web Classifier for Semantic Classification Between News and Sports Broadcasts, an automated method for classifying news vs sports broadcasts based on properties of the audio signal. The method does not require speech recognition and is language-independent. The audio signal is processed into a feature vector which is then fed to a neural network for classification. For feature extraction, the AMDF (Average Magnitude Difference Function) is used on segments of one to six seconds. I did not know AMDF before, so I cannot tell you how exactly it is used here, nor what simple automatic binary classification has to do with semantics.
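The general definition of AMDF is simple though: for each lag, take the mean absolute difference between the signal and a shifted copy of itself, so periodic audio shows deep valleys at multiples of its period. A minimal sketch in Python with NumPy (just the textbook function; the paper’s actual segmentation and parameters are not reproduced here):

```python
import numpy as np

def amdf(frame, max_lag):
    """Average Magnitude Difference Function: for each lag tau, the
    mean absolute difference between the frame and its tau-shifted
    copy. The resulting curve can serve as a feature vector."""
    n = len(frame)
    return np.array([np.abs(frame[:n - tau] - frame[tau:]).mean()
                     for tau in range(1, max_lag + 1)])
```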
Based on the classifier by Voulgaris et al., Stefanos Asonitis presented a Semiautomated tool for characterizing news video files, using metadata schemas. Their system consists of a web crawler for video content, the classifier, and an export to NewsML and SportsML which is partly derived from the sources and the classifier and partly edited by users of the system.
For a scientometrics lover like me, Metadata Encoding for the Levels of Scientific Research, presented by Nikoletta Peponi, was highly interesting. Frankly speaking, most of the schema is outdated (for instance the division into article, monograph, essay, and thesis), naive, and incomplete. Without a set of examples and mappings to existing ontologies it is pointless. But it’s an interesting beginning.
Sylvia Poulimenou (Metadata Encoding for the Documents based on the Rules of Diplomatics Science) presented an extension of TEI for diplomatics (the analysis and critical edition of documents to test their authenticity).
In the fifth presentation, Mrs. Belesiotis talked about Ontology Oriented Support for the Teaching Process in the Greek Secondary Education. My knowledge of didactics is too low to write more about this, but the didactics of the presentation could have been better than just speed-reading the text on too many overfilled slides. Maybe I just missed the point.
MTSR 2007 impressions
11 October 2007 at 13:53 · 5 Comments

Although there is wireless all over the MTSR2007 conference venue, I have found no postings about MTSR2007 or MTSR07 so far, so I’ll just summarize the talks of the first session, which I just watched:
Evangelos Sakkopoulos presented a maintenance scheme for detecting the category a website may belong to, for effective caching and client-side re-ranking of websites in mobile applications. His approach is based on the observed browsing behaviour of „bursty cases“: the result categories of a few web pages are accessed frequently for short periods of time. Sakkopoulos uses the categories of the Open Directory Project (ODP), but the method could also be applied to other sets like the Wikipedia category system or library classifications.
In the second presentation, Dimitrios Koutsomitropoulos talked about Ontology-based Knowledge Acquisition through Semantic Profiling: An Application to the Cultural Heritage Domain, that is, profiling CIDOC-CRM (ISO 21127:2006) by refining existing and adding new classes and properties with OWL. With additional restrictions and refinements you can increase expressiveness and better match a particular case. Well, this is nice, but in my experience we do not need more complexity and detail but less, because the existing data is much less homogeneous and detailed than ontology theorists dream of. CIDOC-CRM is important, but real-world applications will rather use simplifications of it.
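To show what such OWL profiling looks like, here is a hedged sketch with Python’s rdflib: a new class refines a CIDOC-CRM class with a cardinality restriction. The namespace URI and the concrete restriction are invented for illustration and not taken from the paper.

```python
from rdflib import Graph, Literal, Namespace, BNode, RDF, RDFS
from rdflib.namespace import OWL, XSD

CRM = Namespace("http://example.org/cidoc-crm#")  # illustrative URI only
EX = Namespace("http://example.org/profile#")

g = Graph()
# A profiled subclass of E22 Man-Made Object that must have exactly
# one production event (a cardinality restriction on P108i).
r = BNode()
g.add((r, RDF.type, OWL.Restriction))
g.add((r, OWL.onProperty, CRM.P108i_was_produced_by))
g.add((r, OWL.cardinality, Literal(1, datatype=XSD.nonNegativeInteger)))
g.add((EX.UniqueArtwork, RDF.type, OWL.Class))
g.add((EX.UniqueArtwork, RDFS.subClassOf, CRM.E22_Man_Made_Object))
g.add((EX.UniqueArtwork, RDFS.subClassOf, r))
print(g.serialize(format="turtle"))
```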
Gianluca Colombo presented a Reference Ontology Design for a Neurovascular Knowledge Network. He researched how phenotypes in distributed clinical databases can be described with methods of bioinformatics so that they can be aggregated jointly. He admitted that the most difficult part is mapping the existing data to one common ontology.
Finally, Irina Astrova presented Rule-Based Transformation of SQL Relational Databases to OWL Ontologies. I cannot judge her work because I don’t know the current research on mapping SQL data to the Semantic Web (it is surely a topic that many researchers deal with), but it looks more practical and relevant than the other presentations because most data exists in SQL databases. The implementation, QUALEG DB, can even go both ways (SQL to OWL and OWL to SQL), as shown in another paper of hers; you can get the software if you want (they are going to rename it).
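I don’t know QUALEG DB’s actual transformation rules, but the general flavor of such a rule-based mapping is easy to sketch (here in Python with rdflib; the toy schema and namespace are invented): tables become OWL classes, plain columns become datatype properties, and foreign keys become object properties.

```python
from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import OWL, XSD

EX = Namespace("http://example.org/ontology#")  # hypothetical namespace

# Toy relational schema: table -> [(column, SQL type, foreign-key target)]
schema = {
    "patient": [("name", "varchar", None), ("hospital_id", "int", "hospital")],
    "hospital": [("city", "varchar", None)],
}
SQL2XSD = {"varchar": XSD.string, "int": XSD.integer}

g = Graph()
for table, columns in schema.items():
    cls = EX[table.capitalize()]
    g.add((cls, RDF.type, OWL.Class))  # rule 1: table -> owl:Class
    for name, sqltype, fk in columns:
        prop = EX[name]
        g.add((prop, RDFS.domain, cls))
        if fk:  # rule 2: foreign key -> owl:ObjectProperty
            g.add((prop, RDF.type, OWL.ObjectProperty))
            g.add((prop, RDFS.range, EX[fk.capitalize()]))
        else:   # rule 3: plain column -> owl:DatatypeProperty
            g.add((prop, RDF.type, OWL.DatatypeProperty))
            g.add((prop, RDFS.range, SQL2XSD[sqltype]))
print(g.serialize(format="turtle"))
```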
The lunch was simple and the view over the bay is wonderful.
Presentation about versioned ISO-3166 in SKOS
11 October 2007 at 00:03 · 1 Comment

Contrary to my usual habits, and thanks to WLAN in the hotel, the slides of my presentation Encoding changing country codes in RDF with ISO 3166 and SKOS at the second International Conference on Metadata and Semantics Research (MTSR2007) are ready and online even before the conference started! The full, detailed paper is not online yet because I am revising and correcting it (I found a very relevant paper after submission). And the serendipity effect of SlideShare works: looking for other presentations about SKOS, I stumbled upon the very interesting slides of Sebastian Kruk, who works in the Corrib project on the semantic web and digital libraries.
P.S.: A preprint of the revised paper is available at arXiv.org.
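The core modelling idea of the presentation can be sketched in a few lines of Python with rdflib (the URIs are hypothetical, not the encoding from the paper): because ISO 3166 reassigns codes, for example „CS“ denoted Czechoslovakia and was later reassigned to Serbia and Montenegro, each assignment becomes its own skos:Concept and the code string is only attached as a notation.

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import SKOS

CC = Namespace("http://example.org/iso3166/")  # hypothetical URIs

g = Graph()
# One concept per code *assignment*, not per code string, so both
# historical meanings of "CS" can coexist and be referenced stably.
for uri, label in [(CC["CS-CSK"], "Czechoslovakia"),
                   (CC["CS-SCG"], "Serbia and Montenegro")]:
    g.add((uri, RDF.type, SKOS.Concept))
    g.add((uri, SKOS.notation, Literal("CS")))
    g.add((uri, SKOS.prefLabel, Literal(label, lang="en")))
print(g.serialize(format="turtle"))
```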
Yet another semantic tagging application
14 September 2007 at 02:11 · 3 Comments

I just found another semantic tagging application: SemKey is also a Firefox plugin, like Entity Describer, which I just wrote about. SemKey uses WordNet and Wikipedia as controlled vocabularies and helps you find the appropriate entry in them. Maurizio Tesconi and his colleagues describe SemKey in their paper SemKey: A Semantic Collaborative Tagging System at the WWW2007 Workshop on Tagging and Metadata for Social Information Organization (other papers linked here).
But the authors of SemKey don’t cite Gabrilovich and Markovitch (2006): Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge, which is highly related (see also the follow-up paper by Gabrilovich and Markovitch). Looks like both Marchetti et al. and their reviewers at the WWW 2007 workshop don’t know their subject area. There is also little feedback on SemKey: this is science 1.0 about Web 2.0. Researchers 2.0 publish their work on weblogs and preprint archives, or even dare to fight in the jungle of Wikipedia to push forward knowledge instead of citation rank.
The Steve.museum tagging project
13 September 2007 at 22:56 · 1 Comment

Steve.museum is a tagging project that has been active for more than a year by now. Unlike other artificial prototype tagging research projects, it is based on real-world data: works of art in museum collections. Moreover, it is not just shown in pictures in research papers: the software is available at SourceForge (written in PHP). More news about the project can be found on the mailing list and in the blogosphere. I stumbled upon jtran’s blog and his report from the ASIST SIG-CR workshop on social classification that took place in Texas last year. Hopefully someone from the steve.museum team will participate in the Dublin Core conference 2008 or some other event that I participate in!
Tagging enriched with controlled vocabularies
10 September 2007 at 03:36 · 7 Comments

For Connotea, the „Entity Describer“ (ED) has been published, an add-on tool that allows taggers to select terms from a controlled vocabulary such as MeSH. Background information can be found in the blog of its developer Benjamin Good. Up to now, Entity Describer can only be used via a Greasemonkey script. [via Catalogoblog and netbib]
I bet that soon there will be more tagging applications that support controlled vocabularies. For instance, Sarah Hayman and Nick Lothian plan to extend the Education Network Australia (edna) with what they call taxonomy-directed folksonomy. See their IFLA paper (which Patrick pointed me to) for more information.
Benjamin Good also wrote a paper about his work on ED and published it on his blog before even receiving the reviewers’ comments. I like the following discussion on whether and how to publish it; a nice example of the changes in academic publishing. Now the paper is best available as a preprint, identified with hdl:10101/npre.2007.945.1 and licensed under the Creative Commons Attribution 2.5 License (!). Thanks Benjamin, and thanks to Nature for making this possible!
I already cited the work in an ongoing discussion about the Wikipedia article „Folksonomy“. The discussion is mostly about words, and I hate it. Good et al. also add to the confusion: why do they have to introduce a new term („Semantic annotation means the association of a data entity with an element from a classification scheme“) instead of using existing vocabulary? A look at my typology of tagging systems could help clarify things.
Well… or maybe tagging researchers just like to add synonyms and polysemes because they are so used to them; a folksonomy will emerge anyhow, so just call it what you like… 🙁
Persistent Identifiers: Irony of Fate or just absurd?
24 August 2007 at 01:20 · 4 Comments

The report „Implementing Persistent Identifiers: overview of concepts, guidelines and recommendations“ shows you the impracticality of URN and URN:NBN; you do not even have to read any of the report’s 70 pages to find out. If you try the „persistent identifier“ http://nbn-resolving.de/urn:nbn:de:gbv:7-isbn-90-6984-508-3-8 to get the report’s PDF, you get the following message from a resolver at http://resolver.sub.uni-goettingen.de/purl/?isbn-90-6984-508-3:
Unfortunately the URL could not be resolved. None of the underlying local document resolver were able to find a document with the given identifier. Maybe one of the services is down or a document with the number doesn’t exist. As your URL should contain a persistent identifier, please check again later.
I’d call this 404 2.0! Furthermore, at http://www.cerl.org/news.htm one of the report’s publishers (CERL) points to a review of the report at http://www.clir.org/pubs/issues/index.html#found, which gives you the current issue of CLIR Issues (the printed version’s ISSN 1098-6383 is not mentioned anywhere) instead of http://www.clir.org/pubs/issues/issues55.html#found. If you ask Google for the title, you easily find the PDF. If you ask WorldCat for the ISBN 90-6984-508-3, you get a record where you have to click and search a lot to guess which link will bring you to the PDF, but it’s only the unresolvable URN again.
If people are already too dumb to use existing identifier systems (URL, ISBN, ISSN) the right way, I strongly doubt that persistent identifier systems will solve any problem.