Dead End Electronic Resource Citation (ERC)

March 29, 2013 at 11:51 No comments

Tidying up my PhD notes, I found this short rant about "Electronic Resource Citation". I have not used it anywhere, so I am publishing it here, licensed under CC-BY-SA.

Electronic Resource Citation (ERC) was introduced by John Kunze with a presentation at the International Conference on Dublin Core and Metadata Applications 2001 and with a paper in the Journal of Digital Information, Vol. 2, No. 2 (2002). Kunze cited his paper in a call for an ERC Interest Group within the Dublin Core Metadata Initiative (DCMI) on the PERL4LIB mailing list, giving the following example of an ERC:

erc:  Kunze, John A. | A Metadata Kernel for Electronic Permanence
      | 20011106 | http://jodi.ecs.soton.ac.uk/Articles/v02/i02/Kunze/

An ERC is a minimal "kernel" metadata record that consists of four elements: who, what, when, and where. In the given example these are:

who:   Kunze, John A.
what:  A Metadata Kernel for Electronic Permanence
when:  20011106
where: http://jodi.ecs.soton.ac.uk/Articles/v02/i02/Kunze/
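
Since an ERC record in this simple one-line form is just four values separated by vertical bars, splitting it is trivial. Here is a minimal sketch in Perl (a hypothetical illustration, not part of any ERC tool; it assumes the plain one-line form above and ignores the spec's special value encodings):

# minimal sketch: split a one-line ERC record into its four kernel elements
my $erc = 'erc: Kunze, John A. | A Metadata Kernel for Electronic Permanence'
        . ' | 20011106 | http://jodi.ecs.soton.ac.uk/Articles/v02/i02/Kunze/';

$erc =~ s/^erc:\s*//;                # strip the record label
my ($who, $what, $when, $where) =
    map { s/^\s+|\s+$//g; $_ }       # trim surrounding whitespace
    split /\|/, $erc;

print "who:   $who\nwhat:  $what\nwhen:  $when\nwhere: $where\n";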

Ironically, the given URL is obsolete; the host 'jodi.ecs.soton.ac.uk' does not even exist anymore. An ERC is pretty useless if it just uses a fragile URL to cite a resource. How about some value that does not change over time, e.g.:

where: Journal of Digital Information, Volume 2 Issue 2

As the 'where' element is defined as "a location or machine-oriented identifier", one could also use stable identifiers:

where: ISSN 1368-7506, Article No. 81

Both the ISSN and the article number are much more stable identifiers than URLs. Citing a URL is more like:

where: at the desk in the little reading room of my library

By the way, the message is currently located at http://www.rice.edu/perl4lib/archives/2002-09/msg00017.html – but who knows whether Rice University will still host the mailing list archive at this URL in 20 years?

There are some interesting ideas in the original ERC proposal (different kinds of missing values, TEMPER date values, the four questions, etc.), but its specification and implementation are just ridiculous and lack references to current technology. You know that you are doing something wrong in a specification if you start to define your own encodings for characters, dates, etc. instead of concentrating on your core subject and referring to existing specifications for the rest. The current draft (2010) is a typical example of badly mixing modeling and encoding issues and of losing touch with existing, established data standards.

In addition to problems at the "low level" of encoding, the "high level" of conceptual modeling lacks appropriate references. What is the relation of ERC concepts to models such as FRBR and CIDOC-CRM? Why are 'who', 'when', 'where', and 'what' the important metadata fields (in many cases the most interesting question is 'why')? How about Ranganathan's Colon Classification with personality, matter, energy, space, and time?

In summary, the motivation behind ERC contains some good ideas, but its form is misdirected.

Citation parsing

January 24, 2008 at 19:09 6 comments

Citation analysis is used to rate authors (problematic) and to find interesting papers (a good idea). Citations of papers at the famous arXiv.org preprint server are analysed by CiteBase, which is very useful. Unfortunately it is buggy and does not always work. I really wonder why the full text of a paper is parsed instead of using the BibTeX source. The citation parser ParaCite was developed in the Open Citation Project; since then it seems to be more or less abandoned. But it is open source, so you can test your papers before uploading them, and one could take the suitable parts to build a better citation parser. I found out that this way you can extract the citations from a document in $file (for instance a PDF) with Perl (the needed modules are available on CPAN):

use Data::Dumper;
use Biblio::Citation::Parser::Citebase;    # reference parser from ParaCite
use Biblio::Document::Parser::Utils;
use Biblio::Document::Parser::Brody;

# read the document and extract the raw reference strings
my $parser = Biblio::Citation::Parser::Citebase->new;
my $content = Biblio::Document::Parser::Utils::get_content( $file );
my $doc_parser = Biblio::Document::Parser::Brody->new;
my @references = $doc_parser->parse($content);

# parse each reference string into structured metadata
for (my $i=0; $i < @references; $i++) {
    my $metadata = $parser->parse( $references[$i] );
    print '[' . ($i+1) . '] ' . Dumper( $metadata ) . "\n";
}

In the documents that I tested there were almost always parsing errors, but that is better than nothing. I wonder what CiteSeer uses to extract citations? There is more action in citation parsing in the Zotero project – there is even an IDE called Scaffold to create new "translators" that extract bibliographic data out of webpages. Another playground is Wikipedia, which contains a growing number of references. And of course there are the commercial citation indexes like SCI. I thought about using citation data for additional catalog enrichment (in addition to ISBN2Wikipedia), but the quality of the data seems to be too low and identifiers are missing.

P.S.: Right after writing this, I found Alf Eaton's experiment with collecting the conversations around a paper from various academic, news, blog, and other discussion channels – as soon as you have identifiers (ISBN, URL, DOI, PMID…) the world gets connected 🙂
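
A few regular expressions are often enough to spot such identifiers in free text. A rough Perl sketch (hypothetical; the patterns are simplified approximations and the citation string is made up):

# rough sketch: spot common identifiers in a free-text citation
# (simplified patterns; real-world identifier matching needs more care)
my $ref = 'Example, A. (2002): Some paper. Journal of Examples 5(2). '
        . 'doi:10.1234/example-5678, PMID: 12345678.';

my %found;
$found{DOI}  = $1 if $ref =~ m{\b(10\.\d{4,9}/[^\s,;]+)};
$found{PMID} = $1 if $ref =~ /\bPMID:?\s*(\d+)/i;
$found{ISBN} = $1 if $ref =~ /\bISBN(?:-1[03])?:?\s*([\dXx][\dXx -]{8,16})/;
$found{URL}  = $1 if $ref =~ m{(https?://[^\s,;]+)}i;

print "$_: $found{$_}\n" for sort keys %found;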

P.P.S.: ParsCit seems to be a good new reference string parsing package (open source, written in Perl).

P.P.P.S.: Konstantin Baierer maintains a bibliography on citation parsing for his parser Citation::Multi::Parser.