Citation parsing

24 January 2008 at 19:09 · 6 comments

Citation analysis is used to rate authors (problematic) and to find interesting papers (a good idea). Citations of papers at the famous preprint server are analysed by CiteBase, which is very useful. Unfortunately it is buggy and does not always work. I really wonder why the full text of a paper is parsed instead of using the BibTeX source. The citation parser ParaCite was developed in the Open Citation Project; since then it seems to be more or less abandoned. But it's open source, so you can test your papers before uploading, and one could take the suitable parts to build a better citation parser. I found out that this way you can extract the citations from a document in $file (for instance a PDF) with Perl (the needed modules are available on CPAN):

use Data::Dumper;
use Biblio::Citation::Parser::Citebase;
use Biblio::Document::Parser::Utils;
use Biblio::Document::Parser::Brody;

my $file = shift @ARGV;  # the document to parse, for instance a PDF
my $parser = Biblio::Citation::Parser::Citebase->new;
my $content = Biblio::Document::Parser::Utils::get_content( $file );
my $doc_parser = Biblio::Document::Parser::Brody->new;
my @references = $doc_parser->parse( $content );

for (my $i = 0; $i < @references; $i++) {
    my $metadata = $parser->parse( $references[$i] );
    print '[' . ($i+1) . '] ' . Dumper( $metadata ) . "\n";
}

In the documents that I tested there are almost always parsing errors, but that's better than nothing. I wonder what CiteSeer uses to extract citations? There is more action in citation parsing in the Zotero project – even an IDE called Scaffold to create new "translators" that extract bibliographic data from webpages. Another playground is Wikipedia, which contains a growing number of references. And of course there are the commercial citation indexes like SCI. I thought about using citation data for additional catalog enrichment (in addition to ISBN2Wikipedia), but the quality of the data seems to be too low and identifiers are missing.

P.S: Right after writing this, I found Alf Eaton's experiment with collecting the conversations around a paper from various academic, news, blog and other discussion channels – as soon as you have identifiers (ISBN, URL, DOI, PMID…) the world gets connected 🙂

P.P.S: ParsCit seems to be a good new reference string parsing package (open source, written in Perl).
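A quick look at the ParsCit package suggests that extraction is driven by a command-line script, so trying it on your own papers should be as simple as something like the following. This is only a sketch – I have not run it yet, and the script name, the -m mode flag, and the file names are my reading of the distribution, not verified:

```shell
# Extract and parse the reference strings from a plain-text paper;
# ParsCit writes the parsed fields out as XML.
./citeExtract.pl -m extract_citations paper.txt paper.xml
```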

P.P.P.S: Konstantin Baierer maintains a bibliography on citation parsing for his parser Citation::Multi::Parser.



  1. How is ISI doing that nowadays? Their business is built on citations… When I visited them about 9 years ago they processed mostly paper articles (that must have changed?!?). They scanned them and marked citations on the images manually, then ran OCR on those parts and (if I remember correctly) parsed the OCR results automatically, with manual postprocessing afterwards to check the results. I was quite impressed by the highly ergonomic software they had developed for the whole process.
    What are they doing today to get citations? I think they must have good parsers…

    Comment by till — 25 January 2008

  2. Funny, some weeks ago I did some experimentation with the Citebase module. There was a great variety in the quality of the parsed results: Regular journal articles (in APA style) don’t seem to be problematic but the results get significantly worse for proceedings or chapters in edited books. I’d like to know what algorithms Google, CiteSeer (or ISI) are using…


    Comment by Peter — 25 January 2008

  3. ParsCit seems pretty exciting to me.

    My interest in this is with the Umlaut open source link resolver–I’d like to be able to allow users to paste in natural language citations, and then have Umlaut services trigger. Right now, in every link resolver I know of, they need to enter fields themselves.

    Comment by Jonathan Rochkind — 8 May 2008

  4. ParsCit (mentioned in your P.P.S.) is now used to power CiteSeer^x. I (and past colleagues) have been collaborating with the PSU team on extensive field testing of the system.

    ParsCit has been successfully used to parse over 20 million citations within the CiteSeer^x framework (I’m quoting Isaac Councill on this figure).

    We (especially I) would be very happy to get further feedback on how the ParsCit package can help others in the community.



    Comment by Min-Yen Kan — 12 May 2008

  5. A project that I have been planning for a long time is parsing the citations in Wikipedia articles. There will be a paper on citation analysis and Wikipedia at this year's Wikimania, but it is still pretty limited. Most citations in Wikipedia are in free text with some formatting, like in any other paper. I will try ParsCit, but not before autumn I think.

    Comment by Jakob — 12 May 2008

  6. Half a year ago I developed a fuzzy-based concept for extracting citation elements (article/chapter title, journal title / book title, author, vol., no., pages). Citations can be checked title by title in Google Scholar/Books and SFX, for example, and can be converted to EndNote format (for import by Zotero or Citavi). Some parts of the concept can be tested in the free Flash/PHP web tool:

    You can paste your own list instead of the example list (after refreshing the page)

    Comment by erik senst — 11 December 2008
