Comments on: Citation parsing

From: erik senst

erik senst — Thu, 11 Dec 2008 11:26:26 +0000

Half a year ago I developed a fuzzy-based concept for extracting citation elements (article/chapter title, journal/book title, author, volume, number, pages). Citations can be checked title by title in Google Scholar/Books and SFX, for example, and can be converted to EndNote format (for import into Zotero or Citavi). Some parts of the concept can be tested in the free Flash/PHP web tool: you can paste your own list in place of the example list (after refreshing the page).

From: Jakob

Jakob — Mon, 12 May 2008 19:27:57 +0000

A project that I have been meaning to do for a long time is parsing citations in Wikipedia articles. There will be a paper on citation analysis and Wikipedia at this year's Wikimania, but it is still pretty limited. Most citations in Wikipedia are free text with some formatting, as in any other paper. I will try ParsCit, but not before autumn, I think.

From: Min-Yen Kan

Min-Yen Kan — Mon, 12 May 2008 08:01:26 +0000

ParsCit (mentioned in your P.P.S.) is now used to power CiteSeer^x. I (and past colleagues) have been collaborating closely with the PSU team on extensive field testing of the system.

ParsCit has been successfully used to parse over 20 million citations within the CiteSeer^x framework (I’m quoting Isaac Councill on this figure).

We (especially me) would be very happy to get further feedback on how the ParsCit package can help others in the community.

Cheers,

Min

From: Jonathan Rochkind

Jonathan Rochkind — Thu, 08 May 2008 13:48:36 +0000

ParsCit seems pretty exciting to me.

My interest in this is with the Umlaut open source link resolver: I’d like to let users paste in natural-language citations and then have Umlaut services trigger. Right now, in every link resolver I know of, users need to enter the fields themselves.
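
To make the idea concrete, here is a rough sketch of the step that would come after parsing: turning already-extracted citation fields into an OpenURL 1.0 (Z39.88-2004) key/value query that a link resolver could act on. The resolver address, the function name `citation_to_openurl`, and the input field names are all made up for illustration; this is not Umlaut's or ParsCit's actual interface, and it assumes the resolver accepts OpenURL requests in the query string.

```python
from urllib.parse import urlencode

# Hypothetical resolver address, used only for illustration.
RESOLVER_BASE = "https://resolver.example.org/openurl"

# Maps our own (assumed) field names to OpenURL journal-format keys.
FIELD_TO_KEV = {
    "title": "rft.atitle",
    "journal": "rft.jtitle",
    "author_last": "rft.aulast",
    "year": "rft.date",
    "volume": "rft.volume",
    "issue": "rft.issue",
    "start_page": "rft.spage",
}


def citation_to_openurl(fields, base=RESOLVER_BASE):
    """Build an OpenURL query from a dict of parsed citation fields."""
    kev = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    }
    for name, key in FIELD_TO_KEV.items():
        value = fields.get(name)
        if value:  # skip fields the parser could not extract
            kev[key] = value
    return base + "?" + urlencode(kev)


# Invented example record, standing in for a parser's output.
parsed = {
    "title": "Parsing references with simple patterns",
    "journal": "Journal of Hypothetical Results",
    "author_last": "Doe",
    "year": "2007",
    "volume": "12",
    "issue": "3",
    "start_page": "45",
}
url = citation_to_openurl(parsed)
```

Fields the parser failed to extract are simply left out, so the resolver still gets a usable, if less precise, request.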

From: Peter

Peter — Fri, 25 Jan 2008 11:26:34 +0000

Funny, some weeks ago I experimented with the Citebase module. The quality of the parsed results varied greatly: regular journal articles (in APA style) don’t seem to be problematic, but the results get significantly worse for proceedings or chapters in edited books. I’d like to know what algorithms Google, CiteSeer (or ISI) are using…
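
As a toy illustration of why this might happen (it is not how the Citebase module works, and the pattern and sample citations are invented), a single regular expression can handle a tidy APA-style journal reference but simply fails on a chapter in an edited book, which has a different shape:

```python
import re

# Naive pattern for: Authors (Year). Title. Journal, Vol(Issue), Pages.
APA_JOURNAL = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>.+?),\s+"
    r"(?P<volume>\d+)(?:\((?P<issue>\d+)\))?,\s+"
    r"(?P<pages>\d+\s*[-–]\s*\d+)\.?$"
)


def parse_apa_journal(citation):
    """Return a dict of fields, or None if the pattern does not fit."""
    match = APA_JOURNAL.match(citation.strip())
    return match.groupdict() if match else None


# Invented sample references.
journal_ref = (
    "Doe, J., & Roe, R. (2007). Parsing references with simple patterns. "
    "Journal of Hypothetical Results, 12(3), 45-67."
)
chapter_ref = (
    "Doe, J. (2007). A chapter title. In R. Roe (Ed.), "
    "A hypothetical handbook (pp. 45-67). Example Press."
)

journal_fields = parse_apa_journal(journal_ref)  # all fields extracted
chapter_fields = parse_apa_journal(chapter_ref)  # None: shape does not fit
```

Every additional reference type needs its own pattern or a more flexible (for example, statistical) model, which may be part of why results degrade outside the common journal-article case.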

Peter

From: till

till — Fri, 25 Jan 2008 07:46:51 +0000

How is ISI doing that nowadays? Their business is built on citations... When I visited them about 9 years ago, they processed mostly paper articles (that must have changed?!?). They scanned them and marked the citations on the images manually, then ran OCR on those parts and (if I remember correctly) parsed the OCR results automatically, with manual postprocessing afterwards to check the results. I was quite impressed by the highly ergonomic software they had developed for the whole process. What are they doing today to get citations? I think they must have good parsers...