Named Entity Recognition with DBPedia « Jakoblog

15 February 2011 at 14:55, 5 comments

Yesterday the DBPedia team released DBPedia Spotlight, a named entity recognition service based on structured data extracted from Wikipedia. You can access the service via Web APIs or download the software as Open Source. I could not resist feeding Spotlight its own description:

DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. Text annotation has the potential of enhancing a wide range of applications including search, faceted browsing and navigation. By connecting text documents with DBpedia, our system enables a range of interesting use cases. For instance, the ontology can be used as background knowledge to display complementary information on web pages or to enhance information retrieval tasks. Moreover, faceted browsing over documents and customization of web feeds based on semantics become feasible. Finally, by following links from DBpedia into other data sources, the Linked Open Data cloud is pulled closer to the Web of Documents.

Pretty cool, isn’t it? Natural Language Processing (NLP) for information extraction seems to be the next hype after Web 2.0 and Semantic Web. I don’t deny the innovative capabilities of DBPedia Spotlight and similar tools, but you should never forget that these are just tools, which won’t automatically solve information problems or replace all other tools. Given the example above, there is little chance that an automatic system will extract the exact topic of the text (for instance “named entity recognition based on data extracted from Wikipedia”), because this requires a lot of background knowledge, combining domain-specific expertise with common sense. By the way: as long as both Wikipedia and NLP software are mainly written by white males, the results will always mirror a limited world-view.

You can compare the results of Spotlight with similar open services:

I found little overlap between the different services. Spotlight seems to provide more results (depending on the text), with an error rate between 10% and 30%. You could use such tools for automatic subject indexing based on abstracts and use the result at least for ranking. Unfortunately, in library metadata we often have no full text or abstract to annotate. Furthermore, many library entities have no DBPedia entry, whereas catalogers create new authority records when needed. What do you think named entity recognition and other NLP techniques can be used for in metadata land? Can we give up controlled subject indexing in libraries in favour of automatic NLP-based indexing on the one side and social tagging on the other? Or is there room for all of these approaches, and how can you successfully combine them?
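One rough way to put a number on the overlap between services is the Jaccard index of the entity sets they return for the same text. The sketch below is only an illustration: the service names and entity identifiers are made-up placeholders, not real output, and it assumes the results of each service have already been mapped to a common identifier scheme, which in practice is the hard part.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of entities found by both services among all entities found by either."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty result sets are treated as identical
    return len(a & b) / len(a | b)


def pairwise_overlap(results):
    """Map each pair of service names to the Jaccard index of their entity sets."""
    return {
        (x, y): jaccard(results[x], results[y])
        for x, y in combinations(sorted(results), 2)
    }


# Hypothetical placeholder data, NOT real service output.
results = {
    "service_a": {"dbpedia:DBpedia", "dbpedia:Linked_data", "dbpedia:Ontology"},
    "service_b": {"dbpedia:DBpedia", "dbpedia:Web_feed"},
}

overlap = pairwise_overlap(results)
for pair, score in overlap.items():
    print(pair, round(score, 2))
```

A value near 0 means the services hardly agree, a value near 1 means they return almost the same entities.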

Tags: NLP, Semantic Web, Wikipedia

5 Comments »

I think DBPedia Spotlight has quite a high error rate, so librarians might want something more conservative that has perhaps lower number of named entities recognized but these entities are more accurate.

Regarding your question about replacing/supporting subject indexing in libraries with such a service, I think we need to re-evaluate why we do manual (a.k.a. intellectual) indexing for digital full-text resources, where we can use powerful information retrieval techniques. Manual indexing will still be the best option in case you don’t have access to the full text of the indexed document.

I think there has been a “schism” since the 60s, when the Cranfield studies by Cyril Cleverdon were published. They showed that manual indexing does not significantly improve information retrieval and that the best results were achieved for retrieval based on full text. However, even with these results, libraries continued to do sophisticated manual indexing. But there was another side: information retrieval based on algorithmic processing of the full text (Google and the like); and this side was quite successful.

To sum up, I think we should think about the purpose for which we’re doing the indexing. If it is for search, it is probably not that useful to have detailed metadata in place. However, I think manual subject indexing is still quite relevant for browsing interfaces, such as faceted navigation. And, as I have mentioned previously, if you don’t have the full text, you need to create metadata manually. If I understood it correctly, you’re suggesting applying NLP techniques to metadata records. I think this won’t work very well: you’re not using a natural language and you’re missing contextual information useful for named entity disambiguation. A few months back, I tried running OpenCalais extraction on some of our bibliographic records, with not very good results.

Sorry for a long comment. 🙂

Comment by Jindřich Mynarz, 15 February 2011

Hi Jakob,
Thanks for testing DBpedia Spotlight and sharing your thoughts.

We say that DBpedia Spotlight does “text annotation”, which I recognize is quite a generic term. 🙂 That is because we would like it to be used for several tasks, ranging from Entity Linking and Named Entity Recognition up to Topic Classification. This doesn’t mean, though, that it already solves all those problems. There is an inherent duality between making a generic tool and solving a specific problem.

We envision DBpedia Spotlight featuring as part of specialized applications where you can tune it for your needs given the requirements at hand.

For example if „librarians might want something more conservative that has perhaps lower number of named entities recognized but these entities are more accurate“, as Jindřich recognizes, one can set higher values for our parameters of confidence and support.

For other applications, it may be more useful to increase recall (with the risk of more errors) because some other process downstream is able to fix some of them (for example, a human who manually oversees the annotation). In those cases, lower confidence values can be set.
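As a rough sketch of what tuning these two parameters could look like from a client, the Python snippet below builds requests to the annotation Web API with different `confidence` and `support` values. The endpoint URL, the default values, and the JSON field names (`Resources`, `@URI`) are assumptions for illustration; check them against the current DBpedia Spotlight documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumption: address of a public annotation endpoint; yours may differ.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5, support=0, endpoint=SPOTLIGHT_URL):
    """Build (but do not send) an annotation request.

    Higher confidence and support values favour precision,
    lower values favour recall.
    """
    query = urllib.parse.urlencode(
        {"text": text, "confidence": confidence, "support": support}
    )
    return urllib.request.Request(
        endpoint + "?" + query, headers={"Accept": "application/json"}
    )


def annotate(text, **params):
    """Send the request and return the list of annotated resource URIs.

    Assumes a JSON response with a "Resources" list whose items
    carry an "@URI" key.
    """
    with urllib.request.urlopen(build_request(text, **params), timeout=30) as resp:
        data = json.load(resp)
    return [resource["@URI"] for resource in data.get("Resources", [])]


# Conservative settings, e.g. for unattended subject indexing.
strict = build_request("Berlin is the capital of Germany.", confidence=0.8, support=100)
# Permissive settings, e.g. when a human reviews the annotations afterwards.
lenient = build_request("Berlin is the capital of Germany.", confidence=0.2)
print(strict.full_url)
print(lenient.full_url)
```

Calling `annotate(...)` with the same keyword arguments would send the request; it is kept separate here so that building and inspecting the request needs no network access.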

I hope this helps to clarify a bit.

Best regards,
Pablo

Comment by Pablo Mendes, 15 February 2011

I agree with Jindřich: one reason Google’s analysis is so powerful is that they have an enormous corpus of full text to apply NLP to. Text mining is also part of the digital humanities. So where do libraries fit in, and what could be their role in fields where it seems others are already doing a good job? So far they only work with the “surrogate” (i.e. bibliographic description) of the “real thing”, representations of the content (which is still primarily held in print form). But this alone won’t allow us to move forward, and in the long run it won’t be enough in the digital age, with examples such as Google setting the benchmark.

Comment by Saskia, 16 February 2011

What do you mean by “Natural Language Processing (NLP) for information extraction seems to be the next hype after Web 2.0 and Semantic Web.”?
Using NLP for IE has been standard since the early 90s and is done today at web scale.

Comment by Günter, 18 February 2011

Günter, you are right: NLP has been used for decades. But Web 2.0 and Semantic Web were also about propagating ideas that had been around for decades. A hype does not imply that something is new, but it must look new and it must generate high expectations. I think that this will happen with NLP on the Web.

Comment by jakob, 24 February 2011

Jakoblog: the weblog of Jakob Voß