Yesterday the DBPedia team released DBPedia Spotlight, a named entity recognition service based on structured data extracted from Wikipedia. You can access the service via Web APIs or download the software as Open Source. I could not resist to feed Spotlight its own description:
DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. Text annotation has the potential of enhancing a wide range of applications including search, faceted browsing and navigation. By connecting text documents with DBpedia, our system enables a range of interesting use cases. For instance, the ontology can be used as background knowledge to display complementary information on web pages or to enhance information retrieval tasks. Moreover, faceted browsing over documents and customization of web feeds based on semantics become feasible. Finally, by following links from DBpedia into other data sources, the Linked Open Data cloud is pulled closer to the Web of Documents.
Pretty cool, isn’t it? Natural Language Processing (NLP) for information extraction seems to be the next hype after Web 2.0 and Semantic Web. I don’t neglect the innovative capabilities of DBPedia Spotlight and similar tools, but you should never forget that these are just tools, which won’t automatically solve information problems, or replace all other tools. Given the example above, there is little chance that an automatic system will extract you an exact topic of the text (for instance “named entity recognition based on data extracted from Wikipedia”) because this requires much background knowledge combining domain-specific expertise with common sense. By the way: as long as both Wikipedia and NLP-software is mainly written by white males, the result of will always mirror a limited world-view.
You can compare the results of Spotlight with similar open services:
I found little overlap between the different services. Spotlight seems to provide more results (depending on the Text) on an error rate between 10% and 30%. You could use such tools for automatic subject indexing based on abstracts and use the result at least for ranking. Unfortunately in library metadata we often have no full text or abstract to annotate. Furthermore many library entities have no DBPedia entry but catalogers create new authority records if needed. What do you think, named entity recognition and other NLP techniques can be used for in metadata land? Can we give up controlled subject indexing in libraries in favour of automatic NLP-based indexing on the one side and social tagging on the other? Or is room for all of these approaches, and how can you successfully combine them?