Michael Färber, Achim Rettinger and Boulos El Asmar
On Emerging Entity Detection

Abstract: While large Knowledge Graphs (KGs) already cover a broad range of domains to an extent sufficient for general use, they typically lack emerging entities that are just starting to attract public interest. This disqualifies such KGs from tasks like entity-based media monitoring, since a large portion of news inherently covers entities that have not been noted by the public before. Such entities are unlinkable, which ultimately means that they cannot be monitored in media streams. This is the first paper that thoroughly investigates all types of challenges that out-of-KG entities pose for entity linking tasks. Through large-scale analysis of news streams, we quantify the importance of each challenge for real-world applications. We then propose a machine learning approach that tackles the most frequent but least investigated challenge, i.e., when entities are missing from the KG and cannot be considered by entity linking systems. We construct a publicly available benchmark data set based on English news articles and editing behavior on Wikipedia. Our experiments show that predicting whether an entity will be added to Wikipedia is challenging. However, we can reliably identify emerging entities that could be added to the KG according to Wikipedia's own notability criteria.

Supplementary Material:

Wikipedia Diff:

The following data is available upon request (michael [at] faerber [dot] edu), in accordance with data protection laws:
  • Entities inserted into Wikipedia between 2015-04-04 and 2015-05-15
  • "Real novel entities" inserted into Wikipedia between 2015-04-04 and 2015-05-15 (excluding insertions of redirects and pure changes of the Wikipedia page titles)
  • Surface forms added to Wikipedia between 2015-04-04 and 2015-05-15
  • Entities deleted from Wikipedia between 2015-04-04 and 2015-05-15
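
The distinction between inserted entities and "real novel entities" in the list above can be illustrated with a small sketch. The page does not describe how the diff was actually computed, so everything here is an assumption: the snapshot representation (a mapping from page title to a page record), the `Page` class, the `page_id` and `redirect_to` fields, and the `diff_snapshots` function are hypothetical names introduced only for illustration. The sketch treats a title as a pure title change when its page identifier already existed in the earlier snapshot.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Page:
    """Hypothetical page record; not taken from the original data set."""
    page_id: int                       # assumed to stay stable when a page is renamed
    redirect_to: Optional[str] = None  # target title if the page is a redirect


Snapshot = Dict[str, Page]  # page title -> page record


def diff_snapshots(old: Snapshot, new: Snapshot) -> Dict[str, Set[str]]:
    """Compare two title snapshots and classify the differences."""
    inserted = set(new) - set(old)   # titles present only in the later snapshot
    deleted = set(old) - set(new)    # titles present only in the earlier snapshot
    old_ids = {page.page_id for page in old.values()}
    # Assumption: a "real novel entity" is an inserted title that is neither
    # a redirect nor a renamed page that already existed under another title.
    real_novel = {
        title
        for title in inserted
        if new[title].redirect_to is None and new[title].page_id not in old_ids
    }
    return {"inserted": inserted, "real_novel": real_novel, "deleted": deleted}


# Toy snapshots (invented titles, not real data).
old_snapshot = {"A": Page(1), "B": Page(2), "C": Page(3)}
new_snapshot = {
    "A": Page(1),
    "B2": Page(2),                  # page "B" renamed to "B2"
    "B": Page(4, redirect_to="B2"), # redirect left behind at the old title
    "D": Page(5),                   # genuinely new page
    "E": Page(6, redirect_to="A"),  # newly inserted redirect
}
result = diff_snapshots(old_snapshot, new_snapshot)
```

In this toy example only "D" counts as a real novel entity: "B2" is a renamed page and "E" is a redirect, although both appear among the inserted titles.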

Entity Linking Challenges "in the Wild":

Benchmarks for Emerging Entity Detection:

* determined by NERC tagging of the Stanford parser

Manual Evaluation of Top 100 Entity Candidates Classified as False Positive:

Wikipedia Notability Guidelines:

See https://en.wikipedia.org/wiki/Wikipedia:Notability and our quantification for the manual evaluation.

By Michael Färber, Achim Rettinger, and Boulos El Asmar, 2016