Michael Färber, Achim Rettinger and Boulos El Asmar
On Emerging Entity Detection
Abstract: While large Knowledge Graphs (KGs) already cover a broad range of domains to
an extent sufficient for general use, they typically lack emerging
entities that are just starting to attract the public interest. This
disqualifies such KGs for tasks like entity-based media monitoring,
since a large portion of news inherently covers entities that have not
been noted by the public before. Such entities are unlinkable, which
ultimately means that they cannot be monitored in media streams. This is
the first paper that thoroughly investigates all types of challenges
that arise from out-of-KG entities for entity linking tasks. Through
large-scale analytics of news streams, we quantify the importance of
each challenge for real-world applications. We then propose a machine
learning approach which tackles the most frequent but least
investigated challenge, i.e., when entities are missing from the KG and
cannot be considered by entity linking systems. We construct a publicly
available benchmark data set based on English news articles and editing
behavior on Wikipedia. Our experiments show that predicting whether an
entity will be added to Wikipedia is challenging. However, we can
reliably identify emerging entities that could be added to the KG
according to Wikipedia's own notability criteria.
Supplementary Material:
Wikipedia Diff:
The following data is available upon request (michael [at] faerber [dot] edu), in accordance with data protection laws:
- Entities inserted into Wikipedia between 2015-04-04 and 2015-05-15
- "Real novel entities" inserted into Wikipedia between 2015-04-04 and 2015-05-15 (excluding insertions of redirects and pure changes of the Wikipedia page titles)
- Surface forms added to Wikipedia between 2015-04-04 and 2015-05-15
- Entities deleted from Wikipedia between 2015-04-04 and 2015-05-15
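As a rough illustration of how the categories in the list above relate to each other, the following minimal sketch compares two sets of page titles. It is not the procedure used to produce the data. The function name `diff_snapshots`, the inputs `redirect_titles` and `renames`, and the decision to count a renamed page's old title as deleted are all assumptions made for this example.

```python
def diff_snapshots(old_titles, new_titles, redirect_titles=frozenset(), renames=None):
    """Compare two snapshots of page titles (hypothetical helper).

    old_titles / new_titles: iterables of page titles at the two dates.
    redirect_titles: titles in the new snapshot that are redirects (assumed input).
    renames: dict mapping an old title to its new title (assumed input).
    """
    renames = renames or {}
    old, new = set(old_titles), set(new_titles)

    inserted = new - old   # titles present only in the later snapshot
    deleted = old - new    # titles present only in the earlier snapshot

    # "Real novel entities": insertions that are neither redirects
    # nor the result of a pure title change.
    renamed_targets = set(renames.values())
    real_novel = {
        title for title in inserted
        if title not in redirect_titles and title not in renamed_targets
    }
    return {"inserted": inserted, "deleted": deleted, "real_novel": real_novel}


# Toy example with made-up titles.
example = diff_snapshots(
    old_titles={"A", "B", "C"},
    new_titles={"A", "C2", "D", "E"},
    redirect_titles={"E"},
    renames={"C": "C2"},
)
```

In this toy example, `C2` and `E` count as inserted titles but only `D` counts as a real novel entity, since `C2` is a renamed page and `E` is a redirect.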
Entity Linking Challenges "in the Wild":
Benchmarks for Emerging Entity Detection:
- Complete data set (only initial filtering performed, e.g., considering only noun phrases with at least three alphanumeric characters)
- Filtered data set (restricted to named entities*; this data set contains all features per noun phrase)
- Data set with NP series of filtered data set (used in experiments)
* determined by NERC tagging of the Stanford parser
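The initial filtering step mentioned for the complete data set could look roughly like the sketch below. It assumes that "at least three alphanumeric characters" refers to the total count within a noun phrase rather than to consecutive characters. The function names `has_min_alnum` and `initial_filter` are hypothetical and not taken from the benchmark code.

```python
def has_min_alnum(phrase, minimum=3):
    """Return True if the phrase contains at least `minimum` alphanumeric characters."""
    return sum(ch.isalnum() for ch in phrase) >= minimum


def initial_filter(noun_phrases, minimum=3):
    """Keep only noun phrases that pass the alphanumeric-length check (assumed rule)."""
    return [phrase for phrase in noun_phrases if has_min_alnum(phrase, minimum)]


# Made-up noun phrases for illustration.
candidates = ["EU", "U.S.", "Apple Watch", "A-1", "X1z"]
kept = initial_filter(candidates)
```

Here `kept` contains only "Apple Watch" and "X1z". The other candidates have fewer than three alphanumeric characters once punctuation is ignored.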
Manual Evaluation of Top 100 Entity Candidates Classified as False Positive:
Wikipedia Notability Guidelines:
See https://en.wikipedia.org/wiki/Wikipedia:Notability and our quantification for the manual evaluation.