This page presents various datasets related to the research of Andreas Thalhammer.

Wikidata PageRank

The Wikidata PageRank dataset was updated in January 2017 (originally created in August 2016) and provides PageRank scores for 16,249,698 Wikidata entities.

The creation of this dataset involved addressing a single challenge: different language editions of Wikipedia cover different pages. Using only the biggest Wiki project (i.e., the English Wikipedia) would cover ~5.9 million articles, so PageRank on the English Wikipedia could cover only a fraction of Wikidata. In addition, PageRank scores are usually strongly influenced by the specific language edition; for example, the articles with the highest PageRank scores in the English, Russian, and Chinese Wikipedia differ from each other. This individual bias is expected when the language editions are treated separately (as we do in DBpedia), but it becomes a problem when we have "one knowledge base for all" (as in Wikidata). Therefore, we merged 123 link datasets from different language editions of Wikipedia (represented with Wikidata URIs). In particular, we merged the following input link datasets:

http://downloads.dbpedia.org/2016-04/core-i18n/LANG/page_links_wkd_uris_LANG.tql.bz2 with LANG one of:

en, de, fr, ja, sv, it, es, ru, nl, pl, pt, zh, ceb, war, uk, ca, vi, cs, no, hu, fi, ko, he, fa, ar, sh, ro, id, tr, sr, eo, da, bg, sk, lt, sl, ms, eu, hr, et, gl, el, hy, simple, th, la, nn, be, bs, kk, mk, ka, lv, oc, az, ur, hi, ta, cy, br, an, lb, ast, af, vo, tt, fy, te, is, bn, ce, sq, ml, pms, uz, gu, jv, mr, new, tl, sco, sw, io, nds, als, mg, pnb, qu, ba, ga, lmo, ht, cv, my, scn, ku, ne, mn, kn, bpy, tg, su, yi, gd, fo, ckb, nap, ky, arz, bar, wa, yo, ia, vec, pa, bug, sa, sah, am, mzn, si, nah, mt
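
As a small illustration of the URL pattern above, the following Python sketch expands the template for a handful of language codes. The names `URL_TEMPLATE`, `link_dataset_urls`, and `sample_langs` are hypothetical helpers introduced here, not part of the original tooling, and only four of the 123 codes are shown.

```python
# Template copied from the pattern above; LANG appears twice in each URL.
URL_TEMPLATE = (
    "http://downloads.dbpedia.org/2016-04/core-i18n/"
    "{lang}/page_links_wkd_uris_{lang}.tql.bz2"
)


def link_dataset_urls(langs):
    """Return one download URL per language code, in the given order."""
    return [URL_TEMPLATE.format(lang=lang) for lang in langs]


# A few codes from the list above, for brevity (assumption: any subset works the same way).
sample_langs = ["en", "de", "fr", "simple"]
urls = link_dataset_urls(sample_langs)

for url in urls:
    print(url)
```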



We fused all these link datasets into one big link dataset. At this point, a particularity of links in Wikipedia comes in handy: within an article, a link to another article should only be set once ("Generally, a link should appear only once in an article,..."). This is in line with the set semantics of DBpedia's link datasets; we can assume (and it is asserted by the DBpedia Extraction Framework) that none of the above datasets contains duplicate links. Therefore, we can treat a link that occurs multiple times in the merged dataset as one "vote" per source (en, es, fr, ...). We computed the Wikidata PageRank dataset accordingly. As an example, consider the following three links (with their provenance):
[A->B (en), A->B (zh), A->C (de)]
In normal PageRank, A gives 2/3 of its PageRank to B and 1/3 of its PageRank to C (considering a damping factor of 1). Basically, this is how we computed the dataset: normal PageRank, but leveraging the one-link policy to reduce language-specific bias. The detailed configuration was as follows:

We could increase the coverage from ~5.94 million entities (biggest project, i.e., English Wikipedia) to ~16.25 million entities. We name this dataset ULTIMATE. We also computed two rankings for comparison: EN (English Wikipedia only) and ULTIMATE \ EN (a merge of the 122 Wikipedia link datasets without English).
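
The voting idea from the A/B/C example can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the dataset: the function names, the damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from nodes without outgoing links are all assumptions made here.

```python
from collections import Counter


def transition_weights(links):
    """Turn (source, target, language) triples into per-source link weights.

    Each distinct language edition contributes one vote for source -> target.
    """
    votes = Counter()
    seen = set()
    for source, target, lang in links:
        if (source, target, lang) in seen:
            continue  # set semantics: one vote per language edition
        seen.add((source, target, lang))
        votes[(source, target)] += 1

    out_total = Counter()
    for (source, _target), count in votes.items():
        out_total[source] += count
    return {(s, t): count / out_total[s] for (s, t), count in votes.items()}


def pagerank(links, damping=0.85, iterations=40):
    """Plain power-iteration PageRank over the vote-weighted links.

    damping and iterations are illustrative defaults, not the original settings.
    """
    weights = transition_weights(links)
    nodes = sorted({node for edge in weights for node in edge})
    n = len(nodes)
    has_out = {s for s, _t in weights}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly (assumption).
        dangling = sum(rank[node] for node in nodes if node not in has_out)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for (s, t), w in weights.items():
            new_rank[t] += damping * w * rank[s]
        rank = new_rank
    return rank


example_links = [("A", "B", "en"), ("A", "B", "zh"), ("A", "C", "de")]
weights = transition_weights(example_links)
ranks = pagerank(example_links)
print(weights)
print(ranks)
```

For the three example links, `weights` assigns 2/3 of A's outgoing weight to B and 1/3 to C, which matches the split described above.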

Download

All files are sorted by score (ascending).

== Wikidata ULTIMATE ==
== Wikidata top 10 (en, es, fr, de, zh, ru, pt, it, ar, ja) ==
Outdated: a merge of the "top 10 languages", i.e., the ten Wikipedia language editions with the most users, each with more than one million users (see List of Wikipedias by edits per article)

License and Attribution

The Wikidata PageRank dataset is published under the Creative Commons Attribution-ShareAlike License.
Further information:

DBpedia PageRank

The latest version of the DBpedia PageRank dataset was created in August 2016. Older versions were created in May 2013 (3.8 en), December 2013 (3.9 en), June 2014 (3.9 es, de), September 2014 (2014 en, de, es, fr, it, ru), November 2014 (2014 zh), August 2015 (DBpedia 2015-04), and January 2016 (DBpedia 2015-10). To avoid confusion we would like to highlight that the dataset only uses dbp-ont:wikiPageWikiLink predicates for computing PageRank. There are two reasons why we call it "DBpedia PageRank" rather than "Wikipedia PageRank":

  1. The Wikipedia link structure is extracted by the DBpedia Extraction Framework.
  2. We use DBpedia URIs to identify resources and publish the results in Turtle format.
The following settings were used:

Please ping me if you would like to see additional languages.

License and Attribution

The DBpedia PageRank dataset is published under the Creative Commons Attribution-ShareAlike License.

Access

The English version of this dataset is deployed on the official DBpedia SPARQL endpoint in the named graph
http://people.aifb.kit.edu/ath/#DBpedia_PageRank


An example query is:
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo:<http://dbpedia.org/ontology/>
PREFIX vrank:<http://purl.org/voc/vrank#>

SELECT ?s ?v
FROM <http://dbpedia.org>
FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
WHERE {
  ?s rdf:type dbo:University .
  ?s vrank:hasRank/vrank:rankValue ?v .
}
ORDER BY DESC(?v) LIMIT 50

Copy and paste the query into http://dbpedia.org/sparql to run it.
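
The same query can also be sent programmatically. The Python sketch below builds a GET request URL for the endpoint using only the standard library. The helper names `build_request_url` and `top_universities` are hypothetical, the `format` parameter is an assumption about what the endpoint accepts for JSON results, and whether the named graph is still deployed there is not guaranteed. The fetch function needs network access and is therefore defined but not called.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """\
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo:<http://dbpedia.org/ontology/>
PREFIX vrank:<http://purl.org/voc/vrank#>

SELECT ?s ?v
FROM <http://dbpedia.org>
FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
WHERE {
  ?s rdf:type dbo:University .
  ?s vrank:hasRank/vrank:rankValue ?v .
}
ORDER BY DESC(?v) LIMIT 50
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET request URL for the SPARQL endpoint."""
    params = {
        "query": query,
        # Assumption: the endpoint honours this parameter for JSON results.
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


def top_universities(url):
    """Fetch the results and return (resource URI, score) pairs. Needs network access."""
    with urlopen(url) as response:
        data = json.load(response)
    return [
        (binding["s"]["value"], float(binding["v"]["value"]))
        for binding in data["results"]["bindings"]
    ]


request_url = build_request_url(QUERY)
print(request_url)
```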

For modeling the ranks and the attached scores we used the vRank vocabulary [1].

Download

All files are sorted by score (ascending).

== DBpedia 2016-04 (NEW) ==
== DBpedia 2015-10 ==
Wikidata:
== DBpedia 2015-04 ==
Wikidata:
== DBpedia 2014 ==
== DBpedia 3.9 ==
== DBpedia 3.8 ==



Applications

Acknowledgements

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 611346 and by the German Federal Ministry of Education and Research (BMBF) within the Software Campus project "SumOn" (grant no. 01IS12051).

xLiMe Project
SumOn Project


Last modified: February 06 2017 21:30:39 UTC.