MIA-Paris, AgroParisTech, INRA - Paris, France
Supervisors: Juliette Dibie, Liliana Ibanescu
LRI, Paris Sud University, CNRS - Orsay, France
Supervisors: Nathalie Pernelle, Fatiha Saïs
Graph-based representation of some knowledge in a machine-readable format
[Linked Data Principles, Tim Berners-Lee, 2006]
4th principle: "Link your data to other people’s data to provide context."
db:drug33 rdfs:label "Dolasetronum" .
db:drug33 db:associatedCondition db:nausea .
freebase:m.02jq1 freebase:objectName "Dolasetron" .
freebase:m.452as freebase:sell freebase:m.02jq1 .
freebase:m.452as freebase:location geonames:Paris .
db:drug33 owl:sameAs freebase:m.02jq1 .
Where to find a medicine for Nausea in Paris?
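A minimal sketch of how the owl:sameAs link lets this question be answered across both datasets. The triples are copied from the example above; the helper names `aliases` and `sellers_in` are hypothetical, and only direct owl:sameAs links are followed.

```python
# Toy triples copied from the example above (prefixed names kept as plain strings).
TRIPLES = {
    ("db:drug33", "rdfs:label", "Dolasetronum"),
    ("db:drug33", "db:associatedCondition", "db:nausea"),
    ("freebase:m.02jq1", "freebase:objectName", "Dolasetron"),
    ("freebase:m.452as", "freebase:sell", "freebase:m.02jq1"),
    ("freebase:m.452as", "freebase:location", "geonames:Paris"),
    ("db:drug33", "owl:sameAs", "freebase:m.02jq1"),
}


def aliases(term, triples):
    """Return term plus every IRI directly linked to it by owl:sameAs (either direction)."""
    same = {term}
    for s, p, o in triples:
        if p == "owl:sameAs":
            if s == term:
                same.add(o)
            if o == term:
                same.add(s)
    return same


def sellers_in(condition, city, triples):
    """Find sellers located in city that sell a drug associated with condition."""
    drugs = {s for s, p, o in triples if p == "db:associatedCondition" and o == condition}
    drug_names = set().union(*(aliases(d, triples) for d in drugs))
    located = {s for s, p, o in triples if p == "freebase:location" and o == city}
    return {s for s, p, o in triples
            if p == "freebase:sell" and o in drug_names and s in located}


print(sellers_in("db:nausea", "geonames:Paris", TRIPLES))
```

Without the owl:sameAs triple the same query returns nothing, since the drug and its seller are described in different datasets.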
(data analysis, data integration → search engines, virtual assistants, recommender systems)
The Standard Semantic Web Identity Predicate
〈x,owl:sameAs,y〉
Same real-world entity with two names (IRIs): x and y
- Reflexive: ∀x →〈x,sameAs,x〉
- Symmetric:〈x,sameAs,y〉→〈y,sameAs,x〉
- Transitive:〈x,sameAs,y〉∧〈y,sameAs,z〉→〈x,sameAs,z〉
-〈x,sameAs,y〉∧〈x,p,z〉→〈y,p,z〉
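The four rules above can be sketched as a naive fixpoint computation. This is an illustration on toy data, not a scalable reasoner; the function name `same_as_closure` is hypothetical, and reflexivity is applied only to subjects, which is a simplifying assumption.

```python
def same_as_closure(triples):
    """Saturate a set of (s, p, o) triples under the four sameAs rules."""
    facts = set(triples)
    # Reflexivity, restricted here to terms that appear as subjects.
    facts |= {(s, "sameAs", s) for s, _, _ in triples}
    changed = True
    while changed:
        new = set()
        same = {(s, o) for s, p, o in facts if p == "sameAs"}
        for x, y in same:
            new.add((y, "sameAs", x))          # symmetry
            for y2, z in same:
                if y2 == y:
                    new.add((x, "sameAs", z))  # transitivity
            for s, p, o in facts:
                if s == x:
                    new.add((y, p, o))         # property sharing
        changed = not new <= facts
        facts |= new
    return facts


closure = same_as_closure({
    ("a", "sameAs", "b"),
    ("b", "sameAs", "c"),
    ("a", "label", "A"),
})
print(("c", "label", "A") in closure)
```

Because every property value is copied to every member of the closure, a single wrong link spreads its effects to all terms it connects.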
The Web of Data contains a large number of erroneous identity statements
How to limit the misuse of identity links in knowledge graphs?
Ⅰ. Identity Management Service
Ⅱ. Identity Invalidation Approach
Ⅲ. Contextual Identity Relation
The largest identity set contains 177,794 IRIs that 'should' all be different names for the same real-world entity
http://dbpedia.org/resource/Albert_Einstein
http://dbpedia.org/resource/Basketball
http://dbpedia.org/resource/Coca-Cola
http://dbpedia.org/resource/Deauville
http://dbpedia.org/resource/Italy
http://dbpedia.org/resource/Lists_of_christian_religions
...
Full list at: https://sameas.cc/term?id=4073
Scalable to the Web of Data
No assumptions on the data
(e.g. UNA, textual description, schema mappings)
High accuracy, precision and recall
Provide materialized and public results on real-world data
1. Extract the explicit identity network
(weighted network of owl:sameAs links)
2. Partition the sameAs network into equality sets
(connected component theoretically representing the same entity)
3. Detect the community structure within each equality set
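Steps 1 and 2 above can be sketched as follows. The weighting scheme (2 for a link asserted in both directions, 1 otherwise) is an assumption made for this illustration, as are the function names. Step 3 needs a community detection algorithm applied to each equality set and is not shown here.

```python
from collections import defaultdict


def build_identity_network(same_as_links):
    """Step 1: undirected weighted edges from directed (x, y) sameAs pairs.

    Assumed weighting: 2 if the link is asserted in both directions, else 1.
    """
    directed = set(same_as_links)
    weights = {}
    for x, y in directed:
        if x == y:
            continue  # reflexive links add no information
        weights[frozenset((x, y))] = 2 if (y, x) in directed else 1
    return weights


def equality_sets(weights):
    """Step 2: connected components of the network, using union-find."""
    parent = {}

    def find(n):
        parent.setdefault(n, n)
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for edge in weights:
        a, b = tuple(edge)
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in list(parent):
        groups[find(n)].add(n)
    return list(groups.values())


links = [("a", "b"), ("b", "a"), ("b", "c"), ("d", "e")]
network = build_identity_network(links)
print(sorted(sorted(s) for s in equality_sets(network)))
```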
Error degree | 0-0.2 | 0.2-0.4 | 0.4-0.6 | 0.6-0.8 | 0.8-1 | Total |
---|---|---|---|---|---|---|
Total | 40 | 40 | 40 | 40 | 40 | 200 |
can't tell | 5 | 18 | 19 | 31 | 18 | 91 |
same | 35 (100%) | 22 (100%) | 18 (85.7%) | 7 (77.8%) | 15 (68.2%) | 97 (89%) |
related + unrelated | 0 (0%) | 0 (0%) | 3 (14.3%) | 2 (22.2%) | 7 (31.8%) | 12 (11%) |
1. The higher an error degree is, the more likely an owl:sameAs link is erroneous
2. All evaluated links with an error degree ≤ 0.4 are correct
Some of the detected owl:sameAs links are clearly erroneous, e.g.〈Bolivia, owl:sameAs, Albert_Einstein〉
Some identity assertions relate two closely related terms that are considered the same in some applications but not in others
(explains the disagreement between Semantic Web experts in judging owl:sameAs links [Halpin et al. 2010])
1. Explicitly associate the identity relation to a context
In which context is “drug1” considered identical to “drug2”?
1. In a context where we discard the properties “name” and “hasValue”
2. In a context where we discard the property “name” and the weight of lactose
1. Introduce the notion of a Global Context
(a global context is a sub-ontology represented as a Named Graph in RDF)
Given an RDFS Ontology O = (C, P, A) with
C = set of classes
P = set of properties
A = set of axioms (e.g. domains and ranges, subsumption)
A (Global) Context is a sub-ontology GCu = (Cu, Pu, Au) with
Cu ⊆ DepC ⊆ C
Pu ⊆ P
Au = domain and range axioms more specific than those described in A
Some contexts are more specific than others (order relation)
GCu ≤ GCv
if Cv ⊆ Cu, Pv ⊆ Pu, and ∀p ∈ Pv:
domainv(p) ⊑ domainu(p) and rangev(p) ⊑ rangeu(p)
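A minimal sketch of this order relation, transcribing the three conditions exactly as stated above. The class names, the `parents` map encoding subsumption, and the `GlobalContext` layout are hypothetical, and each property is assumed to have a single domain and range.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalContext:
    classes: set
    properties: set
    domains: dict = field(default_factory=dict)  # property -> domain class
    ranges: dict = field(default_factory=dict)   # property -> range class


def is_subclass(sub, sup, parents):
    """True if sub ⊑ sup, following a child -> parent map (assumed acyclic)."""
    while sub is not None:
        if sub == sup:
            return True
        sub = parents.get(sub)
    return False


def leq(gc_u, gc_v, parents):
    """GCu ≤ GCv: Cv ⊆ Cu, Pv ⊆ Pu, and the domain/range conditions on each p in Pv."""
    return (
        gc_v.classes <= gc_u.classes
        and gc_v.properties <= gc_u.properties
        and all(
            is_subclass(gc_v.domains[p], gc_u.domains[p], parents)
            and is_subclass(gc_v.ranges[p], gc_u.ranges[p], parents)
            for p in gc_v.properties
        )
    )


parents = {"Drug": "Product"}  # hypothetical: Drug ⊑ Product
gc_u = GlobalContext({"Drug", "Lactose"}, {"name", "weight"},
                     {"name": "Product", "weight": "Product"},
                     {"name": "Literal", "weight": "Literal"})
gc_v = GlobalContext({"Drug"}, {"weight"},
                     {"weight": "Drug"}, {"weight": "Literal"})
print(leq(gc_u, gc_v, parents), leq(gc_v, gc_u, parents))
```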
2. Define the conditions in which two class instances are identical in a Global Context
(if their contextual descriptions are isomorphic up to a renaming of the instances IRI)
3. Detect the most specific Global Context(s) in which a pair of instances of a target class are identical
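A simplified sketch of the identity condition in step 2. It assumes acyclic descriptions, in which case isomorphism up to renaming of instance IRIs reduces to comparing IRI-free canonical forms. The toy graph and the property names `isComposedOf` and `weight` are hypothetical.

```python
from collections import Counter


def contextual_description(node, graph, context_properties):
    """Canonical, IRI-free form of what the context keeps about node.

    graph maps an instance to a list of (property, value) pairs; a value that
    is itself a key of graph is treated as an instance, anything else as a literal.
    """
    items = []
    for prop, value in graph.get(node, []):
        if prop not in context_properties:
            continue  # property discarded by the context
        if value in graph:
            # Recurse into the linked instance and drop its IRI.
            value = contextual_description(value, graph, context_properties)
        items.append((prop, value))
    # Counter keeps multiplicities, so repeated identical values are not merged.
    return frozenset(Counter(items).items())


def identical_in_context(x, y, graph, context_properties):
    return (contextual_description(x, graph, context_properties)
            == contextual_description(y, graph, context_properties))


graph = {
    "drug1": [("name", "Drug A"), ("isComposedOf", "lactose1")],
    "lactose1": [("weight", 10)],
    "drug2": [("name", "Drug B"), ("isComposedOf", "lactose2")],
    "lactose2": [("weight", 10)],
}
print(identical_in_context("drug1", "drug2", graph, {"isComposedOf", "weight"}))
print(identical_in_context("drug1", "drug2", graph, {"name", "isComposedOf", "weight"}))
```

The two drugs are identical once “name” is discarded and differ when it is kept, which mirrors the drug1/drug2 question raised earlier.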
Can these identity links be detected in a real-world knowledge graph?
What is the benefit of having contextual identity links in a knowledge graph?
Conditions tend to change even slightly from one experiment to another
→ things can rarely be declared the same
(e.g. not the exact same materials, different sample size, different experts)
Identity depends on each expert and application
→ experts cannot manually provide all possible contexts
Complex Ontologies
(large number of classes, different granularity, long paths for attaining literal values)
1. Syntactic Problems
heterogeneity of the formats in which scientific data are published
2. Semantic Problems
terminological variations encountered across the multiple scientific datasets
(e.g. synonyms, aliases, multilingualism)
Target class | Mixture | Step |
---|---|---|
# Individuals of target class | 1,187 | 581 |
# Possible pairs | 703,891 | 168,490 |
# Different Global Contexts | 2,232 | 718 |
# Identity Links | 1,279,376 | 348,017 |
# Most Specific Global Context per pair | 1.81 | 2.06 |
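As a quick arithmetic check, the number of possible pairs for n individuals is n(n-1)/2. The snippet below recomputes it from the individual counts in the table and also prints the ratio of identity links to possible pairs, which comes out close to the per-pair figure in the last row. How that last row was actually computed is not stated in the source, so the ratio is shown only for comparison.

```python
from math import comb

# Figures copied from the table above.
stats = {
    "Mixture": {"individuals": 1187, "identity_links": 1_279_376},
    "Step": {"individuals": 581, "identity_links": 348_017},
}

pairs = {name: comb(s["individuals"], 2) for name, s in stats.items()}
for name, s in stats.items():
    ratio = s["identity_links"] / pairs[name]
    print(f"{name}: {pairs[name]:,} possible pairs, {ratio:.3f} identity links per pair")
```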
identical〈GCi〉(x, y) ∧ observationMeasure(x, m1)
→ observationMeasure(y, m2) with m1 ≃ m2
Example: identical〈GC3〉(x, y) → same(pH)
[4.5% error rate; 647 support]
x and y have the same citric acid weight
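A minimal sketch of how such a rule could be scored. The definitions used here are assumptions, since the source does not spell them out: support is taken as the number of identical pairs for which both measures are available, the error rate as the share of those pairs whose measures differ, and m1 ≃ m2 as agreement within a hypothetical `tolerance`.

```python
def evaluate_rule(pairs, measure, tolerance=0.1):
    """Score the rule identical(x, y) -> measure(x) ≃ measure(y).

    pairs: (x, y) pairs identical in the chosen context.
    measure: maps an instance to its observed value (instances may be missing).
    Returns (support, error_rate) under the assumed definitions.
    """
    support = errors = 0
    for x, y in pairs:
        if x in measure and y in measure:
            support += 1
            if abs(measure[x] - measure[y]) > tolerance:
                errors += 1
    return support, (errors / support if support else 0.0)


# Toy data: pH values for three instances; "d" has no measurement.
ph = {"a": 4.0, "b": 4.05, "c": 5.0}
result = evaluate_rule([("a", "b"), ("a", "c"), ("b", "d")], ph)
print(result)
```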
How to limit the misuse of identity links in Knowledge Graphs?
1. All evaluated links with an error degree ≤ 0.4 are correct
2. The higher an error degree is, the more likely an owl:sameAs link is erroneous
Sample | TN | TP | FN | FP | Total |
---|---|---|---|---|---|
Sample 1 [200 random sameAs] | 97 | 0 | 12 | 0 | 109 |
Sample 2 [60 sameAs > 0.9] | 6 | 20 | 5 | 8 | 39 |
Sample 3 [Obama Equality Set] | 30 | 2 | 0 | 0 | 32 |
Sample 4 [DBPedia-Freebase sameAs] | 77 | 0 | 0 | 1 | 78 |
Sample 5 [Injected erroneous sameAs] | 0 | 725 | 55 | 0 | 780 |
Total | 210 | 747 | 72 | 9 | 1038 |
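The totals row can be turned into the usual summary metrics. This sketch assumes the positive class is "link flagged as erroneous", which fits Sample 5 (injected erroneous links counted as TP or FN) but is not stated explicitly in the source.

```python
# Totals row of the table above.
tn, tp, fn, fp = 210, 747, 72, 9

precision = tp / (tp + fp)            # flagged links that are really erroneous
recall = tp / (tp + fn)               # erroneous links that were flagged
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

These aggregate figures are dominated by Sample 5, which contributes 780 of the 1038 evaluated links.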
owl:sameAs links | Symmetrical | Non-symmetrical | Total |
---|---|---|---|
Total | 94 | 164 | 258 |
same | 92 (98%) | 127 (77%) | 219 (85%) |
related + unrelated | 2 (2%) | 37 (23%) | 39 (15%) |
1. Symmetrical owl:sameAs links are more likely to be correct
Manual evaluation of 30 sameAs links with error degree > 0.99 that belong to the largest equality set
Only 2 erroneous owl:sameAs
(17 correct and 11 can't tell)
Precision drops from 88% to 11%
(in the largest equality set)
freebase:m.05b6w1g owl:sameAs dbr:President_Barack_Obama
freebase:m.05b6w1g owl:sameAs dbr:President_Obama
freebase:m.05b6w1g freebase:object.name "Presidency of Barack Obama"@en
1. Number of symmetrical owl:sameAs links
(450M owl:sameAs links, each with a 98% chance of being correct)
2. Number of non-symmetrical owl:sameAs links with error degree ≤ 0.99
(105M owl:sameAs links, each with an 88.6% chance of being correct)
3. Number of non-symmetrical owl:sameAs links with error degree > 0.99
(1.2M owl:sameAs links, each with a 40% to 88% probability of being erroneous)
We estimate that 4% of owl:sameAs links in the LOD are erroneous
(around 22.5M links)
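The estimate can be roughly reproduced from the three groups listed above. This sketch uses only the rounded figures shown on the slide, so it lands near the quoted 4% while differing slightly from the quoted link count; the variable names are illustrative.

```python
# Group sizes in millions of links, with the probability that a link is erroneous.
symmetrical, p_err_sym = 450, 1 - 0.98
non_sym_low, p_err_low = 105, 1 - 0.886
non_sym_high = 1.2
p_err_high_range = (0.40, 0.88)  # lower and upper bound for the third group

total = symmetrical + non_sym_low + non_sym_high
base = symmetrical * p_err_sym + non_sym_low * p_err_low
low, high = (base + non_sym_high * p for p in p_err_high_range)

print(f"between {low:.2f}M and {high:.2f}M erroneous links "
      f"({100 * low / total:.2f}% to {100 * high / total:.2f}% of {total:.1f}M)")
```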
Previous estimates:
2.8% by [Hogan et al. 2012] and 21% by [Halpin et al. 2010]