Complementary datasets linking by means of knowledge graph augmentation and multimodal embeddings
Employer: University of Montpellier
School: Doctoral School I2S, PhD in Informatics
When: Fall 2019
Duration: 36 months
Where: LIRMM, Montpellier, France
Semantic web, data linking, AI, Machine learning, Embeddings, Text and data mining.
Machine learning (scikit-learn, GloVe, word2vec, etc.), Semantic web technologies (OWL, RDF, SPARQL, triplestores, Linked Data)
Web data linking is defined as the problem of integrating heterogeneous datasets structured as knowledge graphs. This thesis aims to build on current solutions and go beyond the state of the art by proposing methods to interlink complementary datasets. Two datasets are said to be complementary if the entities that they contain are described by largely non-intersecting sets of properties. We will develop methodological solutions and tools based on (1) knowledge graph augmentation techniques and (2) combined multimodal (text, graph) embeddings of entities. The new methods, designed to be as generic as possible, will be evaluated in the context of the AgroLD knowledge base (www.agrold.org), which brings together a large number of agronomy datasets. Work will be carried out in close collaboration with the partners of the ANR D2KAB project (www.d2kab.org).
Linking (or interconnecting) data is an active research domain that aims to establish semantic links between entities described in different datasets. We are interested here in data represented as RDF (Resource Description Framework) knowledge graphs, published on the web as part of the collaborative Linked Open Data initiative, which today hosts more than 1100 datasets. The semantic links that we seek to establish are identity links, expressed by the "owl:sameAs" property of the OWL (Web Ontology Language) vocabulary. The difficulty comes from the high heterogeneity of the descriptions of entities that can be found in different graphs [1,2]. The majority of existing linking tools assume that, for each pair of potentially matching entities, there is a common subset of properties that helps infer the identity link (or the lack thereof). However, in a number of real-world cases, this intersection is very small or non-existent; we then speak of complementary datasets. In particular, we are interested in agronomic data from the AgroLD knowledge base (www.agrold.org), which exhibit this problem.
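To make the notion of complementarity concrete, here is a minimal Python sketch that measures how much two entity descriptions overlap in the properties they use (a Jaccard ratio over property names). The property names and values are hypothetical and are not taken from the actual AgroLD schema; the measure itself is only one possible illustration, not the method the thesis will adopt.

```python
def property_overlap(desc_a, desc_b):
    """Jaccard overlap between the property sets of two entity descriptions.

    Each description is a dict mapping a property name to its value.
    Returns a value in [0, 1]; 0 means the two entities share no property.
    """
    props_a, props_b = set(desc_a), set(desc_b)
    union = props_a | props_b
    if not union:
        return 0.0
    return len(props_a & props_b) / len(union)


# Hypothetical descriptions of the same gene in two datasets.
gene_in_dataset_a = {
    "rdfs:label": "example gene",
    "ex:chromosome": "1",
    "ex:startPosition": 2983,
}
gene_in_dataset_b = {
    "rdfs:comment": "free-text note about an associated trait",
    "ex:associatedTrait": "drought tolerance",
}
# A third, hypothetical description that shares two properties with the first.
gene_in_dataset_c = {
    "rdfs:label": "example gene",
    "ex:chromosome": "1",
}

print(property_overlap(gene_in_dataset_a, gene_in_dataset_b))
print(property_overlap(gene_in_dataset_a, gene_in_dataset_c))
```

In the first comparison no property is shared, so a linking tool that relies on comparing values of common properties has nothing to compare: this is the complementary case described above.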
The question then arises of where to look for information to compare entities in different datasets. In a number of cases this information is present in the graphs, but in an unstructured form (e.g., in textual comments or annotations). Text-specific knowledge extraction methods may then be applied to extract this information. For example, a particularity of AgroLD's data is that most of it contains text fields that are not described using standardized terminologies or ontologies. As a result, the discoveries that could be made by searching these resources are limited. We will therefore be interested in the automatic extraction of entities of interest and relations from these textual fields, in order to structure the information they contain and make it usable [3,4]. The AgroPortal Annotator is one possible tool for this purpose [8,9].
On the other hand, a number of knowledge base augmentation approaches exist, which make it possible to automatically complete missing knowledge by using background knowledge contained in other established knowledge graphs (DBpedia, Wikidata, etc.). We propose here to use and adapt these methods for the particular task of linking complementary datasets, by automatically enriching these datasets so that their entities become comparable. In addition, we will use textual data (including scientific articles) for the knowledge-building task. We will tackle the question of the definition and application of lexical embeddings of entities of different modalities (text, graph, social networks) [6,7] that will allow the semantic comparison of these entities.
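As an illustration of how embeddings from several modalities could support such a comparison, the following Python sketch concatenates a normalized text vector and a normalized graph vector for each entity and compares entities by cosine similarity. Concatenation is an assumption made here for simplicity, and the vectors are toy values; in practice they might come from models such as those in [6,7].

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine(text_vec, graph_vec):
    """Concatenate normalized per-modality vectors (one assumed strategy)."""
    return l2_normalize(text_vec) + l2_normalize(graph_vec)


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy (hypothetical) text and graph embeddings for three entities.
entity_a = combine([1.0, 0.0], [0.0, 2.0])
entity_b = combine([3.0, 0.0], [0.0, 1.0])  # same directions as entity_a
entity_c = combine([0.0, 1.0], [1.0, 0.0])  # orthogonal in both modalities

print(cosine(entity_a, entity_b))
print(cosine(entity_a, entity_c))
```

Normalizing each modality before concatenation keeps one modality from dominating the similarity simply because its vectors have larger magnitudes.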
Tasks to accomplish:
We are looking for a motivated junior researcher with experience in machine learning and semantic web technologies. The candidate should match most of the following criteria:
- High motivation for scientific research
- Knowledge of semantic web technologies, especially JSON/RDF/SPARQL.
- Experience with machine learning tools (e.g., Python's scikit-learn)
- Knowledge of text and data mining techniques (e.g., named entity recognition)
- Excellent technical skills to conduct experiments with real-world and benchmark data
- Excellent spoken and written English
- Autonomy and initiative: ability to make technical decisions within the project and justify them
- Basic knowledge of French, with the aim of learning the language during the contract
- Excellent writing skills, as reports, documentation, and technical notes will regularly be required
Applications for this position will be accepted EXCLUSIVELY via the following platform:
Documents required are (include everything in one single PDF file):
- a curriculum vitae describing your education and experience;
- a motivation letter describing your interest in the position and the matches with the expected profile;
- a link to your master's thesis or to relevant related publications;
- copies of your transcripts of records (master, bachelor);
- names and contact details of referees.
No application by email will be accepted, but for more information about this position, please contact Konstantin Todorov (firstname.lastname@example.org) and Clement Jonquet (email@example.com). Please do not attach documents; if you would like to send a document, include a link to it instead.
Remote and face-to-face interviews will be organized.
The successful candidate will hold a scholarship from the French Ministry of Higher Education, Research and Innovation (1600€/month) for a period of three years. Social security and benefits are included. The scholarship may be complemented with teaching activities.
[1] Achichi, M., Bellahsene, Z., Ben Ellefi, M., Todorov, K. (2019, in print). Linking and Disambiguating Entities Across Heterogeneous RDF Graphs. Journal of Web Semantics.
[2] Manel Achichi, Zohra Bellahsene, Konstantin Todorov. A survey on web data linking. Ingénierie des Systèmes d'Information (ISI) 21(5-6): 11-29 (2016).
[3] Rafael Vieira and Kate Revoredo. Using Word Semantics on Entity Names for Correspondence Set Generation. OAEI 2017 challenge.
[4] Yuanzhe Zhang, Xuepeng Wang, Siwei Lai, Shizhu He, Kang Liu, Jun Zhao, and Xueqiang Lv. Ontology Matching with Word Embeddings. CNC, CCL 2014. LNCS, vol. 8801.
[5] Aravind Venkatesan, Gildas Tagny Ngompe, Nordine El Hassouni, Imene Chentli, Valentin Guignon, Clement Jonquet, Manuel Ruiz, Pierre Larmande. Agronomic Linked Data (AgroLD): a Knowledge-based System to Enable Integrative Biology in Agronomy. PLoS ONE, 2018.
[6] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), 2013.
[7] Ristoski, P., and Paulheim, H. RDF2Vec: RDF graph embeddings for data mining. In ISWC, 2016.
[8] Jonquet, C., Toulet, A., Arnaud, E., Aubin, S., Yeumo, E. D., Emonet, V., ... & Larmande, P. (2018). AgroPortal: A vocabulary and ontology repository for agronomy. Computers and Electronics in Agriculture, 144, 126-143.
[9] Tchechmedjiev, A., Abdaoui, A., Emonet, V., Zevio, S., & Jonquet, C. (2018). SIFR annotator: ontology-based semantic annotation of French biomedical text and clinical notes. BMC Bioinformatics, 19(1), 405.