EDBT Summer School 2019

Lyon, Saint-Germain-au-Mont-d'Or, France
2nd September - 6th September 2019

Extracting Hidden Knowledge
from Heterogeneous Massive Data





Lectures

Mining temporal networks.

Aristides Gionis. Aalto University - Finland. Polina Rozenshtein. Aalto University - Finland.
Networks (or graphs) are used to represent and analyze large datasets of objects and their relations. Typical examples of graph applications come from social networks, traffic networks, electric power grids, road systems, the Internet, chemical and biological systems, and more. Naturally, real-world networks have a temporal component: for instance, interactions between objects have a timestamp and a duration. In this tutorial we will present models and algorithms for mining temporal networks, i.e., network data with temporal information. We will give an overview of the different models used to represent temporal networks. We will highlight the main differences between static and temporal networks, and we will discuss the challenges arising from introducing the temporal dimension in the network representation. Finally, we will present recent papers addressing the most well-studied problems in the setting of temporal networks, including centrality measures, community detection and graph partitioning, event and anomaly detection, and network summarization.
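The difference between static and temporal reachability, one of the themes above, can be made concrete with a tiny sketch. The model below is my own illustration, not material from the lecture: a temporal network as a set of timestamped contacts (u, v, t), where a node is reachable only along paths whose timestamps do not decrease.

def time_respecting_reachable(contacts, source):
    """Nodes reachable from `source` along contacts with non-decreasing times."""
    earliest = {source: float("-inf")}  # earliest arrival time at each node
    for u, v, t in sorted(contacts, key=lambda c: c[2]):  # scan in time order
        if u in earliest and earliest[u] <= t:
            earliest[v] = min(earliest.get(v, float("inf")), t)
        if v in earliest and earliest[v] <= t:
            earliest[u] = min(earliest.get(u, float("inf")), t)
    return set(earliest)

# Statically, "c" is reachable from "a"; temporally it is not, because the
# contact (b, c) happened before (a, b).
contacts = [("a", "b", 2), ("b", "c", 1)]
print(time_respecting_reachable(contacts, "a"))  # {'a', 'b'} (set order may vary)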

Data Curation and Machine Learning.

Ihab F. Ilyas. University of Waterloo - Canada.
Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. In this talk I discuss why leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions. The talk focuses on two main problems: (1) entity consolidation, which is arguably the most difficult data curation challenge because it is notoriously complex and hard to scale; and (2) using probabilistic inference to suggest data repairs for identified errors and anomalies using our new system called HoloClean. Both problems have challenged researchers and practitioners for decades due to the fundamentally combinatorial explosion in the space of solutions and the lack of ground truth. There is a large body of work on these problems in both academia and industry. Techniques have included human curation, rules-based systems, and automatic discovery of clusters using predefined thresholds on record similarity. Unfortunately, none of these techniques alone has been able to provide sufficient accuracy and scalability. The talk aims at providing deeper insight into the entity consolidation and data repair problems and discusses how machine learning, human expertise, and problem semantics collectively can deliver a scalable, high-accuracy solution.
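To ground the baseline mentioned above, here is a minimal sketch (my own, and emphatically not the HoloClean system) of clustering records by a fixed similarity threshold, the classic approach whose accuracy limits motivate the ML-based alternatives discussed in the talk.

from difflib import SequenceMatcher
from itertools import combinations

records = ["Jane Doe, NYC", "J. Doe, New York City", "John Smith, Boston"]

parent = list(range(len(records)))            # union-find over record ids
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

THRESHOLD = 0.5                               # hand-tuned in such baselines
for i, j in combinations(range(len(records)), 2):
    if SequenceMatcher(None, records[i], records[j]).ratio() >= THRESHOLD:
        parent[find(i)] = find(j)             # merge the two clusters

clusters = {}
for i in range(len(records)):
    clusters.setdefault(find(i), []).append(records[i])
print(list(clusters.values()))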

Information extraction with document spanners and Big data analytics with logical formalisms.

Benny Kimelfeld. Technion - Israel.
The abundance and availability of valuable textual resources position text analytics as a standard component in data-driven workflows. To facilitate the incorporation of such resources, a core operation is the extraction of structured data from text, a classic task known as Information Extraction (IE). The lecture will begin with a short overview of the algorithmic concepts and techniques used for performing IE tasks, including declarative frameworks that provide abstractions and infrastructures for programming IE. The lecture will then focus on the concept of a "document spanner" that models an IE program as a function that takes as input a text document and produces a relation of spans (intervals in the document) over a predefined schema. For example, a well-studied language for expressing spanners is that of the "regular" spanners: relational algebra over regular expressions with capture variables. The lecture will cover recent advances in the theory of document spanners, including their expressive power and computational complexity, aspects of incompleteness and inconsistency, integration with structured databases, and compilation into parallel executions over document fragments. Finally, the lecture will list relevant open problems and future directions, including aspects of uncertainty and explainability.
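As a concrete illustration of the span-relation view, here is a minimal sketch of my own, using Python's re module rather than a dedicated spanner formalism: a regular expression with capture variables maps a document to a relation whose attributes are the variables and whose values are spans (intervals) of the document.

import re

doc = "Alice met Bob. Carol met Dan."

# Capture variables x and y, realized here as named groups.
pattern = re.compile(r"(?P<x>[A-Z][a-z]+) met (?P<y>[A-Z][a-z]+)")

# Each match contributes one row, assigning a span (start, end) to each variable.
rows = [{v: m.span(v) for v in ("x", "y")} for m in pattern.finditer(doc)]
print(rows)
# [{'x': (0, 5), 'y': (10, 13)}, {'x': (15, 20), 'y': (25, 28)}]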

Working with Knowledge Graphs.

Markus Krötzsch. TU Dresden - Germany.
Knowledge graphs are an important asset for many AI applications, such as personal assistants and semantic search, and a valuable resource for related fields of research. They are also, of course, a conceptual umbrella that spans rather different methods and approaches in the intersection of databases, knowledge representation, information extraction, knowledge management, and web technologies. This course focuses on the effective handling and usage of knowledge graphs. For a concrete and motivating example, we will dive into Wikidata, the knowledge base of Wikipedia, as used, e.g., by Apple's Siri and Amazon's Alexa. We will discuss data access and query answering, and explain how to use this resource in one's own projects. Moving beyond the well-established technologies, we then take a look at advanced rule-based reasoning approaches for inferring implicit information and validating (possibly recursive) schema constraints. The course aims at combining principled insights from foundational research with pragmatic perspectives that can be put to use in hands-on exercises.
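As a taste of the data-access part, the sketch below (my own example, assuming Wikidata's public endpoint at https://query.wikidata.org/sparql and the third-party requests library) retrieves a few French cities from Wikidata.

import requests

query = """
SELECT ?city ?cityLabel WHERE {
  ?city wdt:P31 wd:Q515 .          # instance of: city
  ?city wdt:P17 wd:Q142 .          # country: France
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "edbt-summer-school-demo/0.1"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["cityLabel"]["value"])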

Data Management Challenges in Data Lakes.

Renée Miller. Northeastern University - USA.
The ubiquity of data lakes has created fascinating new challenges for data management research. In this tutorial, we review the state of the art in data management for data lakes. We consider how data lakes are introducing new problems, including dataset discovery, and how they are changing the requirements for classic problems, including data extraction, data cleaning, data integration, data versioning, and metadata management. We use Open Data (and a data lake created from Open Data) as an experimental platform for stress-testing data management solutions for data discovery and data curation in data lakes.
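One dataset-discovery primitive that such a platform can stress-test is joinability search. The sketch below is my own toy example, not material from the tutorial: it ranks data-lake columns by how fully they contain the values of a query column.

def containment(query_col, candidate_col):
    """Fraction of the query column's values found in the candidate column."""
    q, c = set(query_col), set(candidate_col)
    return len(q & c) / len(q) if q else 0.0

lake = {
    "parks.csv/city":  ["Lyon", "Paris", "Nice"],
    "budget.csv/dept": ["HR", "IT"],
}
query = ["Lyon", "Paris", "Marseille"]

ranked = sorted(lake.items(), key=lambda kv: containment(query, kv[1]), reverse=True)
print(ranked[0][0])  # parks.csv/city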

Entity Resolution for Large-Scale Data.

Erhard Rahm. University of Leipzig - Germany. Eric Peukert. University of Leipzig - Germany.
Advanced data analytics applications require scalable and highly effective approaches for data integration to combine the information from multiple, heterogeneous data sources. Entity resolution (ER) is a key step in this process that aims at identifying all representations of the same real-world entities, such as customers or products. The ER problem has already received much attention in research and practice, as can be seen from the availability of many products and prototypical tools. Still, there are significant problems needing further research, such as support for holistic entity resolution to integrate data from many sources, e.g., to build knowledge graphs. Furthermore, there is a need for incremental ER approaches that can deal with dynamically changing data sources and support the seamless addition of new data sources. In the tutorial, we first provide an overview of the main steps in an entity resolution pipeline. We then discuss methods to improve scalability, in particular blocking and parallel processing, e.g., based on Hadoop platforms, and related problems such as load balancing. A particular focus will be given to methods for entity clustering to holistically match entities from many sources. We further discuss ideas for incremental entity clustering and methods for repairing entity clusters.
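The scalability effect of blocking, one of the tutorial's focus topics, is easy to see in miniature. The sketch below is my own toy example: only records that share a blocking key are compared pairwise, shrinking the quadratic comparison space.

from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "Jane Doe",   "zip": "69001"},
    {"id": 2, "name": "J. Doe",     "zip": "69001"},
    {"id": 3, "name": "John Smith", "zip": "02115"},
]

blocks = defaultdict(list)
for r in records:
    blocks[r["zip"]].append(r)      # blocking key: zip code (a toy choice)

candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # [(1, 2)] -- 1 comparison instead of 3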
