Wednesday, September 21, 2016

Seminar by Stefano Ceri on Thursday September 29th at Télécom ParisTech

Stefano Ceri, Professor at Politecnico di Milano, will give a talk on Thursday, September 29th, 2016, 14:30, in Amphi Jade, Télécom ParisTech, 46 rue Barrault (Paris 13).


Data-Driven Genomic Computing


Abstract

Genomic computing is a new science focused on understanding the functioning of the genome, as a premise to fundamental discoveries in biology and medicine. Next Generation Sequencing (NGS) allows the production of the entire human genome sequence at a cost of about US $1,000; many algorithms exist for the extraction of genome features, or "signals", including peaks (enriched regions), mutations, or gene expression (intensity of transcription activity). What is still missing is a system supporting data integration and exploration, giving a “biological meaning” to all the available information; such a system can be used, e.g., for better understanding cancer or how the environment influences cancer development.
The GeCo project (Data-Driven Genomic Computing, an ERC Advanced Grant currently in contract preparation) has the objective of revisiting genomic computing through the lens of basic data management, by means of models, languages, and instruments; the research group at DEIB is among the few that focus on genomic data integration. Starting from an abstract model, we have already developed a system that can be used to query processed data produced by several large genomic consortia, including ENCODE and TCGA; the system internally employs the Spark, Flink, and SciDB data engines, and prototypes can already be accessed on Cineca servers or downloaded from PoliMi servers. During the five years of the ERC project, the system will be enriched with data analysis tools and environments and will be made increasingly efficient.
Most diseases have a genetic component, hence a system capable of integrating genomic “big data” is of paramount importance. Among the objectives of the project is the creation of an “open source” system available to biological and clinical research; while the GeCo project will provide public services using only public data (anonymized and made available for secondary use, i.e., knowledge discovery), the use of the GeCo system within protected clinical contexts will enable personalized medicine, i.e., the adaptation of therapies to the specific genetic features of patients. The most ambitious objective is the development, during the five-year ERC project, of an “Internet for Genomics”, i.e., a protocol for collecting data from consortia and individual researchers, and of a “Google for Genomics”, supporting indexing and search over huge collections of genomic datasets.
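The core data-management primitive behind such queries is the overlap join of genomic regions. As a purely illustrative sketch (not the GeCo system's actual API, and with made-up coordinates), regions can be modeled as (chromosome, start, end) triples and joined on physical overlap:

```python
# Hypothetical sketch of the kind of region-based operation a genomic
# data-management system must support: joining two datasets of genomic
# regions (chromosome, start, end) on physical overlap.

def overlaps(a, b):
    """True if two regions lie on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def region_join(peaks, genes):
    """Naive nested-loop overlap join; real engines use interval indexes."""
    return [(p, g) for p in peaks for g in genes if overlaps(p, g)]

# Toy data: ChIP-seq peaks and gene bodies (invented coordinates).
peaks = [("chr1", 100, 200), ("chr2", 50, 80)]
genes = [("chr1", 150, 400), ("chr1", 500, 900)]

# Only the chr1 peak overlapping the first gene matches.
print(region_join(peaks, genes))
```

The nested loop is quadratic; at genome scale this is exactly the step that motivates delegating execution to engines such as Spark, Flink, or SciDB.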


Bio

Stefano Ceri about himself:
I am professor of Database Systems at the Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) of Politecnico di Milano. I was visiting professor at the Computer Science Department of Stanford University (1983-1990). I was the chairman of the Computer Science Section of DEI (1992-2004), and the chairman of LaureaOnLIne, a fully online curriculum in Computer Engineering (2004-2008).
I was the director of Alta Scuola Politecnica, the school of excellence for top-level master students selected from Engineering, Architecture, and Design Faculties of Politecnico di Milano and Politecnico di Torino (October 2010 - September 2013).
I was an associate editor of ACM Transactions on Database Systems and IEEE Transactions on Software Engineering, and I am currently an associate editor of several international journals. I am co-editor-in-chief (with Mike Carey) of the book series "Data-Centric Systems and Applications" (Springer-Verlag).
I am a member of the Executive Committee of ALFC - Associazione Lombarda Fibrosi Cistica (April 2013 - April 2016).
I am the recipient of the ACM SIGMOD "Edgar F. Codd Innovations Award" (New York, June 26, 2013). I am an ACM Fellow and a member of Academia Europaea.

Wednesday, April 13, 2016

Seminar by Paolo Papotti on Monday April 18th at Télécom ParisTech

Paolo Papotti, Assistant Professor at Arizona State University, will give a talk on Monday, April 18th, 2016, 15:00, in Amphi Rubis, Télécom ParisTech, 46 rue Barrault (Paris 13).

Data Cleaning in the Big Data era


Abstract

In the “big data” era, data is often dirty for several reasons, such as typos, missing values, and duplicates. The intrinsic problem with dirty data is that it can lead to poor results in analytic tasks. Data cleaning is therefore an unavoidable step in data preparation to obtain reliable data for final applications, such as querying and mining. Unfortunately, data cleaning is hard in practice and requires a great amount of manual work. Several systems have been proposed to increase automation and scalability in the process. They rely on a formal, declarative approach based on first-order logic: users provide high-level specifications of their tasks, and the systems compute optimal solutions without human intervention on the generated code. However, traditional ‘top-down’ cleaning approaches quickly become impractical when dealing with the complexity and variety found in big data. In this talk, we first describe recent results in tackling data cleaning with a declarative approach. We then discuss how this experience has pushed several groups to propose new systems that recognize the central role of the users in cleaning big data.
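The declarative idea can be pictured with the simplest class of such first-order rules, functional dependencies. The sketch below is illustrative only (it is not Papotti's system, and the rows are invented): the user states a rule such as ZIP → City once, and generic code finds the violating tuple pairs instead of ad-hoc scripts.

```python
# Minimal sketch of declarative data cleaning: detect violations of a
# functional dependency lhs -> rhs, i.e. pairs of rows that agree on the
# lhs attributes but disagree on the rhs attributes.

from itertools import combinations

def fd_violations(rows, lhs, rhs):
    """Return pairs of rows violating the functional dependency lhs -> rhs."""
    return [
        (r1, r2)
        for r1, r2 in combinations(rows, 2)
        if all(r1[a] == r2[a] for a in lhs) and any(r1[a] != r2[a] for a in rhs)
    ]

# Toy relation with one typo-induced inconsistency.
rows = [
    {"zip": "75013", "city": "Paris"},
    {"zip": "75013", "city": "Pariss"},   # violates ZIP -> City
    {"zip": "20133", "city": "Milan"},
]

print(fd_violations(rows, lhs=["zip"], rhs=["city"]))
```

Detecting violations is only half the problem; deciding which value to repair is what makes automated cleaning hard and, as the talk argues, often requires the user in the loop.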

Bio

Paolo Papotti is an Assistant Professor of Computer Science in the School of Computing, Informatics, and Decision Systems Engineering (CIDSE) at Arizona State University. He received his Ph.D. in Computer Science from Università degli Studi Roma Tre (Rome, Italy) in 2007 and, before joining ASU, was a senior scientist at the Qatar Computing Research Institute.

His research is focused on systems that assist users in complex, necessary tasks and that scale to large datasets with efficient algorithms and distributed platforms. His work has been recognized with two “Best of the Conference” citations (SIGMOD 2009, VLDB 2015) and with a best demo award at SIGMOD 2015. He is a group leader for SIGMOD 2016 and an associate editor of the ACM Journal of Data and Information Quality (JDIQ).

Tuesday, February 2, 2016

Seminar by Meghyn Bienvenu on February 26, 2016

Meghyn Bienvenu (CNRS/U. Montpellier, http://www.lirmm.fr/~meghyn/) will present her tutorial on "Ontology-Mediated Query Answering"
(http://www.csw.inf.fu-berlin.de/rw2015/lecturers.html#QueryAnswering)

When: Friday 26/2/2016, from 10 am to 12 pm and from 2 pm to 4 pm
Where: Salle Gilles Kahn
Inria Saclay Île-de-France
Bâtiment Alan Turing
1 rue Honoré d'Estienne d'Orves
Campus de l'École Polytechnique
91120 Palaiseau
GPS coordinates:
+48° 42' 52.11", +2° 12' 20.78"
How to get there:
http://www.inria.fr/en/centre/saclay/overview/practical-info/how-to-reach-the-centre

The closest RER station is Lozère (RER B). If you need help getting from Lozère to the seminar, contact me (ioana.manolescu@inria.fr).

Wednesday, January 13, 2016

Seminars by Julia Stoyanovich and Benny Kimelfeld at Télécom ParisTech (21 January 2016)



Julia Stoyanovich (Drexel University) and Benny Kimelfeld (Technion) will give talks on 21 January 2016 at Télécom ParisTech, 46 rue Barrault, Paris, 14:00 in Amphi Saphir.

Portal: A query language for evolving graphs

Julia Stoyanovich, Drexel University, Philadelphia, PA, U.S.A.


Graphs are used to represent a plethora of phenomena, from the Web and social networks, to biological pathways, to semantic knowledge bases. Arguably, the most interesting and important questions one can ask about graphs have to do with their evolution. Which Web pages are showing an increasing popularity trend? How does influence propagate in social networks? How does knowledge evolve? In this talk I will present Portal, a declarative language for efficient querying and exploratory analysis of evolving graphs. I will describe an implementation of Portal on top of Apache Spark, an open-source distributed data processing framework, and will demonstrate that careful engineering can lead to good performance. Finally, I will describe our work on a visual query composer for Portal.
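One common way to picture an evolving graph, used here purely as an illustration (this is not Portal's model or API, and the edges are invented), is as a sequence of timestamped edge snapshots over which temporal queries such as "whose degree keeps growing?" range:

```python
# Illustrative sketch: an evolving graph as a map from timestamp to edge
# list, and a temporal query that checks whether a node's degree grows
# strictly across the snapshot sequence.

snapshots = {
    2014: [("a", "b")],
    2015: [("a", "b"), ("a", "c")],
    2016: [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")],
}

def degree(edges, node):
    """Number of edges incident to `node` in one snapshot."""
    return sum(node in e for e in edges)

def increasing_degree(snaps, node):
    """True if `node`'s degree strictly grows over time."""
    degs = [degree(snaps[t], node) for t in sorted(snaps)]
    return all(x < y for x, y in zip(degs, degs[1:]))

print(increasing_degree(snapshots, "a"))   # degree goes 1 -> 2 -> 3
print(increasing_degree(snapshots, "b"))   # degree goes 1 -> 1 -> 2
```

Materializing every snapshot is wasteful when consecutive snapshots share most edges, which is precisely the kind of representation and execution choice a Spark-based implementation must engineer carefully.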

Julia Stoyanovich is an Assistant Professor of Computer Science at the College of Computing and Informatics at Drexel University (Philadelphia, USA). Prior to joining Drexel, she was a Postdoctoral researcher and an NSF/CRA Computing Innovations Fellow at the University of Pennsylvania. Julia received her MS and PhD degrees in Computer Science at Columbia University (New York, USA) in 2003 and 2009, respectively, and her BS in Computer Science and in Mathematics and Statistics at the University of Massachusetts Amherst, USA in 1998. Having graduated from college, Julia spent 5 years in the start-up industry, as a software developer, data architect and database administrator. This experience has motivated her to work with real datasets whenever possible, and to deliver results of her research to the communities of target users, as part of open-source systems or as stand-alone prototypes. Julia's research is in the area of data and knowledge management. Her focus is on developing novel information discovery approaches, with the goal of helping the user identify relevant information, and ultimately transform that information into knowledge. She has recently worked with a wide variety of real datasets, from shopping, dating and collaborative tagging applications, to full-genome association studies and gene expression microarrays, to data-intensive workflows and scientific articles. For more information, see https://www.cs.drexel.edu/~julia/


Database Principles in the Wild

Benny Kimelfeld, Technion, Haifa, Israel


Modern technological and social trends, such as mobile computing, blogging, and social networking, produce an enormous amount of often valuable data. At the same time, the means to analyze such data are becoming more accessible with the popularity of business models like cloud computing, open source and crowd sourcing. But such data pose challenges to traditional database paradigms. Due to the uncontrolled nature by which data is produced, much of it is free text, often in informal natural language, leading to computing environments with high levels of uncertainty and error. In this talk I will describe principled research that I have been pursuing towards systems that facilitate modern data-centric development by unifying key functionalities of databases, text analytics, machine learning and artificial intelligence.

Benny Kimelfeld is an Associate Professor in the Computer Science Faculty at Technion, Israel. After receiving his Ph.D. from The Hebrew University of Jerusalem, he was a Research Staff Member at IBM Research Almaden and a Computer Scientist at LogicBlox. Benny's research spans a spectrum of both foundational and systems aspects of data management, such as probabilistic and inconsistent databases, information retrieval over structured data, and infrastructure for text analytics. Benny was an invited tutorial speaker at PODS 2014 and a co-chair of the first SIGMOD/PODS workshop on Big Uncertain Data (BUDA). He is a co-chair of the 2016 Web and Databases Workshop (WebDB'16), and he currently serves as an associate editor of the Journal of Computer and System Sciences (JCSS). Benny is a Taub Fellow at Technion, and his research is funded by the Israel Science Foundation (ISF), the United States - Israel Binational Science Foundation (BSF), and DARPA. For more information, see http://www.cs.technion.ac.il/people/bennyk/