Wednesday, December 9, 2015

Jennifer Widom's talk at Télécom ParisTech (28th January 2016)

Seminar – Jennifer Widom, Stanford University

Three Favorite Results

Thursday, January 28th 2016 at Telecom ParisTech, 46 rue Barrault, 75013 Paris
Amphi B 312 – 10:00 am.
 
Registration is free but compulsory; please register by filling in the form at https://bdmi.wp.mines-telecom.fr/2015/12/03/seminar-jennifer-widom-stanford-university/

Conventional wisdom says good things come in threes. As an exercise recently, I reflected on the research I’ve conducted over my career to date and selected my three favorite results, which I will cover in this talk. For each one I’ll explain the context and motivation, the result itself, and why it ranks as one of my favorites. I’ll also make an attempt to decipher what the results have in common. The three results span computer science foundations, system implementation, and user interface questions, and they represent three of my favorite research areas: semistructured data, data streams, and uncertain data.

 
Jennifer Widom is the Fletcher Jones Professor of Computer Science and Electrical Engineering at Stanford University, and the Senior Associate Dean for Faculty and Academic Affairs in Stanford’s School of Engineering. She served as chair of the Computer Science Department from 2009-2014. Jennifer received her Bachelor’s degree from the Indiana University Jacobs School of Music in 1982 and her Computer Science Ph.D. from Cornell University in 1987. She was a Research Staff Member at the IBM Almaden Research Center before joining the Stanford faculty in 1993. Her research interests span many aspects of nontraditional data management. She is an ACM Fellow and a member of the National Academy of Engineering and the American Academy of Arts & Sciences; she received the ACM-W Athena Lecturer Award in 2015, the ACM SIGMOD Edgar F. Codd Innovations Award in 2007, and a Guggenheim Fellowship in 2000. She has served on a variety of program committees, advisory boards, and editorial boards.

Paris Big Data Management Summit 2016

March 24th, 2016, Paris, France
—————————————————————————————————————
Venue:   Télécom ParisTech
             46, rue Barrault - 75013 Paris


This is a call for submissions and participation for the inaugural Paris Big Data Management Summit, to be held on March 24, 2016. The goal of this all-day summit is to bring together researchers from the greater Paris area with an interest in big data management, together with select industry experts, to discuss our collective research strengths and look for opportunities for future collaborations. The summit will showcase a number of research projects of high relevance and impact, and feature a plenary student poster session to broadly cover big data management projects in the local area. Attendees will also hear from French industry about their data management needs.


Call for Submission
———————————————————
We call for submissions from all researchers and graduate students in the greater Paris area and other select areas in France. Topics of interest include, but are not limited to:
- Databases, 
- Data mining, 
- Web data management,  
- Knowledge management, and 
- Broader big data analytics.

We call for submissions from researchers and students in one of two forms:

- Technical talk: Each technical talk is allotted 15 minutes for presentation, and there will be 6-9 talks at the summit. To submit a technical talk, we require a talk abstract in PDF of at most 2 pages (any format, 10 pt font or larger). The organizing committee will review the talk abstracts to make the final selection.

- Poster: We expect every research project on big data management in the local area to be presented with a poster. It is a great opportunity for the graduate students on a project to present their ideas and latest results. For a poster submission, we require a one-paragraph abstract. All posters on related topics will be accepted.

All presentations must be made in English due to the presence of international participants.  

For submission, please visit the following web page: 



Important Dates
——————————————————
Paper or poster submission:  January 8, 2016
Author notification:         January 22, 2016
Registration deadline:       February 29, 2016 (registration may close earlier if the maximum capacity of the summit venue is reached)
Summit date:                 March 24, 2016

All deadlines are 23:59, Paris time, on the due date.


Event details
——————————————————
The 2016 Paris Big Data Management Summit will be held on March 24, 2016, from 8:30 am to 6:00 pm. The program will consist of keynote speeches, a number of technical talks from you, the participants, and from our industry partners, and finally, a large student poster session and a cocktail social!

Registration is free. Lunch, drinks, and appetizers will be provided.

The event will be held at Telecom ParisTech, located at 46, rue Barrault - 75013 Paris. Transportation options include:
- Metro : Take line 6 to Corvisart station
- RER : Take RER B to Denfert-Rochereau station, then change there to Metro line 6
- Bus : Lines 62 (Vergniaud), 21 (Daviel), or 67 (Bobillot)
- Vélib' : Stations 13022 (27 & 36, rue de la Butte aux Cailles), 13048 (20, rue Wurtz) or 13024 (81, rue Bobillot)
- Autolib' : 245, rue de Tolbiac - 189, rue de Tolbiac - 50, bd. Blanqui


——————————————————

For more information, please visit our website:

Monday, June 1, 2015


Hubie Chen: "One Hierarchy Spawns Another: Graph Deconstructions and the Complexity Classification of Conjunctive Queries", LSV ENS Cachan, June 11th 2015, 10.30 am


Name: Hubie Chen
Title: One Hierarchy Spawns Another: Graph Deconstructions and the Complexity Classification of Conjunctive Queries
Date and time: Thursday, June 11th 2015, at 10.30 am
Location: LSV library, ENS Cachan http://www.lsv.ens-cachan.fr/

Abstract: 
We study the classical problem of conjunctive query evaluation, here restricted according to the set of permissible queries.  In this work, this problem is formulated as the relational homomorphism problem over a set of structures A, wherein each instance must be a pair of structures such that the first structure is an element of A. We present a comprehensive complexity classification of these problems, which strongly links graph-theoretic properties of A to the complexity of the corresponding homomorphism problem. In particular, we define a binary relation on graph classes and completely describe the resulting hierarchy given by this relation. This binary relation is defined in terms of a notion which we call graph deconstruction and which is a variant of the well-known notion of tree decomposition. We then use this graph hierarchy to infer a complexity hierarchy of homomorphism problems which is comprehensive up to a computationally very weak notion of reduction, namely, a parameterized form of quantifier-free reductions. We obtain a significantly refined complexity classification of left-hand side restricted homomorphism problems, as well as a unifying, modular, and conceptually clean treatment of existing complexity classifications, such as the classifications by Grohe-Schwentick-Segoufin (STOC 2001) and Grohe (FOCS 2003, JACM 2007).

In this talk, we will also briefly discuss parameterized complexity classes that we introduced and studied, which capture some of the complexity degrees identified by our classification.
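To make the homomorphism formulation concrete, here is a small illustrative sketch (not from the talk, toy data only): a Boolean conjunctive query holds on a database exactly when the query's canonical structure maps homomorphically into the data structure.

```python
from itertools import product

def homomorphisms(src, dst):
    """Yield all homomorphisms from structure `src` to structure `dst`.

    A structure maps each relation name to a set of tuples. A homomorphism
    h maps src's domain into dst's so that every src tuple lands in the
    corresponding dst relation. Brute force: fine for toy instances only.
    """
    src_dom = sorted({x for ts in src.values() for t in ts for x in t})
    dst_dom = sorted({x for ts in dst.values() for t in ts for x in t})
    for image in product(dst_dom, repeat=len(src_dom)):
        h = dict(zip(src_dom, image))
        if all(tuple(h[x] for x in t) in dst.get(r, set())
               for r, ts in src.items() for t in ts):
            yield h

# The Boolean conjunctive query E(x,y), E(y,z), E(z,x) ("is there a
# triangle?") asks whether its canonical structure maps into the data graph.
query = {"E": {("x", "y"), ("y", "z"), ("z", "x")}}
data = {"E": {(1, 2), (2, 3), (3, 1), (3, 4)}}
print(any(True for _ in homomorphisms(query, data)))  # True: 1-2-3 is a triangle
```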

This talk is based on joint work with Moritz Müller that appeared in PODS ’13 and CSL-LICS ’14.

Wednesday, May 6, 2015

Meng “Jason” Changping, Luis Galárraga: Mining Knowledge from Knowledge Base Networks


Who: Meng “Jason” Changping, Luis Galárraga
When: May 11, 2015, 5pm
Where: Télécom ParisTech (46 rue Barrault, 75013 Paris), Amphi Jade
Seminar formed of two talks on mining knowledge from knowledge base networks, organized in the setting of Télécom ParisTech research chair on Machine Learning and Big Data.

Who: Luis Galárraga, Télécom ParisTech
Title: Applications of Rule Mining in Knowledge Bases
Abstract: The continuous progress of Information Extraction (IE) techniques has led to the construction of large Knowledge Bases (KBs) containing facts about millions of entities such as people, organizations, and places. KBs are important nowadays because they allow computers to understand the real world, and they are used in multiple domains and applications. Furthermore, the discovery of useful and non-trivial patterns in KBs, known as rule mining, opens the door to multiple applications in the areas of data analysis, prediction, and automatic data engineering. In this talk I present an overview of our ongoing work on rule mining on KBs and some of its applications. The scale of current KBs, as well as their inherent incompleteness and noise, makes this endeavor challenging.
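As a toy illustration of the rule-mining idea (a simplified sketch, not the speaker's system: the KB, relation names, and the use of standard rather than PCA confidence are all assumptions), the snippet below scores one hypothetical Horn rule over a hand-made KB of triples.

```python
# Toy KB as a set of (subject, relation, object) triples.
kb = {
    ("alice", "worksAt", "tp"), ("tp", "locatedIn", "paris"),
    ("alice", "livesIn", "paris"),
    ("bob", "worksAt", "tp"),           # bob's residence is unknown
}

def rule_stats(kb):
    """Support and standard confidence of the hypothetical Horn rule
       worksAt(x,z) & locatedIn(z,y) => livesIn(x,y)."""
    # All (x, y) pairs satisfying the rule body, by joining on z.
    body = {(x, y)
            for (x, r1, z1) in kb if r1 == "worksAt"
            for (z2, r2, y) in kb if r2 == "locatedIn" and z1 == z2}
    # Support: body matches for which the head is already in the KB.
    support = sum(1 for (x, y) in body if (x, "livesIn", y) in kb)
    return support, support / len(body)

support, conf = rule_stats(kb)
print(support, conf)  # 1 0.5 -- the rule also predicts livesIn(bob, paris)
```

Under the KB's open-world assumption, the unmatched body instance is not a counterexample but a candidate prediction, which is what makes rules useful for completing the KB.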
Who: Meng “Jason” Changping, PhD candidate, Purdue University
Title: Discovering Meta-Paths in Large Heterogeneous Information Networks
Abstract: The Heterogeneous Information Network (HIN) is a graph data model in which nodes and edges are annotated with class and relationship labels. Large and complex datasets, such as Yago or DBLP, can be modeled as HINs. Recent work has studied how to make use of these rich information sources. In particular, meta-paths, which represent sequences of node classes and edge types between two nodes in a HIN, have been proposed for such tasks as information retrieval, decision making, and product recommendation. Current methods assume meta-paths are found by domain experts. However, in a large and complex HIN, retrieving meta-paths manually can be tedious and difficult. We thus study how to discover meta-paths automatically. Specifically, users are asked to provide example pairs of nodes that exhibit high proximity. We then investigate how to generate meta-paths that can best explain the relationship between these node pairs. Since this problem is computationally intractable, we propose a greedy algorithm to select the most relevant meta-paths. We also present a data structure to enable efficient execution of this algorithm. We further incorporate hierarchical relationships among node classes in our solutions. Extensive experiments on real-world HINs show that our approach captures important meta-paths in an efficient and scalable manner.
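The meta-path notion can be made concrete with a small sketch (toy bibliographic data, not the authors' algorithm): enumerate the class/edge-type sequences connecting an example node pair, counting how many path instances support each.

```python
from collections import deque

# Toy HIN: a class label per node, plus typed, directed edges.
node_class = {"a1": "Author", "p1": "Paper", "p2": "Paper", "v1": "Venue"}
edges = {("a1", "writes", "p1"), ("a1", "writes", "p2"),
         ("p1", "publishedIn", "v1"), ("p2", "publishedIn", "v1")}

def meta_paths(src, dst, max_len=3):
    """Enumerate meta-paths (alternating class / edge-type sequences)
    linking src to dst by BFS over the instance graph, counting the
    path instances that support each meta-path."""
    counts = {}
    queue = deque([(src, (node_class[src],))])
    while queue:
        node, mpath = queue.popleft()
        if node == dst and len(mpath) > 1:
            counts[mpath] = counts.get(mpath, 0) + 1
        if len(mpath) // 2 >= max_len:   # mpath holds this many edges so far
            continue
        for (u, etype, v) in edges:
            if u == node:
                queue.append((v, mpath + (etype, node_class[v])))
    return counts

print(meta_paths("a1", "v1"))
# Author -writes-> Paper -publishedIn-> Venue, supported by 2 path instances
```

The talk's contribution is doing this at scale, where exhaustive enumeration like the above is intractable and a greedy selection of the most explanatory meta-paths is needed.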

Thursday, March 12, 2015

Helena Galhardas: Speeding up information extraction programs

Who: Helena Galhardas
When: March 27, 2 pm
Where: PCRI, room 445 (see also Access to PCRI)

Title: Speeding up information extraction programs: a holistic optimizer and a learning-based approach to rank documents

Abstract: 

A wealth of information produced by individuals and organizations is expressed in natural language text. Text lacks the explicit structure that is necessary to support rich querying and analysis. Information extraction systems are sophisticated software tools to discover structured information in natural language text. Unfortunately, information extraction is a challenging and time-consuming task.

In this talk, I will first present our proposal to optimize information extraction programs. It consists of a holistic approach that focuses on: (i) optimizing all key aspects of the information extraction process collectively and in a coordinated manner, rather than focusing on individual subtasks in isolation; (ii) accurately predicting the execution time, recall, and precision for each information extraction execution plan; and (iii) using these predictions to choose the best execution plan to execute a given information extraction program.

Then, I will briefly present a principled, learning-based approach for ranking documents according to their potential usefulness for an extraction task. Our online learning-to-rank methods exploit the information collected during extraction, as we process new documents and the fine-grained characteristics of the useful documents are revealed. Then, these methods decide when the ranking model should be updated, hence significantly improving the document ranking quality over time.

This is joint work with Gonçalo Simões, INESC-ID and IST/University of Lisbon, and Pablo Barrio and Luis Gravano from Columbia University, NY.

Pour en savoir plus : http://web.ist.utl.pt/helena.galhardas/

Yanlei Diao: Big Data Analytics for Large-Scale Scientific Applications

Who: Yanlei Diao
When: March 17, 14:30
Where: Univ. Paris Sud, bâtiment 660, amphi Claude Shannon, Rue Noetzlin, 91190 Gif-sur-Yvette
(The 660 building on Google Maps)
See also access to PCRI.

Title: Big Data Analytics for Large-Scale Scientific Applications

Abstract:

As scientific applications are producing data at an unprecedented rate, they have become a main driving force of the big data field. Meanwhile, intelligent, scalable data management has become crucial to large-scale scientific applications such as computational astrophysics and genomics. In this talk, I present our recent work on platform and algorithm design to support such applications.

First, I show how we design a new storage system, Claro, based on the recently proposed array model, to store and process scientific data that are inherently noisy and uncertain. We propose a suite of storage and evaluation strategies to support array operations under data uncertainty. Results from Sloan Digital Sky Survey (SDSS) datasets show that our techniques outperform state-of-the-art index methods by 1.7x-4.3x for the Subarray operation and by 1-2 orders of magnitude for Structure-Join.

Second, motivated by the needs of low-latency genomic data processing, I present our design of a “big and fast” data analytics system, Scalla. Scalla achieves scalable, low-latency (real-time) processing in a unified system by seamlessly integrating data parallelism, incremental processing, and distributed resource planning. Scalla outperforms existing fast data systems by 1-2 orders of magnitude in throughput and latency combined. Finally, I show some initial results of applying Scalla in the genomics domain.

Bio:

Yanlei Diao is Associate Professor of Computer Science at the University of Massachusetts Amherst. Her research interests are in information architectures and data management systems, with a focus on big data analytics, scientific analytics, data streams, uncertain data management, and RFID and sensor data management. She received her PhD in Computer Science from the University of California, Berkeley in 2005, her M.S. in Computer Science from the Hong Kong University of Science and Technology in 2000, and her B.S. in Computer Science from Fudan University in 1998.

Yanlei Diao was a recipient of the 2013 CRA-W Borg Early Career Award (one female computer scientist selected each year), the IBM Scalable Innovation Faculty Award, and an NSF CAREER Award, and she was a finalist for the Microsoft Research New Faculty Award. She spoke at the Distinguished Faculty Lecture Series at the University of Texas at Austin. Her PhD dissertation “Query Processing for Large-Scale XML Message Brokering” won the 2006 ACM-SIGMOD Dissertation Award Honorable Mention. She is currently Editor-in-Chief of the ACM SIGMOD Record, Associate Editor of ACM TODS, Area Chair of SIGMOD 2015, and a member of the SIGMOD Executive Committee and SIGMOD Software Systems Award Committee. In the past, she has served as Associate Editor of PVLDB, as an organizing committee member of SIGMOD, CIDR, DMSN, and the New England Database Summit, and on the program committees of many international conferences and workshops. Her research has been strongly supported by industry, with awards from Google, IBM, Cisco, NEC Labs, and the Advanced Cybersecurity Center.

Cristina Sirangelo: Querying Incomplete Data

When: Thursday, March 12, at 10:30am

Where: PCRI building, room 455

Who: Cristina Sirangelo

Title: Querying incomplete data

Abstract:
Data is incomplete when it contains missing/unknown information, or more generally when it is only partially available, e.g. because of restrictions on data access.

Incompleteness is receiving a renewed interest as it is naturally generated in data interoperation, a very common framework for today's data-centric applications. In this setting data is decentralized, needs to be integrated from several sources and exchanged between different applications. Incompleteness arises from the semantic and syntactic heterogeneity of different data sources.

Querying incomplete data is usually an expensive task. In this talk we survey the state of the art and recent developments on the tractability of querying incomplete data, under different possible interpretations of incompleteness.
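For intuition on why this is expensive, here is a brute-force sketch of certain answers, the standard semantics for querying incomplete data: a tuple is certain if the query returns it on every completion of the unknown values. The relation, domain, and null marker below are invented for illustration; real techniques avoid this exponential enumeration.

```python
from itertools import product

# Incomplete relation R(name, city); "⊥1" is a marked null (unknown value).
R = [("alice", "paris"), ("bob", "⊥1")]
domain = ["paris", "lyon"]   # possible values the unknown may take
nulls = ["⊥1"]

def certain_paris_residents():
    """Certain answers to "who lives in paris?": the names returned by
    the query on *every* completion of R, i.e. every way of replacing
    the nulls with domain values. Exponential in the number of nulls."""
    answers = None
    for valuation in product(domain, repeat=len(nulls)):
        v = dict(zip(nulls, valuation))
        completion = [tuple(v.get(x, x) for x in t) for t in R]
        result = {name for (name, city) in completion if city == "paris"}
        answers = result if answers is None else answers & result
    return answers

print(certain_paris_residents())  # {'alice'}: bob's city is unknown
```

The tractability results surveyed in the talk identify query classes where such certain answers can be computed without enumerating completions.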

Thursday, January 29, 2015

Nicoleta Preda: ANGIE in Wonderland, Friday, February 13, 2 pm, PCRI, room 445


When: Friday, February 13, at 14:00

Where: PCRI building, room 445 (address)

Who: Nicoleta Preda

Title: ANGIE in wonderland

Abstract:
In recent years, several important content providers, such as Amazon, Musicbrainz, IMDb, Geonames, Google, and Twitter, have chosen to export their data through Web services. To unleash the potential of these sources for new intelligent applications, the data has to be combined across different APIs. To this end, we have developed ANGIE, a framework that maps the knowledge provided by Web services dynamically into a local knowledge base. ANGIE represents Web services as views with binding patterns over the schema of the knowledge base. In this talk, I will focus on two problems related to our framework.

In the first part, the focus will be on the automatic integration of new Web services. I will present a novel algorithm for inferring the view definition of a given Web service in terms of the schema of the global knowledge base. The algorithm also generates a declarative script that can transform the call results into results of the view. Our experiments on real Web services show the viability of our approach.

The second part will address the evaluation of conjunctive queries under a budget of calls. Conjunctive queries may require an unbounded number of calls in order to compute the maximal answers. However, Web services typically allow only a fixed number of calls per session. Therefore, we have to prioritize query evaluation plans. We are working on distinguishing, among all plans that could return answers, those plans that actually will. Finally, I will show an application of this new notion of plans.
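To illustrate what a view with a binding pattern and a call budget look like, here is a toy sketch (service names, data, and the chaining query are invented, not ANGIE's actual interface): each service requires its first argument bound as input, so answering a query means chaining calls, and a per-session call limit can leave answers incomplete.

```python
# Web services modeled as views with binding patterns: the first
# argument must be bound (input), the second is returned (output).
SERVICES = {
    # getAlbums(singer^in) -> album   ("^" marks the bound argument)
    "getAlbums": {"adele": ["19", "21"]},
    # getSongs(album^in) -> song
    "getSongs": {"19": ["Chasing Pavements"], "21": ["Someone Like You"]},
}

def songs_of(singer, budget):
    """Answer 'songs of a singer' by chaining getAlbums -> getSongs,
    stopping when the call budget is exhausted (services typically
    cap the number of calls per session)."""
    calls, songs = 0, []
    if calls >= budget:
        return songs
    albums = SERVICES["getAlbums"].get(singer, [])
    calls += 1
    for album in albums:
        if calls >= budget:
            break                     # budget exhausted: answers incomplete
        songs += SERVICES["getSongs"].get(album, [])
        calls += 1
    return songs

print(songs_of("adele", budget=3))  # both songs: 1 + 2 calls fit the budget
print(songs_of("adele", budget=2))  # only the first album's songs
```

Choosing which calls to spend the budget on, i.e. which plans will actually return answers, is exactly the prioritization problem of the second part of the talk.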

Short Bio:
Nicoleta Preda obtained her Ph.D. in computer science from the University Paris-Sud under the supervision of Serge Abiteboul and Ioana Manolescu. Before joining the University of Versailles in 2010, she was a post-doctoral researcher in the database group led by Gerhard Weikum at the Max Planck Institute for Informatics. Her research interests include the enrichment of KBs with dynamic data, rule mining, and querying large repositories of semi-structured data. Nicoleta teaches classes on data integration, database systems, XML technologies, and Web services.

Monday, January 19, 2015

Paolo Papotti: Beyond declarative mapping and cleaning, Feb 2, 2015, 2 pm, PCRI, room 445

When: Monday, February 2, at 14:00

Where: PCRI building, room 445

Who: Paolo Papotti

Title: Beyond declarative mapping and cleaning

Abstract:
In the "big data" era, data integration is a popular activity both in academia and in industry. Integrating hundreds of heterogeneous sources on a daily basis requires a great amount of manual work in order to obtain data that is polished enough to be useful in the final applications, such as querying and mining. The problem is even harder in practice, as data is often dirty in nature because of typos, duplicates, and so on, which can lead to poor results in the analytic tasks.

Over the last ten years, several successful systems have been proposed to tackle this challenge with a formal, declarative approach based on first order logic. However, despite the positive results, there is still a gap between these proposals and the leading commercial systems. The latter are harder to maintain, to debug, and to test, but provide the level of personalization and detail that are needed to solve “real-world” problems. In this talk, I will describe some of my results in tackling mapping and cleaning with a declarative approach, and how this experience has pushed me to explore a new way that can take the best of both worlds.

Short Bio:
Paolo Papotti is a scientist in the Data Analytics center at the Qatar Computing Research Institute (QCRI). He holds a Ph.D. degree in computer science from Roma Tre University (Italy, 2007), where he was also an Assistant Professor before joining QCRI. He has held visiting appointments at IBM Almaden (USA) and UC Santa Cruz (USA). His research topics are in the general area of information integration and data quality.

Tuesday, January 6, 2015

Yanlei Diao: "Supporting Scientific Analytics under Data Uncertainty and Query Uncertainty", PCRI, Jan 16, 2015, 10 am

Title: Supporting Scientific Analytics under Data Uncertainty and Query Uncertainty
 
Location: PCRI (https://www.lri.fr/info.pratiques.php), room 455

Date and time: January 16, 2015, 10 am
Abstract:

Data management is becoming increasingly important in large-scale scientific applications such as computational astrophysics, severe weather monitoring, and genomics.  In this talk, I present our recent work to address two major challenges raised by those scientific applications. The first challenge regards “data uncertainty”, due to the fact that scientific measurements are inherently noisy and uncertain. In particular, we address uncertain data management under the array model, which has gained popularity for large-scale scientific data processing due to performance benefits. We propose a suite of storage and evaluation strategies to support array operations under data uncertainty. Results from Sloan Digital Sky Survey (SDSS) datasets show that our techniques outperform state-of-the-art methods by 1.7x to 4.3x for the Subarray operation and 1 to 2 orders of magnitude for Structure-Join.
As scientific data continues to grow in size and diversity, it is becoming harder for the user to express her data interests precisely in a formal language like SQL. We refer to this second problem as “query uncertainty.” It leads to a strong need for “interactive data exploration,” a service that efficiently navigates the user through a large data space to identify the objects of interest. We present our initial work on interactive data exploration, with results suggesting that it is possible to predict user interests modeled by conjunctive queries with a small number of samples, while providing interactive performance.

Bio:
Yanlei Diao is Associate Professor of Computer Science at the University of Massachusetts Amherst. Her research interests are in information architectures and data management systems, with a focus on big data analytics, scientific analytics, data streams, uncertain data management, and RFID and sensor data management. She received her PhD in Computer Science from the University of California, Berkeley in 2005, her M.S. in Computer Science from the Hong Kong University of Science and Technology in 2000, and her B.S. in Computer Science from Fudan University in 1998. 

Yanlei Diao was a recipient of the 2013 CRA-W Borg Early Career Award (one female computer scientist selected each year), IBM Scalable Innovation Faculty Award, and NSF Career Award, and she was a finalist of the Microsoft Research New Faculty Award. She spoke at the Distinguished Faculty Lecture Series at the University of Texas at Austin. Her PhD dissertation “Query Processing for Large-Scale XML Message Brokering” won the 2006 ACM-SIGMOD Dissertation Award Honorable Mention. She is currently Editor-in-Chief of the ACM SIGMOD Record, Associate Editor of ACM TODS, Area Chair of SIGMOD 2015, and member of the SIGMOD Executive Committee and SIGMOD Software Systems Award Committee. In the past, she has served as Associate Editor of PVLDB, organizing committee member of SIGMOD, CIDR, DMSN, and the New England Database Summit, as well as on the program committees of many international conferences and workshops. Her research has been strongly supported by industry with awards from Google, IBM, Cisco, NEC labs, and the Advanced Cybersecurity Center.

Thursday, January 1, 2015

Access to PCRI

Physical Address

Our building’s physical address is:
Université Paris-Sud 11
Bâtiment 650 (PCRI)
Rue Noetzlin, 91190 Gif-sur-Yvette
France

GPS coordinates: 48.712346, 2.168362

Directions using public transportation: two alternatives

  • (Currently the most efficient option using public transportation:) Take RER line B towards Saint-Rémy-lès-Chevreuse, get off at Massy-Palaiseau, then take a 91.06 bus (either 91.06 B or 91.06 C, never 91.06 A; the 91.10 might also work: ask the driver whether it stops there) to IUT – Pôle d’Ingénierie, and walk the last 150 m [map]. To find the bus stop in Massy: [map].
  • Take RER line B towards Saint-Rémy-lès-Chevreuse, get off at Le Guichet, then either
    • take the bus: Once you get out at Le Guichet (coming from Paris or the Parisian airports), cross under the tracks, exit the station, go around the corner past the café, cross the street, go down the stairs, and cross again to the bus station [map]. Take bus 9. The bus schedules are available here. Get off at “IUT – Pôle d’ingénierie” (the first stop). The bus ride takes 4 minutes.
    • come on foot: Coming from Paris, start by crossing the rails through the underground passage. Then take Rue de Versailles (perpendicular to the rails, in front of the train station) for two blocks. Turn left onto Rue de la Colline, which goes uphill. When you reach the end of that street (almost at the top), continue right on Chemin du Bois des Rames. Keep going in that direction onto Rue Nicolas Appert. Turn left onto Rue d’Arsonval and continue until you reach Rue Noetzlin. The nearest bus stop is “Moulon”. Allow a 25-30 minute walk, depending on your pace.