Thursday, March 12, 2015

Helena Galhardas: Speeding up information extraction programs

Who: Helena Galhardas
When: March 27, 2 pm
Where: PCRI, room 445 (see also Access to PCRI)

Title: Speeding up information extraction programs: a holistic optimizer and a learning-based approach to rank documents

Abstract: 

A wealth of information produced by individuals and organizations is expressed in natural language text. Text lacks the explicit structure that is necessary to support rich querying and analysis. Information extraction systems are sophisticated software tools to discover structured information in natural language text. Unfortunately, information extraction is a challenging and time-consuming task.

In this talk, I will first present our proposal to optimize information extraction programs. It consists of a holistic approach that focuses on: (i) optimizing all key aspects of the information extraction process collectively and in a coordinated manner, rather than focusing on individual subtasks in isolation; (ii) accurately predicting the execution time, recall, and precision for each information extraction execution plan; and (iii) using these predictions to choose the best execution plan for a given information extraction program.
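To make step (iii) concrete, here is a minimal sketch of picking a plan from predicted execution time, recall, and precision. It is only an illustration and not the optimizer presented in the talk: the names `PlanEstimate` and `choose_plan`, the plan names, the numbers, and the selection rule (fastest plan that meets minimum recall and precision) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PlanEstimate:
    """Hypothetical predictions for one execution plan (illustrative only)."""
    name: str
    seconds: float    # predicted execution time
    recall: float     # predicted recall, in [0, 1]
    precision: float  # predicted precision, in [0, 1]


def choose_plan(plans: Iterable[PlanEstimate],
                min_recall: float,
                min_precision: float) -> Optional[PlanEstimate]:
    """Return the fastest plan meeting both quality thresholds, or None.

    Assumed selection rule: the abstract does not say how "best" is defined.
    """
    feasible = [p for p in plans
                if p.recall >= min_recall and p.precision >= min_precision]
    return min(feasible, key=lambda p: p.seconds, default=None)


# Made-up estimates for three hypothetical plans.
plans = [
    PlanEstimate("exhaustive_scan", seconds=3600.0, recall=0.95, precision=0.80),
    PlanEstimate("keyword_filter", seconds=600.0, recall=0.70, precision=0.85),
    PlanEstimate("classifier_filter", seconds=900.0, recall=0.85, precision=0.82),
]

best = choose_plan(plans, min_recall=0.8, min_precision=0.8)
print(best.name if best else "no feasible plan")
```

Other definitions of "best" are equally plausible, for instance maximizing recall under a time budget; the point is only that the choice is driven by per-plan predictions rather than by running every plan.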

Then, I will briefly present a principled, learning-based approach for ranking documents according to their potential usefulness for an extraction task. Our online learning-to-rank methods exploit the information collected during extraction, as we process new documents and the fine-grained characteristics of the useful documents are revealed. Then, these methods decide when the ranking model should be updated, hence significantly improving the document ranking quality over time.

This is joint work with Gonçalo Simões, INESC-ID and IST/University of Lisbon, and Pablo Barrio and Luis Gravano from Columbia University, NY.

For more information: http://web.ist.utl.pt/helena.galhardas/

Yanlei Diao: Big Data Analytics for Large-Scale Scientific Applications

Who: Yanlei Diao
When: March 17, 14:30
Where: Univ. Paris Sud, building 660, Claude Shannon amphitheater, Rue Noetzlin, 91190 Gif-sur-Yvette
(The 660 building on Google Maps)
See also access to PCRI.

Title: Big Data Analytics for Large-Scale Scientific Applications

Abstract:

As scientific applications are producing data at an unprecedented rate, they have become a main driving force of the big data field. Meanwhile, intelligent, scalable data management has become crucial to large-scale scientific applications such as computational astrophysics and genomics. In this talk, I present our recent work on platform and algorithm design to support such applications.

First, I show how we design a new storage system, Claro, based on the recently proposed array model, to store and process scientific data that are inherently noisy and uncertain. We propose a suite of storage and evaluation strategies to support array operations under data uncertainty. Results from Sloan Digital Sky Survey (SDSS) datasets show that our techniques outperform state-of-the-art index methods by 1.7x-4.3x for the Subarray operation and 1-2 orders of magnitude for Structure-Join.

Second, motivated by the needs of low-latency genomic data processing, I present our design of a “big and fast” data analytics system, Scalla. Scalla achieves scalable, low-latency (real-time) processing in a unified system by seamlessly integrating data parallelism, incremental processing, and distributed resource planning. Scalla outperforms existing fast data systems by 1-2 orders of magnitude in throughput and latency combined. Finally, I show some initial results of applying Scalla in the genomics domain.

Bio:

Yanlei Diao is Associate Professor of Computer Science at the University of Massachusetts Amherst. Her research interests are in information architectures and data management systems, with a focus on big data analytics, scientific analytics, data streams, uncertain data management, and RFID and sensor data management. She received her PhD in Computer Science from the University of California, Berkeley in 2005, her M.S. in Computer Science from the Hong Kong University of Science and Technology in 2000, and her B.S. in Computer Science from Fudan University in 1998.

Yanlei Diao was a recipient of the 2013 CRA-W Borg Early Career Award (one female computer scientist selected each year), IBM Scalable Innovation Faculty Award, and NSF Career Award, and she was a finalist for the Microsoft Research New Faculty Award.
She spoke at the Distinguished Faculty Lecture Series at the University of Texas at Austin. Her PhD dissertation “Query Processing for Large-Scale XML Message Brokering” won the 2006 ACM-SIGMOD Dissertation Award Honorable Mention.
She is currently Editor-in-Chief of the ACM SIGMOD Record, Associate Editor of ACM TODS, Area Chair of SIGMOD 2015, and member of the SIGMOD Executive Committee and SIGMOD Software Systems Award Committee.
In the past, she has served as Associate Editor of PVLDB, organizing committee member of SIGMOD, CIDR, DMSN, and the New England Database Summit, as well as on the program committees of many international conferences and workshops.
Her research has been strongly supported by industry with awards from Google, IBM, Cisco, NEC labs, and the Advanced Cybersecurity Center.

Cristina Sirangelo: Querying Incomplete Data

When: Thursday, March 12, at 10:30am

Where: PCRI building, room 455

Who: Cristina Sirangelo

Title: Querying incomplete data

Abstract:
Data is incomplete when it contains missing/unknown information, or more generally when it is only partially available, e.g. because of restrictions on data access.

Incompleteness is receiving a renewed interest as it is naturally generated in data interoperation, a very common framework for today's data-centric applications. In this setting data is decentralized, needs to be integrated from several sources and exchanged between different applications. Incompleteness arises from the semantic and syntactic heterogeneity of different data sources.

Querying incomplete data is usually an expensive task. In this talk we survey the state of the art and recent developments on the tractability of querying incomplete data, under different possible interpretations of incompleteness.