Tutorials

Efficient Big Data Processing in Hadoop MapReduce

Abstract: This tutorial is motivated by the clear need of many organizations, companies, and researchers to deal with big data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. A popular data processing engine for big data is Hadoop MapReduce. Early versions of Hadoop MapReduce suffered from severe performance problems; today, those problems are largely history. There are many techniques that can be used with Hadoop MapReduce jobs to boost performance by orders of magnitude. In this tutorial we teach such techniques. First, we will briefly familiarize the audience with Hadoop MapReduce and motivate its use for big data processing. Then, we will focus on different data management techniques, going from job optimization to physical data organization such as data layouts and indexes. Throughout this tutorial, we will highlight the similarities and differences between Hadoop MapReduce and parallel DBMSs. Furthermore, we will point out unresolved research problems and open issues.

Jens Dittrich (@jensdittrich) is an Associate Professor of Computer Science/Databases at Saarland University, Germany. Previous affiliations include U Marburg, SAP AG, and ETH Zurich. He received an Outrageous Ideas and Vision Paper Award at CIDR 2011, a CS teaching award for database systems, as well as several presentation and science slam awards. His research focuses on fast access to big data.

Jorge-Arnulfo Quiané-Ruiz is a research associate at Saarland University, Germany. His research interests include databases, distributed data management, and big data analytics. From October 2008 to April 2009, he worked as a research engineer at INRIA with Patrick Valduriez. He was awarded a fellowship from the Mexican National Council of Technology (CONACyT) to do his Ph.D. in Computer Science at INRIA and the University of Nantes, France, where he received his Ph.D. diploma in September 2008. He received an M.Sc. in Computer Science with a speciality in Distributed Systems from Joseph Fourier University, Grenoble, France, in July 2004. He obtained, with highest honors, an M.Sc. in Computer Science from the National Polytechnic Institute, Mexico, in August 2003.

MapReduce Algorithms for Big Data Analysis

Abstract: There is a growing number of applications that must handle big data. However, analyzing big data is a very challenging problem today. For such applications, the MapReduce framework has recently attracted a lot of attention. MapReduce is a programming model that allows easy development of scalable parallel applications to process big data on large clusters of commodity machines. Google's MapReduce and its open-source equivalent, Hadoop, are powerful tools for building such applications. In this tutorial, we will introduce the MapReduce framework based on Hadoop, discuss how to design efficient MapReduce algorithms, and present the state of the art in MapReduce algorithms for data mining, machine learning, and similarity joins. The intended audience of this tutorial is professionals who plan to design and develop MapReduce algorithms and researchers who want to be aware of the state of the art in MapReduce algorithms available today for big data analysis.
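To give a flavor of the programming model the tutorial introduces, the classic word-count example can be sketched in plain Python. This is a toy, single-machine stand-in for Hadoop's distributed implementation; the function names below are illustrative and not part of any Hadoop API:

```python
from collections import defaultdict

def map_phase(documents, map_fn):
    """Apply the user's map function to every input record."""
    pairs = []
    for doc in documents:
        pairs.extend(map_fn(doc))
    return pairs

def shuffle(pairs):
    """Group intermediate values by key (in Hadoop, done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply the user's reduce function to each key group."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count: map emits (word, 1) pairs, reduce sums the counts per word.
def wc_map(doc):
    return [(word, 1) for word in doc.split()]

def wc_reduce(word, counts):
    return sum(counts)

docs = ["big data analysis", "big data processing"]
result = reduce_phase(shuffle(map_phase(docs, wc_map)), wc_reduce)
# result == {"big": 2, "data": 2, "analysis": 1, "processing": 1}
```

The user supplies only the two functions; the framework handles partitioning, shuffling, and fault tolerance across the cluster.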

Kyuseok Shim is currently a Full Professor at Seoul National University in Korea. Before that, he was an Assistant Professor at KAIST and a member of technical staff for the Serendip Data Mining Project at Bell Laboratories. He was also a member of the Quest Data Mining Project at the IBM Almaden Research Center. He received the BS degree in Electrical Engineering from Seoul National University in 1986, and the MS and PhD degrees in Computer Science from the University of Maryland at College Park in 1988 and 1993, respectively. He has been working in the areas of databases and data mining. His writings have appeared in a number of professional conferences and journals, including ACM, VLDB, and IEEE publications. He previously served on the editorial boards of the VLDB Journal and IEEE TKDE. He also served as a PC member for the SIGKDD, SIGMOD, ICDE, ICDM, PAKDD, VLDB, and WWW conferences.

Entity Resolution: Theory, Practice and Open Challenges

Abstract: Entity resolution (ER), the problem of extracting, matching, and resolving entity mentions in structured and unstructured data, is a long-standing challenge in database management, information retrieval, machine learning, natural language processing, and statistics. Accurate and fast entity resolution has huge practical implications in a wide variety of commercial, scientific, and security domains. Despite the long history of work on entity resolution, there is still a surprising diversity of approaches and a lack of guiding theory. Meanwhile, in the age of big data, the need for high-quality entity resolution is growing, as we are inundated with more and more data, all of which needs to be integrated, aligned, and matched before further utility can be extracted. In this tutorial, we bring together perspectives on entity resolution from a variety of fields, including databases, information retrieval, natural language processing, and machine learning, to provide, in one setting, a survey of a large body of work. We discuss both the practical aspects and theoretical underpinnings of ER. We describe existing solutions, current challenges, and open research problems. In addition to giving attendees a thorough understanding of existing ER models, algorithms, and evaluation methods, the tutorial will cover important research topics such as scalable ER, active and lightly supervised ER, and query-driven ER.
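To make the scalability challenge concrete, one standard trick that falls under scalable ER is blocking: only records that share a cheap key are compared, instead of all n² pairs. The sketch below is illustrative only; the blocking key, similarity function, and threshold are arbitrary choices for the example, not recommendations from the tutorial:

```python
from collections import defaultdict
from difflib import SequenceMatcher

def blocking_key(record):
    # Hypothetical key: first three letters of the name field.
    return record["name"][:3].lower()

def candidate_pairs(records):
    """Blocking: compare only records sharing a key, not all n^2 pairs."""
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r)
    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                yield block[i], block[j]

def match(a, b, threshold=0.8):
    """Toy matcher: edit-based string similarity on the name field."""
    return SequenceMatcher(None, a["name"], b["name"]).ratio() >= threshold

records = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "Jonathon Smith"},
    {"id": 3, "name": "Alice Jones"},
]
matches = [(a["id"], b["id"]) for a, b in candidate_pairs(records) if match(a, b)]
# matches == [(1, 2)]
```

Real ER systems replace each of these pieces with learned or domain-specific components, but the block-then-match structure is the same.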

Lise Getoor is an associate professor in the Computer Science Department at the University of Maryland, College Park. Her primary research interests are in machine learning and reasoning with uncertainty, applied to structured and semi-structured data. She also works in data integration, social network analysis, and visual analytics. She has published numerous articles in machine learning, data mining, database, and artificial intelligence forums. She received her PhD from Stanford University in 2001. She was awarded an NSF CAREER Award, and has served as action editor for the Machine Learning Journal, JAIR associate editor, and TKDD associate editor. She is a board member of the International Machine Learning Society, has been a member of the AAAI Executive Council, was PC co-chair of ICML 2011, and has served on a variety of program committees, including AAAI, ICML, IJCAI, ISWC, KDD, SIGMOD, UAI, VLDB, and WWW.

Ashwin Machanavajjhala is an Assistant Professor in the Department of Computer Science, Duke University. Previously, he was a Senior Research Scientist in the Knowledge Management group at Yahoo! Research. His primary research interests lie in data privacy and security, big-data management and statistical methods for information extraction and entity resolution. Ashwin graduated with a Ph.D. from the Department of Computer Science, Cornell University. His thesis work on defining and enforcing privacy was awarded the 2008 ACM SIGMOD Jim Gray Dissertation Award Honorable Mention. He has also received an M.S. from Cornell University and a B.Tech in Computer Science and Engineering from the Indian Institute of Technology, Madras.

I/O Characteristics of NoSQL Databases

Abstract: The proliferation of the so-called NoSQL databases in the last few years has brought about a new model of using storage systems. While traditional relational database systems took advantage of features offered by centrally-managed, enterprise-class frame arrays, the new generation of database systems with weaker data consistency models is content with using and managing locally attached individual storage devices and providing data reliability and availability through high-level software features and protocols.

This tutorial aims to review the architecture of several existing NoSQL databases, with an emphasis on how they organize and access data in the shared-nothing, locally-attached storage model. It shows how these systems operate under typical workloads (new inserts, point queries, and range queries) and what access patterns they present to the underlying storage. The tutorial examines how several recently developed key/value stores, schema-free document stores (e.g., MongoDB or CouchDB), and extensible column stores (e.g., HBase or Cassandra) organize data on local filesystems on top of directly-attached disks, and what system features they must (re)implement in order to provide the expected data reliability in the face of component and node failures.
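Several of the systems surveyed (HBase and Cassandra among them) follow a log-structured design: writes are buffered in memory and periodically flushed as sorted, immutable runs, turning random updates into sequential I/O. The toy sketch below illustrates only that core idea; all names are hypothetical, and real systems add write-ahead logging, compaction, and per-run indexes:

```python
class TinyLSMStore:
    """Hypothetical log-structured store: buffer writes in memory,
    flush them as sorted immutable runs (one sequential write each)."""

    def __init__(self, flush_threshold=4):
        self.memtable = {}        # in-memory write buffer
        self.sstables = []        # sorted immutable runs, newest last
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One sequential write of a sorted, immutable run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # freshest data first
            return self.memtable[key]
        for run in reversed(self.sstables):  # newer runs shadow older ones
            for k, v in run:
                if k == key:
                    return v
        return None

store = TinyLSMStore(flush_threshold=2)
store.put("a", 1)
store.put("b", 2)   # reaches the threshold and triggers a flush
store.put("a", 3)   # newer value in the memtable shadows the flushed one
```

Note the read path: lookups must consult the memtable and then each run from newest to oldest, which is exactly the read amplification that compaction and Bloom filters mitigate in production systems.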

Jiri Schindler is a member of technical staff at the NetApp Advanced Technology Group, where he works on storage architectures integrating flash memory and disk drives in support of applications for management of (semi)structured data. While getting his PhD at Carnegie Mellon University, he and his colleagues designed and built the Fates (Clotho, Atropos, and Lachesis) system for efficient execution of mixed database workloads with different I/O profiles. Jiri is also an adjunct professor at Northeastern University, where he teaches storage systems classes.

Secure and Privacy-Preserving Data Services in the Cloud: A Data Centric View

Abstract: Cloud computing has become a successful paradigm for data computing and storage. However, increasing concerns about data security and privacy in the cloud have emerged. Ensuring security and privacy for data management and query processing in the cloud is critical for better and broader use of the cloud. This tutorial covers some common cloud security and privacy threats and the relevant research, focusing on work that protects data confidentiality and query access privacy for sensitive data stored and queried in the cloud. We provide a comprehensive study of state-of-the-art schemes and techniques for protecting data confidentiality and access privacy, which make different tradeoffs in the multidimensional space of security, privacy, functionality, and performance.

Divyakant Agrawal is a Professor of Computer Science at University of California, Santa Barbara.  His research expertise is in the areas of database systems, distributed computing, data warehousing, and large-scale information systems.

Amr El Abbadi is a Professor of Computer Science at University of California, Santa Barbara.  His research interests lie in the area of scalable database and distributed systems.

Shiyuan Wang is a PhD candidate in the Computer Science Department at the University of California, Santa Barbara. Her recent research interests are in data security and privacy.

Understanding and Managing Cascades on Large Graphs

Abstract: How do contagions spread in population networks? Which group should we market to in order to maximize product penetration? Will a given YouTube video go viral? Who are the best people to vaccinate? What happens when two products compete? The objective of this tutorial is to provide an intuitive and concise overview of the most important theoretical results and algorithms that help us understand and manipulate such propagation-style processes on large networks. The tutorial contains three parts: (a) theoretical results on the behavior of fundamental models; (b) scalable algorithms for changing the behavior of these processes, e.g., for immunization or marketing; and (c) empirical studies of diffusion on blogs and on-line sites such as Twitter.
The problems we focus on are central in surprisingly diverse areas: computer science and engineering, epidemiology and public health, product marketing, and information dissemination. Our emphasis is on the intuition behind each topic and on guidelines for the practitioner.
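As a taste of the fundamental propagation models covered in part (a), the widely studied independent cascade model can be simulated in a few lines: each newly activated node gets one chance to activate each inactive neighbor with probability p. The graph and parameters below are illustrative only:

```python
import random

def independent_cascade(graph, seeds, p=0.3, rng=None):
    """Simulate the independent cascade model on an adjacency-list graph.
    Each newly active node gets a single chance to activate each
    inactive neighbor, succeeding with probability p."""
    rng = rng or random.Random(0)   # fixed seed for reproducibility
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for nbr in graph.get(node, []):
                if nbr not in active and rng.random() < p:
                    active.add(nbr)
                    next_frontier.append(nbr)
        frontier = next_frontier
    return active

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
reached = independent_cascade(graph, seeds={"a"}, p=1.0)
# With p = 1.0 every node reachable from the seed activates.
```

Questions like "which group should we market to" become: which seed set maximizes the expected size of `reached`, estimated by averaging many such simulations.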

B. Aditya Prakash (http://www.cs.cmu.edu/~badityap) is a Ph.D. student in the Computer Science Department at Carnegie Mellon University. He received his B.Tech. in Computer Science from the Indian Institute of Technology (IIT) Bombay. He has published 15 refereed papers in major venues and holds two U.S. patents. His interests include data mining, applied machine learning, and databases, with an emphasis on large real-world networks and time series.

Christos Faloutsos (http://www.cs.cmu.edu/~christos) is a Professor at Carnegie Mellon University. He is an ACM Fellow, he has published over 200 refereed articles and he has given over 30 tutorials in database and data mining venues. He has received the Presidential Young Investigator Award by NSF (1989), the Research Contributions Award in ICDM 2006, the SIGKDD Innovations Award (2010) and 18 "best paper" awards (including two "test of time" awards). His research interests include data mining for graphs and streams, fractals, database performance, and indexing for multimedia and bio-informatics data.

Mining Knowledge from Interconnected Data: A Heterogeneous Information Network Analysis Approach

Abstract: Most objects and data in the real world are interconnected, forming complex, heterogeneous, but often semi-structured information networks. However, most people consider a database merely as a data repository that supports data storage and retrieval, rather than as one or a set of heterogeneous information networks that contain rich, inter-related, multi-typed data and information. Most network science researchers study only homogeneous networks, without distinguishing the different types of objects and links in the networks. In this tutorial, we view databases and other interconnected data as heterogeneous information networks and study how to leverage the rich semantics of the types of objects and links in the networks. We systematically introduce technologies that can effectively and efficiently mine useful knowledge from such information networks.

Yizhou Sun is a Ph.D. candidate at the Department of Computer Science, University of Illinois at Urbana-Champaign. Her principal research interest is in mining information and social networks, and more generally in data mining, database systems, statistics, machine learning, information retrieval, and network science, with a focus on modeling novel problems and proposing scalable algorithms for large-scale, real-world applications. Yizhou has over 30 publications in book chapters, journals, and major conferences. Tutorials based on her thesis work on mining heterogeneous information networks have been given in several premier conferences, such as SIGMOD'10, SIGKDD'10 and ICDE'12.

Jiawei Han is the Bliss Professor of Computer Science at the University of Illinois at Urbana-Champaign. He has been researching data mining, information network analysis, database systems, and data warehousing, with over 600 journal and conference publications. He has chaired or served on many program committees of international conferences, including as PC co-chair for the KDD, SDM, and ICDM conferences and as Americas Coordinator for the VLDB conferences. He also served as the founding Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data and is serving as the Director of the Information Network Academic Research Center supported by the U.S. Army Research Lab. He is a Fellow of the ACM and IEEE, and received the 2004 ACM SIGKDD Innovations Award, the 2005 IEEE Computer Society Technical Achievement Award, the 2009 IEEE Computer Society Wallace McDowell Award, and the 2011 Daniel C. Drucker Eminent Faculty Award at UIUC. His book "Data Mining: Concepts and Techniques" is widely used as a textbook worldwide.

Xifeng Yan is an associate professor at the University of California at Santa Barbara. He holds the Venkatesh Narayanamurti Chair in Computer Science. He received his Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign in 2006 and was a research staff member at the IBM T. J. Watson Research Center between 2006 and 2008. He has been working on modeling, managing, and mining graphs in bioinformatics, social networks, information networks, and computer systems. His work has been cited extensively, with over 5,000 citations according to Google Scholar. He received the NSF CAREER Award, the IBM Invention Achievement Award, the ACM-SIGMOD Dissertation Runner-Up Award, and the IEEE ICDM 10-year Highest Impact Paper Award.

Philip S. Yu received his Ph.D. degree in E.E. from Stanford University. He is a Professor in Computer Science at the University of Illinois at Chicago and also holds the Wexler Chair in Information Technology.  Dr. Yu spent most of his career at IBM, where he was manager of the Software Tools and Techniques group at the Watson Research Center. His research interests include data mining, database and privacy. He has published more than 680 papers in refereed journals and conferences. He holds or has applied for more than 300 US patents. Dr. Yu is a Fellow of the ACM and the IEEE.  He is the Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data.  He was the Editor-in-Chief of IEEE Transactions on Knowledge and Data Engineering (2001-2004). He received a Research Contributions Award from IEEE Intl. Conference on Data Mining (2003).

Graph Synopses, Sketches, and Streams: A Survey

Abstract: Massive graphs arise in any application where there is data about both basic entities and the relationships between these entities, e.g., web pages and hyperlinks; neurons and synapses; papers and citations; IP addresses and network flows; people and their friendships. Graphs have also become the de facto standard for representing many types of highly structured data. However, the sheer size of many of these graphs renders classical algorithms inapplicable. In addition, existing algorithms are typically ill-suited to processing distributed or streaming data.

Various platforms have been developed for processing large data sets.
At the same time, there is a need to develop new algorithmic ideas and paradigms. In the case of graph processing, much recent work has focused on understanding the important algorithmic issues. A central aspect of this is the question of how to construct and leverage small-space synopses in graph processing. The goal of this tutorial is to survey recent work on this question and highlight interesting directions for future research.
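One simple example of the small-space flavor of such algorithms: connectivity queries over a stream of edges can be answered with a union-find structure whose space is proportional to the number of nodes, not the (possibly far larger) number of edges, none of which need to be stored. This sketch is illustrative and does not come from the tutorial itself:

```python
class StreamingConnectivity:
    """Answer connectivity queries over an edge stream in O(n) space
    using union-find; edges are processed once and discarded."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add_edge(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru != rv:
            self.parent[ru] = rv   # merge the two components

    def connected(self, u, v):
        return self.find(u) == self.find(v)

cc = StreamingConnectivity(5)
for u, v in [(0, 1), (1, 2), (3, 4)]:  # edges arrive one at a time
    cc.add_edge(u, v)
# cc.connected(0, 2) is True; cc.connected(0, 3) is False
```

Properties beyond connectivity (distances, cut sizes, triangle counts) generally require more sophisticated synopses, such as spanners and sketches, which are the subject of the tutorial.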

Sudipto Guha has been an Associate Professor in the Department of Computer and Information Sciences at the University of Pennsylvania since Fall 2001. He completed his Ph.D. in 2000 at Stanford University, working on approximation algorithms. He is a recipient of the NSF CAREER award (2007) and an Alfred P. Sloan Foundation fellowship.

Andrew McGregor is an Assistant Professor of Computer Science at the University of Massachusetts, Amherst. He received his Ph.D. from the University of Pennsylvania in 2007 and spent two years as a postdoctoral researcher, first at UC San Diego and then at Microsoft Research SVC. He was awarded the NSF CAREER Award in 2010 for his work on data streams and sketching.

Interoperability in eHealth Systems (invited tutorial)

Abstract: Interoperability in eHealth systems is important for delivering quality healthcare and reducing healthcare costs. Important use cases include coordinating the care of chronic patients by enabling the co-operation of many different eHealth systems, such as Electronic Health Record systems (EHRs), Personal Health Record systems (PHRs), and wireless medical sensor devices; enabling secondary use of EHRs for clinical research; and sharing lifelong EHRs among different healthcare providers. Although achieving eHealth interoperability is quite a challenge, both because there are competing standards and because clinical information itself is very complex, there have been a number of successful industry initiatives, such as the Integrating the Healthcare Enterprise (IHE) profiles, as well as large-scale deployments such as the National Health Information System of Turkey and the epSOS initiative for sharing Electronic Health Records and ePrescriptions in Europe.

This article briefly describes the subjects discussed in the VLDB 2012 tutorial: it provides an overview of the issues in eHealth interoperability, describes the key technologies and standards, identifies important use cases and the associated research challenges, and presents some of the large-scale deployments. The aim is to foster further interest in this area.

Asuman Dogac is the general manager of SRDC Ltd. She was a full professor in the Computer Engineering Department of the Middle East Technical University until January 2011. SRDC Ltd. is a spin-off of the SRDC research center at the Middle East Technical University.