
AMW DATA SCIENCE School 2017

The AMW DATA SCIENCE School is a two-day "Summer School" preceding the Alberto Mendelzon Workshop on Foundations of Data Management 2017, to be held in Montevideo, Uruguay. The event consists of multiple tutorials aimed at a mixed audience of students and other interested attendees.
Our goals
- Host tutorials targeted at students (advanced undergraduate or postgraduate level) and other early-career researchers interested in the area of Data Science;
- Provide a venue where young Latin American students and researchers can meet, discuss, learn, and seek feedback on their research topics, thus reinforcing research networks (of the future) in the area.
School Program
Monday 5th June

8:45-9:00 | Opening
9:00-12:00 | Tutorial 1: Knowledge Collection and Knowledge Cleaning: Challenges, Models, and Applications (Xin Luna Dong, Amazon). Download Presentation
12:00-14:00 | Lunch
14:00-17:00 | Tutorial 2: Communication Cost in Parallel Query Processing (Dan Suciu, University of Washington). Download Presentation
17:00-18:00 | Poster Session and Discussion

Tuesday 6th June

9:00-12:00 | Tutorial 3: Data, Responsibly (Julia Stoyanovich, Drexel University). Download Presentation
12:00-14:00 | Lunch
14:00-17:00 | Tutorial 4: Large Scale Distributed Data Science from scratch using Apache Spark 2.1+ (James G. Shanahan, Church and Duncan Group and University of California, Berkeley). Download Presentation
17:00-18:00 | Poster Session and Discussion
19:30-22:30 | AMW reception at Bar Tabaré
Tutorials

Tutorial 1: Knowledge Collection and Knowledge Cleaning: Challenges, Models, and Applications
Xin Luna Dong, Amazon.
Large-scale knowledge repositories are becoming increasingly important as a foundation for enabling a wide variety
of complex applications. In turn, building high-quality knowledge repositories critically depends on the technologies
of knowledge collection and knowledge cleaning, which share many similar goals with data integration, while facing even
more challenges in extracting knowledge from both structured and unstructured data, across a large variety of domains,
and in multiple languages. Our tutorial highlights the similarities and differences between knowledge management
and data integration, and has two goals. First, we introduce the Database community to the techniques proposed
for the problems of entity linkage and relation extraction by the Knowledge Management, Natural Language Processing,
and Machine Learning communities. Second, we give a detailed survey of the work done by these communities in knowledge
fusion, which is critical to discover and clean errors present in sources and the many mistakes made in the process of
knowledge extraction from sources. Our tutorial is example-driven and aims to build bridges between the Database
community and other disciplines to advance research in this important area.
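As a toy illustration of the knowledge-fusion problem the abstract describes (the function, the sample claims, and the accuracy weights below are hypothetical, not taken from the tutorial), conflicting facts from multiple extractors can be resolved by accuracy-weighted voting:

```python
from collections import defaultdict

def fuse(claims, source_accuracy):
    """For each (entity, attribute), pick the value with the highest total
    vote, where each source's vote is weighted by its estimated accuracy.
    A minimal stand-in for real knowledge-fusion models."""
    votes = defaultdict(float)
    for source, entity, attribute, value in claims:
        votes[(entity, attribute, value)] += source_accuracy[source]
    best = {}
    for (entity, attribute, value), score in votes.items():
        key = (entity, attribute)
        if key not in best or score > best[key][1]:
            best[key] = (value, score)
    return {k: v for k, (v, _) in best.items()}

claims = [
    ("web_extractor", "Montevideo", "country", "Uruguay"),
    ("ocr_extractor", "Montevideo", "country", "Paraguay"),  # extraction error
    ("curated_db",    "Montevideo", "country", "Uruguay"),
]
accuracy = {"web_extractor": 0.8, "ocr_extractor": 0.5, "curated_db": 0.95}
print(fuse(claims, accuracy))  # {('Montevideo', 'country'): 'Uruguay'}
```

Real systems replace the fixed accuracy weights with estimates learned jointly with the fused truth, but the voting structure is the same.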
Download Presentation
Speaker Bio
Xin Luna Dong is a Principal Scientist at Amazon, leading the efforts of constructing the Amazon Product Knowledge Graph. She was one of the major contributors to the Knowledge Vault project, and has led the Knowledge-based Trust project, which the Washington Post called the "Google Truth Machine". She has co-authored the book "Big Data Integration", published 65+ papers in top conferences and journals, and given 20+ keynotes, invited talks, and tutorials. She received the VLDB Early Career Research Contribution Award for advancing the state of the art of knowledge fusion, and the Best Demo Award at SIGMOD 2005. She is the PC co-chair for SIGMOD 2018 and WAIM 2015, and serves as an area chair for SIGMOD 2017, SIGMOD 2015, ICDE 2013, and CIKM 2011.

Tutorial 2: Communication Cost in Parallel Query Processing
Dan Suciu, University of Washington.
We consider the following problem: what is the amount of communication
required to compute a query in parallel on p servers, over a large
database instance? We define the Massively Parallel Communication
(MPC) model, where the computation proceeds in rounds consisting of
local computations followed by a global reshuffling of the data.
Servers have unlimited computational power and are allowed to exchange
any data; the only cost parameters are the number of rounds and the
maximum amount of communication per server. Surprisingly, any
multi-join query can be computed in a single communication round;
however, the price to pay is that the amount of data being reshuffled
exceeds the input data. I will describe tight bounds on the amount of
communication for the case of single-round algorithms on non-skewed
data, and discuss some partial results for multiple rounds and for
skewed data. Joint work with Paul Beame and Paris Koutris.
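The single-round reshuffle the abstract refers to can be sketched in miniature (the function below is an illustrative toy, not the tutorial's algorithm): to join R(a,b) with S(b,c) in one round, every tuple is shipped to the server chosen by hashing its join key, and each server then joins its partition locally. The per-server load, which the MPC model charges for, is driven by how evenly the keys hash, which is why skew matters.

```python
from collections import defaultdict

def hash_join_one_round(R, S, p):
    """One-round distributed join R(a,b) JOIN S(b,c) on p servers:
    route each tuple to server hash(b) % p, then join locally.
    Returns (results, max per-server load); skewed keys inflate the load."""
    servers = [([], []) for _ in range(p)]
    for a, b in R:                      # reshuffle R by join key b
        servers[hash(b) % p][0].append((a, b))
    for b, c in S:                      # reshuffle S by join key b
        servers[hash(b) % p][1].append((b, c))
    load = max(len(r) + len(s) for r, s in servers)
    out = []
    for r_part, s_part in servers:      # purely local joins
        for a, b in r_part:
            for b2, c in s_part:
                if b == b2:
                    out.append((a, b, c))
    return sorted(out), load

R = [(1, "x"), (2, "y")]
S = [("x", 10), ("x", 11), ("y", 20)]
result, load = hash_join_one_round(R, S, 4)
print(result)  # [(1, 'x', 10), (1, 'x', 11), (2, 'y', 20)]
```

For multi-join queries a single hash per key no longer suffices, which is where the HyperCube-style reshuffling analyzed in the tutorial comes in.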
Download Presentation
Speaker Bio
Dan Suciu is a Professor in Computer Science at the University of Washington. He received his Ph.D. from the University of Pennsylvania in 1995, was a principal member of the technical staff at AT&T Labs, and joined the University of Washington in 2000. Suciu conducts research in data management, with an emphasis on topics related to Big Data and data sharing, such as probabilistic data, data pricing, parallel data processing, and data security. He is a co-author of two books, Data on the Web: From Relations to Semistructured Data and XML (1999) and Probabilistic Databases (2011). He is a Fellow of the ACM, holds twelve US patents, received the Best Paper Award at SIGMOD 2000 and ICDT 2013, the ACM PODS Alberto Mendelzon Test of Time Award in 2010 and 2012, the 10 Year Most Influential Paper Award at ICDE 2013, and the VLDB Ten Year Best Paper Award in 2014, and is a recipient of the NSF CAREER Award and of an Alfred P. Sloan Fellowship. Suciu serves on the VLDB Board of Trustees, is an associate editor for the Journal of the ACM, VLDB Journal, ACM TWEB, and Information Systems, and is a past associate editor for ACM TODS and ACM TOIS. Suciu's PhD students Gerome Miklau, Christopher Re, and Paris Koutris received the ACM SIGMOD Best Dissertation Award in 2006, 2010, and 2016 respectively, and Nilesh Dalvi was a runner-up in 2008.

Tutorial 3: Data, Responsibly
Julia Stoyanovich, Drexel University.
Big Data technology holds incredible promise for improving people's lives,
accelerating scientific discovery and innovation, and bringing about
positive societal change. Yet, if not used responsibly, this technology
can deepen economic inequality, destabilize global markets, and reinforce
systemic bias. In this tutorial we will focus on the importance of using
Big Data technology responsibly – in a manner that adheres to the legal
requirements and ethical norms of our society.
We will define key properties, such as fairness, diversity, accountability,
and transparency. We will give examples of concrete situations,
many of which were covered in recent popular press, where reasoning
about and enforcing these properties is important.
We will then discuss potential modeling and algorithmic approaches
for quantifying and enforcing responsible practices,
using real datasets and application scenarios from criminal sentencing,
credit scoring, and homelessness services.
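One of the properties the tutorial names, fairness, is often quantified with simple rate comparisons across groups. As a hedged illustration (the function, metric choice, and sample data below are hypothetical, not drawn from the tutorial), here is the demographic-parity gap, the difference in positive-decision rates between groups:

```python
def demographic_parity_gap(decisions):
    """decisions: list of (group, decision) pairs with decision in {0, 1}.
    Returns the gap between the highest and lowest positive-decision
    rates across groups -- one simple (and contested) fairness measure."""
    rates = {}
    for g in {g for g, _ in decisions}:
        outcomes = [d for gg, d in decisions if gg == g]
        rates[g] = sum(outcomes) / len(outcomes)
    vals = sorted(rates.values())
    return vals[-1] - vals[0]

scored = [("A", 1), ("A", 1), ("A", 0), ("A", 1),   # group A: 75% positive
          ("B", 1), ("B", 0), ("B", 0), ("B", 0)]   # group B: 25% positive
print(demographic_parity_gap(scored))  # 0.5
```

A gap of 0 means both groups receive positive decisions at the same rate; the tutorial's point is precisely that choosing and enforcing such a measure is a modeling decision with legal and ethical stakes.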
Download Presentation
Speaker Bio
Julia Stoyanovich is an Assistant Professor of Computer Science at Drexel University. She was previously a postdoctoral researcher and a Computing Innovations Fellow at the University of Pennsylvania. Julia holds M.S. and Ph.D. degrees in Computer Science from Columbia University and a B.S. in Computer Science and Mathematics and Statistics from the University of Massachusetts at Amherst. Julia's research focuses on responsible data management and analysis practices, and on the management and analysis of preference data. She co-organized a Dagstuhl seminar "Data, Responsibly" in July 2016. Her work has been supported by the NSF, BSF and Google.

Tutorial 4: Large Scale Distributed Data Science from scratch using Apache Spark 2.1+
James G. Shanahan, Church and Duncan Group and University of California, Berkeley.
Apache Spark is an open-source cluster computing framework. It has emerged as the next-generation
big data processing engine, overtaking Hadoop MapReduce, which helped ignite the big data
revolution. Spark maintains MapReduce's linear scalability and fault tolerance but extends it in a few
important ways: it is much faster (up to 100 times faster for certain applications); it is much easier to program,
thanks to rich APIs in Python, Java, Scala, SQL, and R (MapReduce exposes only two core operations); and its core data
abstraction is the distributed data frame. In addition, it goes far beyond batch applications to support a
variety of compute-intensive tasks, including interactive queries, streaming, machine learning, and graph
processing.
This tutorial will provide an accessible introduction to large-scale distributed machine learning and data
mining, and to Spark and its potential to revolutionize academic and commercial data science practices. It
is divided into two parts. The first part will cover fundamental Spark concepts, including Spark Core,
functional programming in the MapReduce style, RDDs/data frames/datasets, the Spark Shell, Spark Streaming
and online learning, Spark SQL, MLlib, and more. The second part will focus on hands-on algorithmic
design and development with Spark, developing algorithms from scratch such as decision tree learning,
association rule mining (Apriori), graph processing algorithms such as PageRank and shortest path, gradient
descent algorithms such as support vector machines and matrix factorization, and deep learning. These
homegrown implementations will help shed some light on the internals of the MLlib libraries (and on
the difficulties of parallelizing some key machine learning algorithms). Industrial applications and
deployments of Spark will also be presented. Example code will be made available in Python (pySpark)
notebooks.
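To give a flavor of the MapReduce-style functional programming the tutorial covers, here is a local, pure-Python stand-in for the classic Spark word-count chain `rdd.flatMap(...).map(...).reduceByKey(...)` (this sketch is illustrative only and is not from the tutorial's notebooks; it runs without a Spark cluster):

```python
from collections import Counter

def word_count(lines):
    """Local emulation of the Spark pipeline:
    flatMap (split lines into words) -> map (word -> (word, 1))
    -> reduceByKey (sum counts per word)."""
    pairs = [(w, 1) for line in lines for w in line.split()]  # flatMap + map
    counts = Counter()
    for w, n in pairs:                                        # reduceByKey
        counts[w] += n
    return dict(counts)

lines = ["spark makes big data simple", "big data big compute"]
print(word_count(lines))
# {'spark': 1, 'makes': 1, 'big': 3, 'data': 2, 'simple': 1, 'compute': 1}
```

In actual pySpark the same logic distributes across a cluster because `reduceByKey` triggers a shuffle that groups each key onto one executor; the local version makes the data flow visible without that machinery.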
Download Presentation
Speaker Bio
Dr. James G. Shanahan has spent the past 25 years developing and researching cutting-edge artificial intelligent systems, splitting his time between industry and academia. He has (co-)founded several companies, including: Church and Duncan Group Inc. (2007), a boutique consultancy in large-scale AI which he runs in San Francisco; RTBFast (2012), a real-time bidding engine infrastructure play for digital advertising systems; and Document Souls (1999), a document-centric anticipatory information system. In 2012 he went in-house as the SVP of Data Science and Chief Scientist at NativeX, a mobile ad network that was acquired by MobVista in early 2016. In addition, he has held appointments at AT&T (Executive Director of Research), Turn Inc. (founding chief scientist), Xerox Research, Mitsubishi Research, and Clairvoyance Corp (a spinoff research lab from CMU). He also advises several high-tech startups (including Quixey, Aylien, ChartBoost, DigitalBank, VoxEdu, and others). Dr. Shanahan has been affiliated with the University of California at Berkeley (and Santa Cruz) since 2008, where he teaches graduate courses on big data analytics, machine learning, deep learning, and stochastic optimization. In addition, he is currently a visiting professor of data science at the University of Ghent, Belgium. He has published six books, more than 50 research publications, and over 20 patents in the areas of machine learning and information processing. Dr. Shanahan received his PhD in engineering mathematics from the University of Bristol, U.K., and holds a Bachelor of Science degree from the University of Limerick, Ireland. He is an EU Marie Curie Fellow. In 2011 he was selected as a member of the Silicon Valley 50 (Top 50 Irish Americans in Technology).