
AMW DATA SCIENCE School 2017

The AMW DATA SCIENCE School is a two-day "Summer School" preceding the Alberto Mendelzon Workshop on Foundations of Data Management 2017, to be held in Montevideo, Uruguay. The event consists of multiple tutorials aimed at a mixed audience of students and other interested attendees.
Our goals
- Host tutorials targeted at students (advanced undergraduate or postgraduate level) and other early-career researchers interested in the area of Data Science;
- Provide a venue where young Latin American students and researchers can meet, discuss, learn, and seek feedback on their research topics, thus reinforcing research networks (of the future) in the area.
School Program
Monday 5th June

8:45-9:00 | Opening
9:00-12:00 | Tutorial 1: Knowledge Collection and Knowledge Cleaning: Challenges, Models, and Applications (Xin Luna Dong, Amazon). Download Presentation
12:00-14:00 | Lunch
14:00-17:00 | Tutorial 2: Communication Cost in Parallel Query Processing (Dan Suciu, University of Washington). Download Presentation
17:00-18:00 | Poster Session and Discussion

Tuesday 6th June

9:00-12:00 | Tutorial 3: Data, Responsibly (Julia Stoyanovich, Drexel University). Download Presentation
12:00-14:00 | Lunch
14:00-17:00 | Tutorial 4: Large Scale Distributed Data Science from scratch using Apache Spark 2.1+ (James G. Shanahan, Church and Duncan Group and University of California, Berkeley). Download Presentation
17:00-18:00 | Poster Session and Discussion
19:30-22:30 | AMW reception at Bar Tabaré
Tutorials

Tutorial 1: Knowledge Collection and Knowledge Cleaning: Challenges, Models, and Applications
Xin Luna Dong, Amazon.
Large-scale knowledge repositories are becoming increasingly important as a foundation for enabling a wide variety
of complex applications. In turn, building high-quality knowledge repositories critically depends on the technologies
of knowledge collection and knowledge cleaning, which share many similar goals with data integration, while facing even
more challenges in extracting knowledge from both structured and unstructured data, across a large variety of domains,
and in multiple languages. Our tutorial highlights the similarities and differences between knowledge management
and data integration, and has two goals. First, we introduce the Database community to the techniques proposed
for the problems of entity linkage and relation extraction by the Knowledge Management, Natural Language Processing,
and Machine Learning communities. Second, we give a detailed survey of the work done by these communities in knowledge
fusion, which is critical to discover and clean errors present in sources and the many mistakes made in the process of
knowledge extraction from sources. Our tutorial is example-driven and aims to build bridges between the Database
community and other disciplines to advance research in this important area.
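As a toy illustration of the knowledge-fusion problem the abstract describes (the function, the sample claims, and the accuracy weights below are hypothetical, not taken from the tutorial), conflicting facts from multiple extractors can be resolved by accuracy-weighted voting:

```python
from collections import defaultdict

def fuse(claims, source_accuracy):
    """For each (entity, attribute), pick the value with the highest total
    vote, where each source's vote is weighted by its estimated accuracy.
    A minimal stand-in for real knowledge-fusion models."""
    votes = defaultdict(float)
    for source, entity, attribute, value in claims:
        votes[(entity, attribute, value)] += source_accuracy[source]
    best = {}
    for (entity, attribute, value), score in votes.items():
        key = (entity, attribute)
        if key not in best or score > best[key][1]:
            best[key] = (value, score)
    return {k: v for k, (v, _) in best.items()}

claims = [
    ("web_extractor", "Montevideo", "country", "Uruguay"),
    ("ocr_extractor", "Montevideo", "country", "Paraguay"),  # extraction error
    ("curated_db",    "Montevideo", "country", "Uruguay"),
]
accuracy = {"web_extractor": 0.8, "ocr_extractor": 0.5, "curated_db": 0.95}
print(fuse(claims, accuracy))  # {('Montevideo', 'country'): 'Uruguay'}
```

Real systems replace the fixed accuracy weights with estimates learned jointly with the fused truth, but the voting structure is the same.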
Download Presentation
Speaker Bio
Xin Luna Dong is a Principal Scientist at Amazon, leading the efforts of constructing the Amazon Product Knowledge Graph. She was one of the major contributors to the Knowledge Vault project, and has led the Knowledge-based Trust project, which the Washington Post called the "Google Truth Machine". She has co-authored the book "Big Data Integration", published 65+ papers in top conferences and journals, and given 20+ keynotes, invited talks, and tutorials. She received the VLDB Early Career Research Contribution Award for advancing the state of the art of knowledge fusion, and the Best Demo Award at SIGMOD 2005. She is the PC co-chair for SIGMOD 2018 and WAIM 2015, and serves as an area chair for SIGMOD 2017, SIGMOD 2015, ICDE 2013, and CIKM 2011.

Tutorial 2: Communication Cost in Parallel Query Processing
Dan Suciu, University of Washington.
We consider the following problem: what is the amount of communication
required to compute a query in parallel on p servers, over a large
database instance? We define the Massively Parallel Communication
(MPC) model, where the computation proceeds in rounds consisting of
local computations followed by a global reshuffling of the data.
Servers have unlimited computational power and are allowed to exchange
any data; the only cost parameters are the number of rounds and the
maximum amount of communication per server. Surprisingly, any
multi-join query can be computed in a single communication round;
however, the price to pay is that the amount of data being reshuffled
exceeds the input data. I will describe tight bounds on the amount of
communication for the case of single-round algorithms on non-skewed
data, and discuss some partial results for multiple rounds and for
skewed data. Joint work with Paul Beame and Paris Koutris.
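The single-round reshuffle the abstract refers to can be sketched in miniature (the function below is an illustrative toy, not the tutorial's algorithm): to join R(a,b) with S(b,c) in one round, every tuple is shipped to the server chosen by hashing its join key, and each server then joins its partition locally. The per-server load, which the MPC model charges for, is driven by how evenly the keys hash, which is why skew matters.

```python
from collections import defaultdict

def hash_join_one_round(R, S, p):
    """One-round distributed join R(a,b) JOIN S(b,c) on p servers:
    route each tuple to server hash(b) % p, then join locally.
    Returns (results, max per-server load); skewed keys inflate the load."""
    servers = [([], []) for _ in range(p)]
    for a, b in R:                      # reshuffle R by join key b
        servers[hash(b) % p][0].append((a, b))
    for b, c in S:                      # reshuffle S by join key b
        servers[hash(b) % p][1].append((b, c))
    load = max(len(r) + len(s) for r, s in servers)
    out = []
    for r_part, s_part in servers:      # purely local joins
        for a, b in r_part:
            for b2, c in s_part:
                if b == b2:
                    out.append((a, b, c))
    return sorted(out), load

R = [(1, "x"), (2, "y")]
S = [("x", 10), ("x", 11), ("y", 20)]
result, load = hash_join_one_round(R, S, 4)
print(result)  # [(1, 'x', 10), (1, 'x', 11), (2, 'y', 20)]
```

For multi-join queries a single hash per key no longer suffices, which is where the HyperCube-style reshuffling analyzed in the tutorial comes in.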
Download Presentation
Speaker Bio
Dan Suciu is a Professor in Computer Science at the University of Washington. He received his Ph.D. from the University of Pennsylvania in 1995, was a principal member of the technical staff at AT&T Labs, and joined the University of Washington in 2000. Suciu conducts research in data management, with an emphasis on topics related to Big Data and data sharing, such as probabilistic data, data pricing, parallel data processing, and data security. He is a co-author of two books, Data on the Web: From Relations to Semistructured Data and XML (1999) and Probabilistic Databases (2011). He is a Fellow of the ACM, holds twelve US patents, received the Best Paper Award at SIGMOD 2000 and ICDT 2013, the ACM PODS Alberto Mendelzon Test of Time Award in 2010 and 2012, the 10 Year Most Influential Paper Award at ICDE 2013, and the VLDB Ten Year Best Paper Award in 2014, and is a recipient of the NSF CAREER Award and of an Alfred P. Sloan Fellowship. Suciu serves on the VLDB Board of Trustees, is an associate editor for the Journal of the ACM, VLDB Journal, ACM TWEB, and Information Systems, and is a past associate editor for ACM TODS and ACM TOIS. Suciu's PhD students Gerome Miklau, Christopher Re, and Paris Koutris received the ACM SIGMOD Best Dissertation Award in 2006, 2010, and 2016 respectively, and Nilesh Dalvi was a runner-up in 2008.

Tutorial 3: Data, Responsibly
Julia Stoyanovich, Drexel University.
Big Data technology holds incredible promise for improving people's lives,
accelerating scientific discovery and innovation, and bringing about
positive societal change. Yet, if not used responsibly, this technology
can deepen economic inequality, destabilize global markets, and reinforce
systemic bias. In this tutorial we will focus on the importance of using
Big Data technology responsibly – in a manner that adheres to the legal
requirements and ethical norms of our society.
We will define key properties, such as fairness, diversity, accountability,
and transparency. We will give examples of concrete situations,
many of which were covered in recent popular press, where reasoning
about and enforcing these properties is important.
We will then discuss potential modeling and algorithmic approaches
for quantifying and enforcing responsible practices,
using real datasets and application scenarios from criminal sentencing,
credit scoring, and homelessness services.
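One of the properties the tutorial names, fairness, is often quantified with simple rate comparisons across groups. As a hedged illustration (the function, metric choice, and sample data below are hypothetical, not drawn from the tutorial), here is the demographic-parity gap, the difference in positive-decision rates between groups:

```python
def demographic_parity_gap(decisions):
    """decisions: list of (group, decision) pairs with decision in {0, 1}.
    Returns the gap between the highest and lowest positive-decision
    rates across groups -- one simple (and contested) fairness measure."""
    rates = {}
    for g in {g for g, _ in decisions}:
        outcomes = [d for gg, d in decisions if gg == g]
        rates[g] = sum(outcomes) / len(outcomes)
    vals = sorted(rates.values())
    return vals[-1] - vals[0]

scored = [("A", 1), ("A", 1), ("A", 0), ("A", 1),   # group A: 75% positive
          ("B", 1), ("B", 0), ("B", 0), ("B", 0)]   # group B: 25% positive
print(demographic_parity_gap(scored))  # 0.5
```

A gap of 0 means both groups receive positive decisions at the same rate; the tutorial's point is precisely that choosing and enforcing such a measure is a modeling decision with legal and ethical stakes.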
Download Presentation
Speaker Bio
Julia Stoyanovich is an Assistant Professor of Computer Science at Drexel University. She was previously a postdoctoral researcher and a Computing Innovations Fellow at the University of Pennsylvania. Julia holds M.S. and Ph.D. degrees in Computer Science from Columbia University and a B.S. in Computer Science and Mathematics and Statistics from the University of Massachusetts at Amherst. Julia's research focuses on responsible data management and analysis practices, and on the management and analysis of preference data. She co-organized a Dagstuhl seminar "Data, Responsibly" in July 2016. Her work has been supported by the NSF, BSF and Google.

Tutorial 4: Large Scale Distributed Data Science from scratch using Apache Spark 2.1+
James G. Shanahan, Church and Duncan Group and University of California, Berkeley.
Apache Spark is an open-source cluster computing framework. It has emerged as the next-generation
big data processing engine, overtaking Hadoop MapReduce, which helped ignite the big data
revolution. Spark maintains MapReduce's linear scalability and fault tolerance but extends it in a few
important ways: it is much faster (up to 100 times faster for certain applications); it is much easier to program,
thanks to rich APIs in Python, Java, Scala, SQL, and R (MapReduce exposes only two core operations); and its core data
abstraction is the distributed data frame. In addition, it goes far beyond batch applications to support a
variety of compute-intensive tasks, including interactive queries, streaming, machine learning, and graph
processing.
This tutorial will provide an accessible introduction to large-scale distributed machine learning and data
mining, and to Spark and its potential to revolutionize academic and commercial data science practices. It
is divided into two parts. The first part will cover fundamental Spark concepts, including Spark Core,
functional programming in the MapReduce style, RDDs/data frames/datasets, the Spark Shell, Spark Streaming
and online learning, Spark SQL, MLlib, and more. The second part will focus on hands-on algorithmic
design and development with Spark, developing algorithms from scratch such as decision tree learning,
association rule mining (Apriori), graph processing algorithms such as PageRank and shortest path, gradient
descent algorithms such as support vector machines and matrix factorization, and deep learning. These
homegrown implementations will help shed some light on the internals of the MLlib libraries (and on
the difficulties of parallelizing some key machine learning algorithms). Industrial applications and
deployments of Spark will also be presented. Example code will be made available in Python (pySpark)
notebooks.
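To give a flavor of the MapReduce-style functional programming the tutorial covers, here is a local, pure-Python stand-in for the classic Spark word-count chain `rdd.flatMap(...).map(...).reduceByKey(...)` (this sketch is illustrative only and is not from the tutorial's notebooks; it runs without a Spark cluster):

```python
from collections import Counter

def word_count(lines):
    """Local emulation of the Spark pipeline:
    flatMap (split lines into words) -> map (word -> (word, 1))
    -> reduceByKey (sum counts per word)."""
    pairs = [(w, 1) for line in lines for w in line.split()]  # flatMap + map
    counts = Counter()
    for w, n in pairs:                                        # reduceByKey
        counts[w] += n
    return dict(counts)

lines = ["spark makes big data simple", "big data big compute"]
print(word_count(lines))
# {'spark': 1, 'makes': 1, 'big': 3, 'data': 2, 'simple': 1, 'compute': 1}
```

In actual pySpark the same logic distributes across a cluster because `reduceByKey` triggers a shuffle that groups each key onto one executor; the local version makes the data flow visible without that machinery.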
Download Presentation
Speaker Bio
Dr. James G. Shanahan has spent the past 25 years developing and researching cutting-edge artificial intelligent systems, splitting his time between industry and academia. He has (co-)founded several companies, including: Church and Duncan Group Inc. (2007), a boutique consultancy in large-scale AI which he runs in San Francisco; RTBFast (2012), a real-time bidding engine infrastructure play for digital advertising systems; and Document Souls (1999), a document-centric anticipatory information system. In 2012 he went in-house as the SVP of Data Science and Chief Scientist at NativeX, a mobile ad network that was acquired by MobVista in early 2016. In addition, he has held appointments at AT&T (Executive Director of Research), Turn Inc. (founding chief scientist), Xerox Research, Mitsubishi Research, and Clairvoyance Corp (a spinoff research lab from CMU). He also advises several high-tech startups (including Quixey, Aylien, ChartBoost, DigitalBank, VoxEdu, and others). Dr. Shanahan has been affiliated with the University of California at Berkeley (and Santa Cruz) since 2008, where he teaches graduate courses on big data analytics, machine learning, deep learning, and stochastic optimization. In addition, he is currently a visiting professor of data science at the University of Ghent, Belgium. He has published six books, more than 50 research publications, and over 20 patents in the areas of machine learning and information processing. Dr. Shanahan received his PhD in engineering mathematics from the University of Bristol, U.K., and holds a Bachelor of Science degree from the University of Limerick, Ireland. He is an EU Marie Curie Fellow. In 2011 he was selected as a member of the Silicon Valley 50 (Top 50 Irish Americans in Technology).