XLDB - Extremely Large Databases

XLDB-2015 Abstracts

Accelerating Deep Learning at Facebook
Keith Adams, Engineer, Facebook AI Research

Facebook AI Research (FAIR) is exploring the boundaries of machine understanding of images, video, acoustics, natural language, and structured information. Many recent advances in these areas are due to so-called "deep learning", a grab bag of techniques and models that share the goal of learning intermediate representations of data instead of engineering them. Constructing successful deep learning models often involves huge training datasets, large models, and compute times measured in days, weeks, or months. This talk will explore FAIR infrastructure's past and future efforts to accelerate these training cycles, from hardware, to software, to optimization.

DataSF: Open Data Initiatives in the City of San Francisco
Joy Bonaguro, Chief Data Officer, City and County of San Francisco.

Open data initiatives worldwide are improving access to urban data, including housing information, 311 calls, incident reports, transportation data, and much more, with the promise of delivering increased quality of life, more efficient government services, better decisions, and new businesses and services. DataSF's mission is to enable use of San Francisco data. Our core product is SF OpenData, our official open data portal. Launched in 2009, the data portal contains hundreds of city datasets for use by developers, analysts, residents, and more.

And we’re extending the open data approach beyond the simple publication of datasets to using open data as a means to reveal and unpack complex areas such as affordable housing, homelessness, and economic development. Many areas of government service touch multiple departments through multiple programs.

In this talk, I'll describe our initiatives around SF OpenData, as well as training programs such as Data Academy, which helps increase internal City capacity to effectively use our data. I'll also describe where new technology can help us improve the use of these resources.

Kubernetes and the Path to Cloud Native
Eric Brewer, Vice President of Infrastructure at Google

We are in the midst of an important shift to higher levels of abstraction than virtual machines. Kubernetes aims to simplify the deployment and management of services, including the construction of applications as sets of interacting but independent services. We explain some of the key concepts in Kubernetes and show how they work together to simplify evolution and scaling.

Critical Technologies Necessary for Big Data Exploitation
Steve Brobst, Chief Technology Officer, Teradata Corporation

The era of Big Data presents exciting opportunities for leveraging analytics to extract value from previously unexploited data sets. However, new tools and technologies will be required to realize value from these next-generation data sources. In this talk we will examine emerging technologies and new requirements for full exploitation of Big Data.

Kurt Brown, Netflix

At Netflix, we've created a highly effective data platform. In this talk, I'll dive into the key facets underlying it, including technology choices, building vs. buying, using and contributing to open source software, leveraging the cloud, staffing, and some of our other (non-traditional) approaches and philosophy.

Paul Brown, Paradigm4

In addition to the novel analytic features we anticipate being necessary for modern, scientific ‘Big Data’, it is useful to survey the requirements that distinguished successful analytic platforms in the past. These included highly abstracted programmatic interfaces, transactional quality-of-service guarantees, and the central importance of data quality and metadata management.

ROOT: a data storage and analysis framework
Rene Brun, CERN

The ROOT system has been developed at CERN in Geneva since 1995. Implemented in C++ and ported to most operating systems, it is today the main tool used worldwide in High Energy and Nuclear Physics to store and analyze tens of petabytes of data (soon exabytes). The system provides a large set of libraries, a C++11-compliant interpreter, and a Python front-end. ROOT supports object-wise and member-wise streaming for any C++ objects or collections, along with the corresponding querying mechanism. Interfaces to popular relational databases are provided. A rich set of visualization tools for 1-d, 2-d, and N-d distributions is included. Multivariate and popular statistical analysis classes are available.

Big Data Storage: Should We Pop the (Software) Stack?
Michael J. Carey, UC Irvine

The 1980's were a turbulent time for Big Data technology - also known, back then, as parallel database system technology. Shared-{everything, disk, nothing} battles were fought in conferences, benchmarks, and in the marketplace. Horizontally partitioned, database-managed storage is what most war veterans would say eventually won the day. Fast forward to the 2000's, and we have another turbulent time for Big Data technology - a.k.a. web-scale data indexing and analysis. Large-scale distributed file systems won the day in that world, bringing us the new enterprise Big Data storage: HDFS and friends. Most current Big Data analytics software stacks, and even some of the NoSQL key-value stacks, now sit on top of this layer. Big Data management technologies appear to be on a collision course, complicated by the Cloud, and it isn't clear, at least to this speaker, what's going to happen (or what should happen, or how related those two thoughts are). This talk will examine where we've been, where we are, and the approach that AsterixDB, a Big Data Management System co-developed by researchers at UC Irvine and UC Riverside, is taking to hedge its bets with respect to scalable storage management.

R in the World: Interface between Languages
John M. Chambers, Stanford University

In the nearly forty-year-long evolution of the current R software, the data that we try to understand, and the statistical tools available to do so, have seen a huge expansion. R has contributed, most importantly as the vehicle for communicating new tools in an open, immediately usable form. While much has changed, the motivations that shaped the original design remain relevant, with implications for future directions.

ElasticR: Connecting the dots of scientific computing, from the Pi to the clouds
Karim Chine

ElasticR is a virtual data science platform enabling everyone to use cloud computing seamlessly and work with R, Python, and many other scientific computing tools in a productive, collaborative, and agile way. ElasticR transparently takes care of security, resource creation and management, connectivity, and sharing, and helps federate and leverage all available compute facilities, be they supercomputers, public or private clouds, clusters, PCs, or cheap computing devices.

The platform makes everything programmable, traceable, and reproducible in a consistent manner. It includes an innovative framework for the design and publication of highly dynamic, scriptable science gateways that anyone can use without prior IT knowledge. With ElasticR, the R language for programming with data also becomes a tool for architecting, deploying, and sharing data science infrastructures, applications, and services.

ElasticR aims to make the interaction between scientists and compute technologies and tools easier than ever. By reducing the friction that limits researchers' productivity and their ability to access and analyse data and share computational artifacts, and by embedding real-time collaboration and social networking as core features from the ground up, the new platform lays the groundwork for a new data science-centric ecosystem.

The presentation will be an overview of the concepts and architectures behind ElasticR and will include a live demo of key use cases.

Scalability and AI
Adam Coates

Deep learning algorithms are driving progress in AI technologies including speech recognition, computer vision, and natural language processing. Behind these many success stories is a remarkable growth in the scale of deep learning systems, powered by high performance computing and increasing digitization of the world's data. I will give an overview of deep learning technology and how a focus on scalable learning systems is enabling rapid advances in AI, followed by a dive into recent work on state-of-the-art speech recognition at Baidu.

Enabling Low Friction Sharing, Discovery and Analysis of Heterogeneous Civic Data
Deep Dhillon, CTO Socrata

Data-rich governments, multilateral institutions, and non-profit organizations increasingly publish data to the public. Machine learning and rule-driven dataset annotations are used to fuse external knowledge, facilitate data discovery, and generate enhanced user experiences. Construction of efficient, fault-tolerant, distributed specialized indices enables rapid and flexible querying. Civic data, heterogeneous in nature, mostly tabular, often geo-spatial, and almost always increasing in size, presents enormous opportunities to empower citizens, improve government efficiency, and ultimately, strengthen democracy.

Big Data Analytics in the Utilities Industry (a background article)
Timotej Gavrilovic, Demand Side Analytics Team at PG&E

This talk presents an overview of data analytics for demand-side initiatives at PG&E - where PG&E is today and where it is heading in regard to big data analytics. We will provide practical examples of using analytics to drive business performance - actual examples from PG&E, within data governance and confidentiality restrictions.

Robert Gentleman

Chris Holcombe

A quick survey of the trials and tribulations regarding deploying and scaling distributed storage systems.

Apache Calcite: One planner fits all
Julian Hyde

Big Data Analytics in the Utilities Industry (a background article)
Colin Kerrigan, Demand Side Analytics Team at PG&E

This talk presents an overview of data analytics for demand-side initiatives at PG&E - where PG&E is today and where it is heading in regard to big data analytics. We will provide practical examples of using analytics to drive business performance - actual examples from PG&E, within data governance and confidentiality restrictions.

From Walled Kingdom to Toolbox
Hannes Muehleisen, Researcher at Centrum Wiskunde & Informatica (CWI), Amsterdam

Databases today appear as isolated kingdoms, inaccessible, with a unique culture and strange languages. To benefit from our field, we expect data and analysis to be brought inside these kingdoms. Meanwhile, actual analysis takes place in more flexible, specialised environments such as Python or R. There, the same data management problems reappear, and are solved by re-inventing core database concepts. We must work towards making our hard-earned results more accessible, by supporting (and re-interpreting) their languages, by opening up internals and by allowing seamless transitions between contexts.
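The "toolbox" view the abstract argues for can be made concrete with an in-process engine, where the database runs inside the analysis environment rather than behind a kingdom wall. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for the embeddable engines the talk alludes to; the table and data are invented for illustration.

```python
import sqlite3

# In-process database: the engine lives inside the analysis process,
# so there is no separate server to ship data to and from.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (city TEXT, fare REAL)")
con.executemany("INSERT INTO trips VALUES (?, ?)",
                [("NYC", 12.5), ("NYC", 7.0), ("SF", 20.0)])

# Declarative aggregation, with results landing directly in
# ordinary Python objects for further analysis.
rows = con.execute(
    "SELECT city, AVG(fare) FROM trips GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('NYC', 9.75), ('SF', 20.0)]
```

The point is the seamless transition between contexts: the same process holds both the SQL engine and the host-language data structures, so core database concepts need not be re-invented in the analysis layer.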

The Sentient Enterprise: 5 Stages to Advanced Analytics Success
Oliver Ratzesberger, Senior Vice President-Software, Teradata Labs

Oliver Ratzesberger believes that the continued explosion of data and the continued evolution of analytics capabilities will usher in the next analytics revolution beyond the Intelligent Enterprise. He will paint a provocative picture of the evolution of analytics capabilities towards an ideal state called ‘The Sentient Enterprise’.

The Sentient Enterprise is an enterprise that can listen to data, conduct analysis and make autonomous decisions at massive scale in real-time. The Sentient Enterprise can listen to data to sense micro-trends. It can act as one organism without being impeded by information silos. It can make autonomous decisions with little or no human intervention. It is always evolving, with emergent intelligence that becomes progressively more sophisticated.

Oliver will present the evolutionary journey of analytics capabilities that begins with today’s Agile Data Warehouse and culminates in the Sentient Enterprise. He will share a simple maturity framework that has five foundational qualities and the five stages to assess your organization’s maturity and progress along this journey as well as “next practices” that IT can harness to unlock the full potential of big data and analytics.

On the Practice of Predictive Modeling with Big Data: The Extra Steps that Make the Difference
Nachum Shacham, Principal Data Scientist at PayPal

Beyond the selection of a modeling environment (ML platform, algorithms, and programming framework), data scientists practicing big-data predictive modeling must adjust data, algorithms, and operations to the specific conditions of the problem at hand. Such adjustments address, among other conditions, unbalanced and shallow datasets, missing values and outliers, feature engineering and selection, ensemble composition, processing speed, and results validation. These steps can strongly impact the success of the model. We will review alternatives and tradeoffs and draw from our experience in applying best practices while building models on the H2O platform using its R API.
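As one concrete illustration of the kind of adjustment mentioned above, the sketch below shows naive random oversampling for unbalanced data, using only the Python standard library. It is not PayPal's or H2O's implementation; the function, labels, and data are invented for the example, and real pipelines often prefer class weights or synthetic sampling.

```python
import random
from collections import Counter

def oversample_minority(rows, label_of, seed=0):
    """Randomly duplicate rows of under-represented classes until
    every class matches the majority-class count."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Draw random duplicates to close the gap to the majority count.
        balanced.extend(rng.choice(members)
                        for _ in range(target - len(members)))
    rng.shuffle(balanced)
    return balanced

# A 90/10 class skew, rebalanced to 90/90.
data = [("txn", 1)] * 90 + [("txn", 0)] * 10
balanced = oversample_minority(data, label_of=lambda r: r[1])
print(Counter(r[1] for r in balanced))  # both classes now have 90 rows
```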

How To Create the Google for Earth Data
Rainer Sternfeld, CEO Planet OS

SeeDB - Towards Automatic Visualization of Query Results
Manasi Vartak, MIT Database Group

Data analysts operating on large volumes of data often rely on visualizations to interpret the results of queries. However, finding the right visualization given a query of interest is a laborious and time-consuming task. In this talk, I will present an overview of SeeDB, a visualization recommendation system being developed at MIT. For a user query, SeeDB intelligently explores the space of possible visualizations, evaluates promising visualizations along various dimensions of interest, and recommends the most "interesting" or "useful" visualizations to the analyst. I will discuss the techniques and architecture underlying SeeDB, performance results, and preliminary user studies of SeeDB.
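Published descriptions of SeeDB rank candidate visualizations by how much the queried subset's aggregate distribution deviates from a reference distribution over the whole dataset. The Python sketch below is a simplified rendering of that deviation-based idea using KL divergence; the function names, view names, and data are illustrative, not SeeDB's actual API.

```python
import math

def normalize(agg):
    """Turn raw aggregate values into a probability distribution."""
    total = sum(agg.values())
    return {k: v / total for k, v in agg.items()}

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over the union of keys, smoothing zeros with eps."""
    keys = set(p) | set(q)
    total = 0.0
    for k in keys:
        pk = max(p.get(k, 0.0), eps)
        qk = max(q.get(k, 0.0), eps)
        total += pk * math.log(pk / qk)
    return total

def rank_views(target_aggs, reference_aggs):
    """Order candidate views by how strongly the target subset's
    distribution deviates from the reference's."""
    scores = {view: kl_divergence(normalize(target_aggs[view]),
                                  normalize(reference_aggs[view]))
              for view in target_aggs}
    return sorted(scores, key=scores.get, reverse=True)

# The region breakdown is skewed in the target subset but not in the
# reference, so it is the more "interesting" view to recommend.
target = {"sales_by_region": {"east": 9.0, "west": 1.0},
          "sales_by_month": {"jan": 5.0, "feb": 5.0}}
reference = {"sales_by_region": {"east": 5.0, "west": 5.0},
             "sales_by_month": {"jan": 5.0, "feb": 5.0}}
ranked = rank_views(target, reference)
print(ranked[0])  # sales_by_region
```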

William Vambenepe, Lead Product Manager for Big Data on Google Cloud Platform

MapReduce was a major step in enabling large-scale data processing, but a lot has happened since. We'll examine progress, inside Google and outside, in providing more convenient and efficient ways to parallelize data processing, as well as drastically reduce the latency of data processing via a streaming-first approach. We'll see how these innovations fit in the larger context of Big Data processing tools to support innovative data-centric applications for a large community of developers.

There’s no data like more data
Theo Vassilakis, Founder & CEO of Metanautix

Is the “Data Lake” the new EDW? Is Cloud? Is Hadoop? Warehousing may not win over federation and virtualization this time - or a new balance may be struck. As economies of scale drive down cloud costs and networks get radically better at moving big data, the tyranny of data locality is eroding. Systems like Google’s Dremel showed the way towards always-on pure analytical services that lower TCO by decoupling analysis from storage. Companies like Metanautix are taking advantage of compute capabilities that are now widely dispersed - on clouds, on-prem, or on small devices - to start wresting control of computation from the ever-changing storage layer. That compute power is now in the hands of a much broader population than the traditional centralized groups. Organizations are driven by business imperatives to bridge data silos not just within their walls, but up and down the data supply chain with their vendors and partners. Traditional RDBMSs, Excel spreadsheets, and JSON are being joined by rich media such as video, audio, and images, from entertainment to gas operations. This talk will focus on how adopting standards, using declarative computing, and designing for hybrid environments will help achieve speed, visibility, and generality in the bewilderingly complex and expanding world of enterprise analytics.

Visual Exploration of Big Urban Data
Huy Vo, Research Scientist, NYU-CUSP

About half of humanity lives in urban environments today and that number will grow to 80% by the middle of this century. Cities are thus the loci of resource consumption, of economic activity, and of innovation; they are the cause of our looming sustainability problems but also where those problems must be solved. Data, along with visualization and analytics, can help significantly in finding these solutions. This talk will discuss the challenges of visual exploration of big urban data and showcase two studies of New York City's taxi trips and building information. We work collaboratively with domain experts to design visual analytic frameworks that integrate multiple data layers and spatial analysis techniques to facilitate their decision-making process. To tackle the challenge of big data, we have combined techniques from visualization and databases with novel indexing schemes for spatio-temporal datasets. Much of this work involves 3D data and must operate in real time.
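To make the spatial-indexing point concrete, here is a minimal sketch of a uniform-grid index in Python. It illustrates the general technique of bucketing points by cell so a bounding-box query touches only overlapping cells; it is not the scheme used at NYU-CUSP, and all names and coordinates are invented.

```python
from collections import defaultdict

class GridIndex:
    """Uniform-grid spatial index: points are bucketed by cell, and a
    bounding-box query scans only the overlapping cells rather than
    every point."""
    def __init__(self, cell_size):
        self.cell = cell_size
        self.buckets = defaultdict(list)

    def _key(self, x, y):
        # Integer cell coordinates for a point.
        return (int(x // self.cell), int(y // self.cell))

    def insert(self, x, y, payload):
        self.buckets[self._key(x, y)].append((x, y, payload))

    def query(self, xmin, ymin, xmax, ymax):
        ix0, iy0 = self._key(xmin, ymin)
        ix1, iy1 = self._key(xmax, ymax)
        hits = []
        for ix in range(ix0, ix1 + 1):
            for iy in range(iy0, iy1 + 1):
                for x, y, payload in self.buckets.get((ix, iy), []):
                    # Cells overlap the box; filter exact matches.
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append(payload)
        return hits

# Taxi pickups as (lon, lat, trip id); query a small bounding box.
index = GridIndex(cell_size=0.01)
index.insert(-73.985, 40.748, "trip-1")   # Midtown Manhattan
index.insert(-73.968, 40.785, "trip-2")   # Upper West Side
print(index.query(-73.99, 40.74, -73.98, 40.75))  # ['trip-1']
```

Real spatio-temporal indexes add a time dimension and tune cell granularity to the data density, but the cell-pruning principle is the same.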

Stephen Wolfram, CEO, Wolfram Research

Platinum Sponsors: Planet OS, Teradata

Gold Sponsors: Snowflake

Silver Sponsors: MonetDB, Synthos Technologies, Vertica

Gordon and Betty Moore Foundation

Stanford University does not purchase or endorse any sponsors’ products or services.