XLDB - Extremely Large Databases

XLDB-2016 Abstracts



Per Brashers, Founder of Yttibrium LLC, inventor, strategist

In this talk we will examine the forces, both pro and con, that are enabling a new paradigm of compute and storage in warehouse-style compute centers. We will briefly discuss different viewpoints along the data lifecycle, as well as offer an in-depth analysis of the architectural tradeoffs to be had in scale computing.


Vinayak Borkar, Chief Technology Officer and Executive Vice President of Engineering, X15 Software, Inc.

Traditional database management systems have mostly been designed and optimized to operate in a "schema-first" setting. Before any data can be loaded into a relational database, the user must provide table definitions that restrict the type of data that can be stored in those tables. In return, applications enjoy the benefits of schema enforcement preventing invalid data from being recorded in the database. The database also uses the type information provided in the schema to store data in an optimized manner and to execute queries efficiently. While the "schema-first" paradigm perhaps works well for applications where the data model is closely controlled by the owner of the application, it creates a liability for applications that ingest data whose structure is not well understood at the time of ingestion.

Log data analysis applications analyze log data collected from a variety of systems and applications with the goal of making sense of system and user behavior across these entities. Log data are inherently semi-structured and usually have system-specific formats. Log formats can change from one version to another of the producing application in unanticipated ways. Requiring structure to be extracted into a strict schema before storage leads to brittle systems that invariably end up losing some information from incoming data or introducing delays in getting the data prepared for analysis. These kinds of applications are better served by systems that allow late-binding of a schema during analysis. Usually, systems that allow late-binding (like semi-structured databases and document stores) have relatively higher storage costs and worse access-performance than relational databases due to the "bloat" caused by schema-ignorance.

In this talk, we will explore some design and implementation strategies that can be used to optimize query performance without compromising the benefits of late-binding.
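
As a rough illustration of the late-binding idea (this is not X15's implementation; the records, field names, and tiny in-memory "store" below are invented), raw log lines can be kept verbatim at ingest time and fields bound only when a query runs:

```python
# A minimal, hypothetical sketch of schema-on-read over heterogeneous log records.
import json
from collections import Counter

raw_log_store = [  # records kept verbatim at ingest time, no table definition required
    '{"ts": "2016-05-24T10:01:02Z", "level": "ERROR", "svc": "auth", "msg": "timeout"}',
    '{"ts": "2016-05-24T10:01:03Z", "level": "INFO",  "svc": "web",  "user": 42}',
    '{"ts": "2016-05-24T10:01:04Z", "level": "ERROR", "svc": "web",  "code": 500}',
]

def project(record, field, default=None):
    """Bind a field to a record lazily; missing or malformed fields do not break ingest."""
    try:
        return json.loads(record).get(field, default)
    except ValueError:
        return default  # malformed record: keep it in the store, skip the field here

# "Query time": count errors per service without ever having declared a schema.
errors_per_service = Counter(
    project(r, "svc", "unknown")
    for r in raw_log_store
    if project(r, "level") == "ERROR"
)
print(errors_per_service)  # Counter({'auth': 1, 'web': 1})
```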


Andrew Caldwell, Senior principal engineer in the Big Data division of AWS

With the ubiquity of data sources and cheap storage, today's enterprises want to collect and store a wide variety of data, even before they know what to do with it. Examples include IoT streams, application monitoring logs, point-of-sale transactions, ad impressions, mobile events, and more. This data is typically a mix of structured and unstructured, streaming and static, with varying degrees of quality. Given this variety and the increasing need to be data-driven, customers want a choice of tools to leverage this data for business advantage.

Towards this end, Amazon Web Services (AWS) offers a variety of fully-managed data services that can be easily composed given its service-oriented architecture. In this talk, we provide an overview of the breadth of data services available on AWS: storage, OLTP, data warehouse, and streaming. We give examples of how customers leverage and compose these to handle their big data use cases from traditional BI and analytics to real-time processing and prediction. Finally, we touch on some lessons from operating such services at scale.


Tina Cartaro, Physicist and software developer, SLAC National Accelerator Laboratory


Glenn Chisholm, CTO Cylance

Breakthroughs in data analytics seem almost commonplace in recent years, and yet the effective application of data analytics to information security seems elusive, as evidenced by the increasing frequency of cyber breaches in the news. If a company can infer (highly private) customer information from click streams and purchase history, it should be possible to apply those same techniques to discover malicious behavior in computing systems. This session discusses the problem of using data in security: what makes the problem different from other domains, where the scalability and complexity problems are, what techniques are successful (or unsuccessful), and future directions in data management and analytics for security.


Shirshanka Das, Architect for LinkedIn’s Data & Analytics team

LinkedIn has a rich ecosystem of data-driven products like People You May Know, Who Viewed My Profile, and a multitude of recommendation products, as well as business-facing insights products. Building a data product end-to-end requires a lot of technologies to come together and work seamlessly, requiring innovations far beyond traditional OLTP and data warehousing techniques.
A major focus at LinkedIn has been to improve the agility of the engineers and data scientists in creating these data products end to end. To that end, we have developed a number of systems in the data ecosystem. These include Espresso, our distributed online datastore; Kafka, our central activity pipeline for carrying all our user activity and logging data; Gobblin, a platform to ingest and manage a variety of batch and streaming data sources at scale; Samza and Hadoop, which work in tandem to form the computation layer for real-time and batch data; and Pinot, a platform for extremely fast OLAP serving.
These systems power a multitude of analytics use-cases such as reporting and A/B testing, and enable features such as real-time root-cause analysis and anomaly detection. In this talk, we will go into the details of some of these systems and show how they work together to provide a self-service real-time data ecosystem.


Binny Gill, Chief Architect Nutanix

When storage became faster, it challenged the decades-old 3-tier architecture, which led to the birth of hyperconvergence. The aftermath of hyperconvergence will be far-reaching, going beyond just the notion of bringing compute and storage together. It will enable building a true software-defined data-center platform that brings about a massive consolidation of hitherto custom-built appliances and functions in the data center. Agility in application management will become key, devops will become real, and a platform that enables hybrid consumption of clouds, both private and public, will be born. We are witnessing the dawn of the era of the Enterprise Cloud.


Julian Hyde, Database Architect (Hortonworks)

Streaming is a paradigm for data processing that is rapidly growing in popularity, because it allows high throughput and low-latency responses and efficiently manages multitudes of IoT devices. Is it an alternative to database processing, or is it complementary? Julian Hyde argues for applying the database paradigm to streaming systems, using SQL as a high-level language for streaming. He presents streaming SQL, a superset of standard SQL developed in collaboration with several Apache projects, and the use cases it can solve, such as combining data in flight with historic data at rest. He also shows how query optimization techniques can make streaming applications more efficient.
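
As a rough, plain-Python analogue of the kind of query the talk describes (this is not streaming SQL syntax; the window size, catalog, and event tuples below are invented), the sketch aggregates events in flight over tumbling windows and joins each one against a historic table at rest:

```python
# Windowed aggregation over a stream, joined with static reference data.
from collections import defaultdict

product_catalog = {"p1": "books", "p2": "games"}   # historic data "at rest"

# (event_time_seconds, product_id, amount) tuples arriving on a stream
events = [(3, "p1", 10.0), (42, "p2", 5.0), (61, "p1", 7.5), (75, "p1", 2.5)]

WINDOW = 60                                        # tumbling window size in seconds
totals = defaultdict(float)                        # (window_start, category) -> running sum

for event_time, product_id, amount in events:
    window_start = (event_time // WINDOW) * WINDOW          # assign event to its window
    category = product_catalog.get(product_id, "unknown")   # join in-flight data with data at rest
    totals[(window_start, category)] += amount

for (window_start, category), total in sorted(totals.items()):
    print(f"window={window_start:>3}s category={category:<6} total={total}")
# window=  0s category=books  total=10.0
# window=  0s category=games  total=5.0
# window= 60s category=books  total=10.0
```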


Raffael Marty, VP Security Analytics (Sophos)

Ensuring security of a company’s data and infrastructure has largely become a data analytics challenge. It is about finding and understanding patterns and behaviors that are indicative of malicious activities or deviations from the norm. Data, Analytics, and Visualization are used to gain insights and discover those malicious activities. These three components play off of each other, but also have their inherent challenges. A few examples will be given to explore and illustrate some of these challenges.


Mohan Kumar, Intel Fellow, Data Center Group

This talk provides an overview of Intel Rack Scale Architecture and discusses how this architecture addresses underutilized and stranded resources in a data center through resource pooling. We will also specifically discuss the concept of a pooled system, storage node pooling, and pooling of PCIe as well as NVMe-based storage. The impact of pooling on latency, radix, and failure domains will also be discussed. Further, pooling introduces a need for composition of the platform. We will also discuss the characteristics of such platform composition and how software can emerge to take advantage of these capabilities.


Luke McConoughey, Silicon Valley Bank Group

The Security Operations Center (SOC) in a corporation is charged with protecting and defending assets and operations from attackers both large and small who apply constantly changing tactics with tools that only seem to increase in sophistication. Details of day-to-day tasks (sifting through alerts, identifying threats, etc.) will be examined, noting how data provides insight and assistance. Tracking longer-term campaigns from advanced persistent threats (APT) will also be discussed.


Nirav Merchant, Director of Bio Computing and co-PI for the NSF-funded life sciences cyberinfrastructure platform CyVerse (formerly iPlant Collaborative)

Over the last decade, the discipline of life sciences has benefited tremendously from new, massively parallel, and highly quantitative technologies. These technologies have facilitated rapid data acquisition at an increasingly higher resolution and throughput across all forms of modalities. Transformational advances in information technology have complemented and fueled this phenomenal growth in data acquisition, including cloud and high performance computing, large-scale data management systems, and high-bandwidth networks.

Managing the lifecycle of these datasets from acquisition and analysis to publication and archiving often necessitates interdisciplinary collaborations with geographically distributed teams of experts. A common requirement for these interdisciplinary teams is access to integrated computational platforms that are flexible and scalable. These platforms must provide access to appropriate hardware and software resources that support diverse data types, computational scalability needs, and the usage patterns of diverse research communities.

CyVerse, a National Science Foundation (NSF) funded cyberinfrastructure (CI) project launched in 2008 (as iPlant Collaborative), is meeting the needs associated with managing data-driven research, collaborations, and discoveries. While originally targeted toward the plant science research community, it now has the expanded mandate to provide CI support across the life sciences.

This talk will provide an overview of the core capabilities of the CyVerse CI and our experiences in onboarding user communities (with varying levels of computational skills and expertise) onto shared computational infrastructure.


Jelena Pjesivac-Grbovic, senior software engineer in Systems Infrastructure at Google

Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business (e.g. Web logs, mobile usage statistics, and sensor networks). At the same time, consumers of these datasets have evolved sophisticated requirements, such as event-time ordering and windowing by features of the data themselves. On top of that, consumers want answers now. At Google, we’ve evolved our earlier work on batch and streaming systems (including MapReduce, FlumeJava, and Millwheel) into Dataflow (Apache Beam), a new programming model that allows users to clearly trade off correctness, latency, and cost. I’ll provide an overview of this model and the use cases that inspired it, and dive deep into how Dataflow makes processing of unbounded, unordered data accessible and efficient.
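
A minimal sketch of the model in the Beam Python SDK, assuming a bounded in-memory source for brevity (the keys, timestamps, and window size are invented; a production pipeline would read from an unbounded source such as Pub/Sub or Kafka and add triggers for early and late results):

```python
# Assign event timestamps, window by event time, and aggregate per key per window.
import apache_beam as beam
from apache_beam.transforms import window

# (user, clicks, event_time_in_seconds): event time, not processing time
events = [("alice", 1, 10), ("bob", 1, 20), ("alice", 2, 70), ("alice", 1, 75)]

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(events)
        | "Stamp" >> beam.Map(
            lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
        | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 1-minute event-time windows
        | "SumPerUser" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)  # ('alice', 1), ('bob', 1) in window [0,60); ('alice', 3) in [60,120)
    )
```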


Prabhat, Data and Analytics Services Group Lead, National Energy Research Scientific Computing Center (NERSC)

Berkeley Lab and NERSC are at the frontier of scientific research. Historically, NERSC has provided leadership computing for the computational science community, but we now find ourselves tackling Big Data problems from an array of observational and experimental sources. In this talk, I will review the broad landscape of Data Analytics and Data Management problems at NERSC. An important class of problems requires real-time streaming and analytics. Examples include real-time transient detection from data pipelines in astronomical observations, real-time material reconstruction from light sources, and compound identification in real time from mass-spectrometry data sets. I will review our hardware and software strategy for tackling these workloads. The talk will conclude with opportunities to engage with NERSC, Berkeley Lab, and the scientific enterprise in the Big Data arena.


Raghu Ramakrishnan, CTO for Data, Technical Fellow, Microsoft

Until recently, data was gathered for well-defined objectives such as auditing, forensics, reporting and line-of-business operations; now, exploratory and predictive analysis is becoming ubiquitous, and the default increasingly is to capture and store any and all data, in anticipation of potential future strategic value. These differences in data heterogeneity, scale and usage are leading to a new generation of data management and analytic systems, where the emphasis is on supporting a wide range of very large datasets that are stored uniformly and analyzed seamlessly using whatever techniques are most appropriate, including traditional tools like SQL and BI and newer tools, e.g., for machine learning and stream analytics. These new systems are necessarily based on scale-out architectures for both storage and computation.

Hadoop has become a key building block in the new generation of scale-out systems. On the storage side, HDFS has provided a cost-effective and scalable substrate for storing large heterogeneous datasets. However, as key customer and systems touch points are instrumented to log data, and Internet of Things applications become common, data in the enterprise is growing at a staggering pace, and the need to leverage different storage tiers (ranging from tape to main memory) is posing new challenges, leading to caching technologies, such as Spark. On the analytics side, the emergence of resource managers such as YARN has opened the door for analytics tools to bypass the Map-Reduce layer and directly exploit shared system resources while computing close to data copies. This trend is especially significant for iterative computations such as graph analytics and machine learning, for which Map-Reduce is widely recognized to be a poor fit.

While Hadoop is widely recognized and used externally, Microsoft has long been at the forefront of Big Data analytics, with Cosmos and Scope supporting all internal customers. These internal services are a key part of our strategy going forward, and are enabling new state-of-the-art external-facing services such as Azure Data Lake and more. I will examine these trends, and ground the talk by discussing the Microsoft Big Data stack.


Greg Schvey, Founding Partner of Axoni

Blockchain technology has driven a new wave of profound technical innovations, yet remains largely misunderstood both in its fundamentals and enterprise applications. This nascent infrastructure technology opens the door for low-cost, reliable, distributed data infrastructure across a range of industries by maintaining a synchronized ledger of messages between parties. The presentation will be focused on processing and managing large data sets generated by distributed ledger networks, as viewed through the lens of capital markets applications. This includes data synchronization across financial institutions, computational consensus for event processing in a distributed network, and the uniquely powerful but complex aspects of managing and analyzing blockchain data.


Mehul Shah, Software Development Manager at Amazon

With the ubiquity of data sources and cheap storage, today's enterprises want to collect and store a wide variety of data, even before they know what to do with it. Examples include IoT streams, application monitoring logs, point-of-sale transactions, ad impressions, mobile events, and more. This data is typically a mix of structured and unstructured, streaming and static, with varying degrees of quality. Given this variety and the increasing need to be data-driven, customers want a choice of tools to leverage this data for business advantage.

Towards this end, Amazon Web Services (AWS) offers a variety of fully-managed data services that can be easily composed given its service-oriented architecture. In this talk, we provide an overview of the breadth of data services available on AWS: storage, OLTP, data warehouse, and streaming. We give examples of how customers leverage and compose these to handle their big data use cases from traditional BI and analytics to real-time processing and prediction. Finally, we touch on some lessons from operating such services at scale.


Kelly Stirman, VP Strategy, Product Marketing at MongoDB, Inc.

Most engineering projects are the end result of a series of compromises made to make the best use of resources and time in partial fulfillment of the original goals. MongoDB set out to build a new kind of distributed database that easily scales, mostly takes care of itself, and preserves the key features that have made relational databases so useful for decades. Some of the decisions we made early on turned out to provide benefits well beyond what we anticipated, while others produced unintended challenges. In this talk we will explore the goals and design of MongoDB, examine a number of key decisions we made, and describe some of the decisions we will need to make in the future.


Eric Tschetter, Creator of Druid, distinguished engineer at Yahoo

The costs of processing data generally scale with the amount of data to be processed. One of the staple techniques for reducing the size of a data set is summarization, where you choose to remove some dimensions and aggregate a priori. Summarization has the negative consequence of losing some fidelity of the data. For example, dimensions like "user id", which have high cardinality, tend to be the first targets for summarization, but their removal has the unfortunate side effect that you lose the ability to compute the number of unique users.

This talk introduces a strategy to recover some of the fidelity lost to summarization by leveraging sketches, a class of approximation algorithms. Specifically, I will introduce three sketch types that we use at Yahoo for both internal and external reporting purposes: theta sketches, tuple sketches and quantile sketches. We believe that this field of study is key to the future of processing large data sets, yet it is still relatively nascent. With this talk, we hope to motivate increased research into this approach to scaling data processing systems and foster the exploration of more novel algorithms and methodologies for maintaining fidelity in summaries.
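
As a toy illustration of the idea behind theta sketches (a K-minimum-values estimator written from scratch, not Yahoo's DataSketches library), distinct counts can be approximated from only the k smallest hash values seen:

```python
# Keep the k smallest hash values (mapped to (0, 1]) and estimate the distinct count
# from the k-th smallest one. Parameters and data are invented for illustration.
import hashlib
import random

K = 256

def to_unit(value):
    """Hash an item to a pseudo-uniform value in (0, 1]."""
    digest = hashlib.sha1(str(value).encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2 ** 64

def kmv_estimate(items, k=K):
    mins = sorted(set(to_unit(x) for x in items))[:k]  # k smallest distinct hash values
    if len(mins) < k:
        return float(len(mins))        # saw fewer than k distinct items: count is exact
    return (k - 1) / mins[-1]          # standard KMV estimator

stream = [f"user-{random.randint(0, 99999)}" for _ in range(500_000)]
print("exact distinct: ", len(set(stream)))
print("sketch estimate:", round(kmv_estimate(stream)))  # close to exact; error shrinks as k grows
```

Because only k hash values are retained, the sketch has a fixed size regardless of stream length, and sketches built over different partitions of the data can be unioned by merging their value sets and re-truncating to k.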


Henry Zhang, Senior Product Manager, Amazon Web Services

Preserving PB-scale datasets for the long term is a challenging task that calls for a reliable, scalable, and cost-effective solution. In this session, we explore how customers from various industries are using Amazon Glacier to build data archiving applications in the cloud. We discuss how to identify the most applicable workloads and recommend a few best practices on data management, ingest, retrieval, and security controls to help those of you in search of a cloud archiving solution.
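
A minimal boto3 sketch of the archive-and-retrieve flow discussed in the session (the vault and file names are placeholders; a real application would add tree-hash verification, an archive catalog, multipart uploads for large files, and IAM and Vault Lock policies):

```python
# Archive one file to Glacier and start an asynchronous retrieval job for it.
import boto3

glacier = boto3.client("glacier")
VAULT = "example-archive-vault"          # placeholder vault name

glacier.create_vault(vaultName=VAULT)    # CreateVault is idempotent

# Ingest: upload one archive (large files would use multipart upload instead).
with open("dataset-2016-06.tar.gz", "rb") as f:   # placeholder file name
    upload = glacier.upload_archive(
        vaultName=VAULT,
        archiveDescription="monthly dataset snapshot",
        body=f,
    )
archive_id = upload["archiveId"]          # store this ID; it is required for retrieval

# Retrieval: Glacier restores are asynchronous jobs, typically taking hours.
job = glacier.initiate_job(
    vaultName=VAULT,
    jobParameters={"Type": "archive-retrieval", "ArchiveId": archive_id},
)
print("retrieval job started:", job["jobId"])
```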


Marcin Zukowski, Co-Founder of Snowflake Computing

Snowflake is a multi-tenant, transactional, secure, highly scalable and elastic system with full SQL support and built-in extensions for semi-structured and schema-less data. Snowflake introduces a new DBMS architecture decoupling storage and compute, providing both performance benefits and cost savings, as well as novel usage scenarios. The system is offered as a pay-as-you-go cloud service. Snowflake is used in production by a rapidly growing number of organizations. The system runs millions of analytical queries per day over multiple petabytes of raw data.
