XLDB - Extremely Large Databases

Accepted Lightning Talk Abstracts

Florin Rusu / University of California, Merced

As the datasets and models used in machine learning tasks grow larger, the time available for calibration grows shorter. Thus, efficiency in training large-scale machine learning models has become a key concern. Distributed gradient descent optimization is the most widespread algorithm for training large-scale machine learning models. It is well studied in the research community and implemented in all the popular data analytics frameworks.

In this talk, we present a scalable and efficient parallel solution for distributed gradient descent optimization. It has three distinct features. First, the efficiency of our solution is limited only by the I/O throughput, while linear speedup is achieved. Second, we integrate both data and model parallelism in the training process. Third, by utilizing multi-query processing and online aggregation techniques, we support concurrent hyper-parameter testing and convergence detection on-the-fly. We present implementations of the solution for several tasks, including logistic regression, SVM, and matrix factorization, in GLADE. They achieve more than an order of magnitude improvement over the state-of-the-art for certain models.
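The combination of data parallelism and concurrent hyper-parameter testing described above can be illustrated with a minimal sketch. It is not the GLADE implementation: the function names, the synthetic data, and the choice of logistic regression with three candidate step sizes are all assumptions made for illustration, and model parallelism and online aggregation are left out. The point it shows is that a single scan of each data partition can serve every candidate model at once.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def partial_gradients(partition, models):
    """One scan of a partition yields a gradient and a loss for every candidate model."""
    dim = len(models[0])
    grads = [[0.0] * dim for _ in models]
    losses = [0.0] * len(models)
    for x, y in partition:
        for m, w in enumerate(models):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j in range(dim):
                grads[m][j] += (p - y) * x[j]
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard the logarithms
            losses[m] -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return grads, losses

def train(partitions, step_sizes, dim, iterations=50):
    """Batch gradient descent; one model per step size, all sharing each data scan."""
    n = sum(len(p) for p in partitions)
    models = [[0.0] * dim for _ in step_sizes]
    losses = [float("inf")] * len(step_sizes)
    for _ in range(iterations):
        # In a distributed setting each partition would be scanned by its own worker.
        results = [partial_gradients(p, models) for p in partitions]
        # Mean loss of each model as measured during this scan (before the update).
        losses = [sum(r[1][m] for r in results) / n for m in range(len(models))]
        for m, step in enumerate(step_sizes):
            for j in range(dim):
                g = sum(r[0][m][j] for r in results) / n
                models[m][j] -= step * g
    return models, losses

def make_data(n, seed=0):
    # Hypothetical synthetic data: label is 1 when x1 + x2 > 0.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append(([1.0, x1, x2], 1 if x1 + x2 > 0 else 0))
    return data

data = make_data(400)
partitions = [data[i::4] for i in range(4)]   # four data partitions
step_sizes = [0.01, 0.1, 1.0]                 # candidate hyper-parameters
models, losses = train(partitions, step_sizes, dim=3)
best = min(range(len(step_sizes)), key=lambda m: losses[m])
```

Here `best` indexes the step size with the lowest measured loss. Because the per-model losses are available after every scan, a convergence check could stop poorly performing candidates early, which is the role the abstract assigns to online aggregation.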

Robert Grossman / University of Chicago

As biomedical datasets continue to grow in size, clouds and commons of biomedical data are beginning to play an important role in the research community. We describe the architecture of the Genomic Data Commons (GDC) and Bionimbus Protected Data Cloud (PDC), which are petabyte scale computing platforms for genomic data and associated clinical data. We also describe how large scale computations are being done over the data in the GDC and PDC and some of the ways that biomedical commons and biomedical clouds are beginning to interoperate and to peer through digital ID and metadata services.

Josh Walters / Yahoo

At 150 billion events per day, the audience data pipelines at Yahoo operate at the very largest scale. This provides us with many interesting and difficult challenges, one of which is deep data partitioning, all while operating at an ever increasing data load.

Data partitioning helps to bring structure to complex data sets by splitting the data into separate logical groups. Splitting data into many nested groups, called deep data partitioning, allows for even more fine grained queries by analysts while utilizing fewer computational resources.

In this talk we present dynamic fractional partitioning, a method for deep data partitioning that is currently in use in Yahoo's largest MapReduce data pipelines. By performing ad-hoc data sampling during batch data processing, we are able to dynamically calculate the estimated size of deeply nested data partitions on the fly. This allows us to perform near optimal partitioning in a secondary MapReduce stage with minimal partition locality splitting.
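The abstract does not spell out the algorithm, so the following is only a rough sketch of the general idea: estimate nested partition sizes from a sample, then plan reducer assignments so that a partition is split only when it is too large for one reducer. The function names, the Bernoulli sampling, the greedy least-loaded placement, and the synthetic records are all assumptions and should not be read as Yahoo's method.

```python
import math
import random
from collections import Counter

def estimate_sizes(records, key_fn, sample_rate, seed=0):
    """Estimate the record count of each nested partition from a Bernoulli sample."""
    rng = random.Random(seed)
    sampled = Counter(key_fn(r) for r in records if rng.random() < sample_rate)
    # Scale sampled counts back up. Keys that were never sampled are absent here
    # and would need a default route in a real job.
    return {key: count / sample_rate for key, count in sampled.items()}

def plan_reducers(estimated, num_reducers):
    """Greedy plan: keep a partition whole if it fits, otherwise split it into fractions."""
    capacity = sum(estimated.values()) / num_reducers
    loads = [0.0] * num_reducers
    plan = {}
    # Place the largest partitions first.
    for key, size in sorted(estimated.items(), key=lambda kv: -kv[1]):
        pieces = min(num_reducers, max(1, math.ceil(size / capacity)))
        # Choose the currently least-loaded reducers, one per piece.
        targets = sorted(range(num_reducers), key=lambda i: loads[i])[:pieces]
        for r in targets:
            loads[r] += size / pieces
        plan[key] = [(r, 1.0 / pieces) for r in targets]
    return plan, loads

# Hypothetical skewed event data keyed by (country, device, hour bucket).
rng = random.Random(1)
countries = ["us"] * 8 + ["in"] * 3 + ["br"]
devices = ["web"] * 3 + ["app"]
records = [(rng.choice(countries), rng.choice(devices), rng.randrange(4))
           for _ in range(20000)]

estimated = estimate_sizes(records, key_fn=lambda r: r, sample_rate=0.05)
plan, loads = plan_reducers(estimated, num_reducers=16)
```

Each entry of `plan` maps a nested key to a list of `(reducer, fraction)` pairs. Only keys whose estimated size exceeds the per-reducer capacity receive more than one pair, which keeps most partitions on a single reducer.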

Siying Dong / Facebook

MySQL and InnoDB have been good tools for our transaction processing workloads. In many of Facebook's MySQL+InnoDB services that use SSD, the primary bottleneck is storage capacity rather than random or sequential device throughput. A secondary bottleneck is device lifetime, which is determined by the total number of bytes written to the device. In experimental comparisons between RocksDB and InnoDB we determined that compressed InnoDB would use 2X more storage and write 3X more bytes to storage. These results have been tested using web-scale workloads as well as benchmarks. RocksDB has less write and space amplification because it uses a variation of the Log Structured Merge Tree algorithm while InnoDB is an update-in-place B-Tree. We anticipate RocksDB will be able to replace many uses of InnoDB and have work in progress to get the same quality of service from it that we currently provide with InnoDB.

Hannes Muhleisen / Centrum Wiskunde & Informatica (CWI)

Data analyses are increasing in complexity, from groups and counts to statistical models and probabilities. This increased complexity requires more elaborate prototyping. Typically, these prototypes will be written in a data-centric scripting language such as R or Python. However, the runtime environments behind those languages are incapable of keeping up with increasing amounts of data. Hence, expensive re-coding of the analysis is currently inevitable.

To address the problem, we propose to treat data analysis programs as a declarative intent rather than an imperative contract written in stone. Doing so allows us to apply classical query optimisation to these programs, for example reordering joins or pushing down selections.

To perform these optimisations, we transform a data analysis program into a function call graph, where we can identify the data access operators and also modify the call graph before actual execution. By applying relational optimisation methods, we hope to be able to significantly reduce the computational cost of complex statistical analyses. However, the presence of arbitrary user code in these analysis scripts requires carefully checking whether that code has side effects, since side effects would prohibit arbitrary reshuffling of the execution order. This challenge can be compared to query optimisation in the presence of many arbitrary User-Defined Functions. Also, the sheer number of different operators requires adaptation of the relational equivalence rules.
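As a rough illustration of one such rewrite, the sketch below pushes a selection beneath an inner equi-join in a tiny operator tree and refuses to do so when the predicate is flagged as having side effects. The node classes, the `pure` flag, and the table and column names are hypothetical, and the sketch assumes column names are unique across join inputs apart from the join key. It is not the authors' call-graph representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scan:
    table: str
    columns: frozenset

@dataclass(frozen=True)
class Join:
    left: object
    right: object
    key: str            # inner equi-join on this column

@dataclass(frozen=True)
class Select:
    child: object
    column: str         # the only column the predicate reads
    predicate: str
    pure: bool = True   # False when the predicate code may have side effects

def columns(node):
    """Columns produced by a plan node."""
    if isinstance(node, Scan):
        return node.columns
    if isinstance(node, Select):
        return columns(node.child)
    return columns(node.left) | columns(node.right)

def push_down(node):
    """Move side-effect-free selections as close to the scans as possible."""
    if isinstance(node, Scan):
        return node
    if isinstance(node, Join):
        return Join(push_down(node.left), push_down(node.right), node.key)
    child = push_down(node.child)
    if node.pure and isinstance(child, Join):
        if node.column in columns(child.left):
            pushed = push_down(Select(child.left, node.column, node.predicate))
            return Join(pushed, child.right, child.key)
        if node.column in columns(child.right):
            pushed = push_down(Select(child.right, node.column, node.predicate))
            return Join(child.left, pushed, child.key)
    # Impure predicates, or selections with nothing to push past, stay in place.
    return Select(child, node.column, node.predicate, node.pure)

responses = Scan("responses", frozenset({"person_id", "answer"}))
people = Scan("people", frozenset({"person_id", "age"}))

plan = Select(Join(responses, people, "person_id"), "age", "age >= 65")
optimized = push_down(plan)

impure = Select(Join(responses, people, "person_id"), "age",
                "log_and_filter(age)", pure=False)
unchanged = push_down(impure)
```

In `optimized` the filter on `age` sits directly above the `people` scan, so fewer rows reach the join, whereas `unchanged` keeps the original order because moving code with side effects could alter the program's behaviour.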

We demonstrate our approach on a large-scale survey analysis use case. Overall, we aim to increase the capabilities of statistical programs to handle data and make the transition between prototypical data analysis and production deployment a matter of choosing the right interpreter.

At Yahoo, we have one of the most complex sets of needs around data. This, coupled with the fact that we have one of the largest data volumes in the world, makes for interesting problems requiring state-of-the-art solutions. Our core system is built to deliver on all our use cases, with scalability, performance, flexibility and extensibility as the key guiding principles. In this talk, we will focus on our system architecture, including Hadoop, Storm, Kafka, Sketches, Druid and a bit of the glue that holds it all together. We will also touch upon some of the key practical challenges and our open source contributions.

Malu Castellanos / HP

In the Big Data era, enterprises are storing increasing amounts of data with the goal of getting valuable insights to gain competitive advantage, which often involves applying complex analytics such as machine learning functions. This has fueled the emergence of specialized processing engines (e.g., Spark, Giraph) that offer a large repertoire of analytic functions such as machine learning and graph algorithms. These system architectures are often designed for specific problems and have different computational models, such as vertex-centric computations or stream processing. However, enterprises typically store their important data in a commercial RDBMS in order to get the benefits of a mature and robust data management system.

While most databases support user-defined functions and some have leveraged them to implement some machine learning algorithms, the functionality currently available to users is limited. In particular, functions that are not available in the database must be implemented by the user either in SQL or in a language supported by the database's user-defined function (UDF) interface. This is problematic because writing these algorithms in SQL or a supported language can be very cumbersome. Alternatively, a user could export the data into a system that supports the analytics functions they need. However, this requires manually exporting a potentially large amount of data from one system to another in a format that is understood by both systems, which is a non-trivial effort, and gives up the guarantees provided by the RDBMS.

In this work, we describe a framework to extend the computational capability of the Vertica database engine by integrating at different levels with specialized analytic engines, enabling users to access their functionality directly from within the database. This framework represents a step toward solving the problems of re-implementing advanced processing functions in a natively supported database language and manually transferring data between the database and specialized processing engines.

Rajkumar Sen / MemSQL

For a real-time database for analytics and transaction processing like MemSQL, complex analytical queries are the norm and need to be answered within a second or a few seconds. To achieve that, the query optimizer should generate the most efficient execution plan for any complex query. However, given that the queries themselves should be answered within a few seconds, the time budget for optimizing a query is very small. This creates a new challenge in query optimization: time-consuming components such as generation of bushy join trees and cost-based query rewrites need to be handled intelligently, so that the optimizer still generates a plan that is close to the optimal one while staying within the time budget. The MemSQL query optimizer attempts to solve the problem by doing a few smart things: (a) dividing the query optimization decisions into a global cluster-wide decision and a local node-level decision, (b) pushing down query operations as far as possible into the lowest levels of the query tree, and (c) generating bushy joins as part of the query rewrite module rather than the join order determining module.

John Poelman / IBM

Having realized the lack of formal methodologies and comprehensive tools to measure the performance of Big Data systems, Chaitan Baru, Tilmann Rabl, Milind Bhandarkar, Meikel Poess, and Raghu Nambiar founded the Big Data Benchmark Community (BDBC) in 2012 with bi-weekly conference calls and an international series of Workshops on Big Data Benchmarking (WBDB). The purpose of these events is to bring users, developers and providers of Big Data solutions together to present and discuss performance related aspects of Big Data.

In April 2014 the BDBC joined the SPEC Research Group (SPEC RG) and formed a Big Data working group to continue its work in the more formal setting of SPEC (see http://research.spec.org/working-groups/big-data-working-group.html). The mission of the newly founded group aligns with that of BDBC, namely to facilitate research and to engage industry leaders for defining and developing performance methodologies of Big Data applications.

In this lightning talk we will introduce the SPEC RG Big Data Working Group including its history, mission, and upcoming activities in 2015. Specifically, the group is continuing its efforts into further developing the BigBench Big Data benchmark by adding novel benchmark elements (e.g. graph processing, different use cases, and a more sophisticated update model), developing more reference implementations (e.g., Spark, Impala, Presto) and translating the BigBench technical specification into SPEC benchmark wording. The group is also reaching out into broader research topics, such as solving the Deep Pocket Benchmarking Dilemma (i.e. avoiding dominance by a single institution or company that invests large hardware configurations causing large but often inefficient benchmark results), developing tools that can generate large scale, complex, domain agnostic, sensible and deterministic structured and unstructured data sets on massively parallel systems, and identifying research overlap and integration issues between “traditional” data processing and emerging big data platforms. The 6th WBDB workshop will be held on June 16-17, 2015, in Toronto, Canada, and will feature the annual meeting of the SPEC RG Big Data Working Group.

Cuiping Pan / VA Palo Alto

Large scale whole genome and whole exome sequencing studies are increasingly being used to decode human evolutionary history and population stratification, as well as to understand the genetic architecture of human traits and diseases. Given the amount and diversity inherent in these data, a robust and scalable data handling solution is needed. Here we explored the utility of a cloud-based data analysis platform, Google BigQuery, to analyze 480 deeply sequenced (average genome coverage 50x) whole human genomes. We presented a data compression solution for reducing data amounts but retaining rich information from read mapping and variant calling. We then implemented in BigQuery a comprehensive set of metrics to monitor sample and variant data quality by examining variants at both the individual genome and population levels. Furthermore, we developed query solutions for mining the genotypes, biological functions, and medical information matching to clinical database records. Our experimentation demonstrates that BigQuery provides a robust, fast, interactive, and scalable solution for analyzing large amounts of whole genome sequencing data.

Julian Hyde / Hortonworks

Apache Calcite is an open source query planning framework written in Java. It is used in open source projects including Apache Hive and Apache Drill, and in several commercial products. The project joined the Apache Incubator in May 2014 and just released version 1.0.

Calcite's core consists of relational algebra operators, transformation rules, and a Volcano-like planning engine. There is an optional SQL parser, validator, and JDBC driver. Users provide metadata, cost model, statistics, traits (physical properties) and additional operators and rules as plug-ins.

If you are building a data management system, whether or not it is relational, Calcite provides a solid foundation for query preparation and optimization.

Calcite's built-in rules handle complex tasks such as join ordering and algorithm selection; materialized view rewrite and recommendation; multi-phase query planning; and query optimization targeting hybrid engines, and parallel and distributed query execution.

Nicholas Nystrom / Director, Strategic Applications, Pittsburgh Supercomputing Center

High-performance computing (HPC) systems intended to serve broad communities have traditionally excluded support for persistent, modern databases. Two new resources at the Pittsburgh Supercomputing Center (PSC) reverse that trend to allow new kinds and scales of data analytics, data-driven workflows, and distributed application architectures.

PSC recently received a $9.65M National Science Foundation (NSF) award to create a uniquely capable data-intensive HPC system designed to empower new research communities, bring desktop convenience to HPC, expand campus access, and help researchers facing challenges in Big Data to work more intuitively. Called Bridges, the new system will consist of tiered, large-shared-memory resources with nodes having 12TB, 3TB, and 128GB each, dedicated nodes for database, web, and data transfer, high-performance shared and distributed data storage, Hadoop acceleration, powerful new CPUs and GPUs, and a new, uniquely powerful interconnect. Bridges will feature persistent database and web services to support gateways, collaboration, and new levels of access to data repositories. Bridges will also support a high degree of interactivity, gateways and tools for gateway-building, a very flexible user environment including widely-used software such as R, Java, Python, and MATLAB, and virtualization for hosting application-specific environments, enhancing reproducibility, and interoperating with clouds. Bridges will be a resource on XSEDE, NSF's Extreme Science and Engineering Discovery Environment, and will connect with other computational resources, data resources, and scientific instruments.

Leading into Bridges, the Data Exacell (DXC) is a pilot project that is already underway to create, deploy, and test software and hardware building blocks designed to support data-analytic capabilities for data-intensive scientific research. The DXC features capabilities representative of Bridges, on which researchers in fields that traditionally have not used HPC (for example, genomics, causal discovery, multimedia event detection, the digital humanities, history, connectomics, cell imaging, radio astronomy, and linguistics) are already making substantial progress.

Somalee Datta / Stanford

We are seeing massive changes in how human diseases are addressed by combining omics, microbiome, sensor and traditional medical records. By a conservative estimate, it is possible to collect 2 Pb of medically relevant data over the lifetime of a patient. Technology is finally driving this revolution, with (a) instrument advances making measurements cheap and (b) Big Data scale secure analytics being easily accessible. Cost per human, while still non-trivial, is no longer a deal breaker. We will present Stanford's innovative approaches in driving this large scale healthcare data revolution.

Douglas J. Slotta

The genetic assemblies from over 60,000 human samples are stored in our Sequence Read Archive (SRA). Each of these assemblies represents the raw reads from sequencing the 3.5 billion base pairs of the human genome at an average of 30X coverage. This is efficiently stored in a custom format developed in-house, yet it is still over 2 petabytes of compressed data. While it is efficient to retrieve the genomic record for a given individual, it is very expensive to retrieve everyone's genomic record for a given location. To ameliorate this problem, it was decided that we should create an index in SciDB of the counts of reads for each location, for each sample. Due to the size of the data, which is 3 orders of magnitude greater than we had previously loaded into SciDB for our Genotype Archive, the usual text-based methods of loading into SciDB would not suffice. To that end, we created a custom operator to load the data, in parallel, directly into an array.

This reduced the load time for each sample from 30 minutes to 2 minutes. To further reduce the storage requirements, a reference-based encoding was employed and custom SciDB functions were created to facilitate queries. Currently this resource supports the NCBI Beacon, which, as part of the Global Alliance for Genomics & Health, allows anyone around the world to query for the existence of a given allele at a specific location across all of our genomes in less than 30 seconds. This type of live query was heretofore impossible. In the future, this data set will allow the analysis of the outcome of the sequencing pipeline in an aggregate fashion. The intent is to find those genomic regions which are poorly serviced by today's methodologies.

Joel Saltz / Cherith Chair of Biomedical Informatics

Integrative analyses of large scale spatio-temporal datasets play increasingly important roles in many areas of science and engineering. Our work is motivated by application scenarios involving complementary digital microscopy, radiology and “omic” analyses in cancer research. We are developing tools that will enable researchers to study tumors, their structure, their genetics and protein expression at microscopic scales. This is important in cancer research because cancer is a disease that involves complex interactions between cancer cells and surrounding and distant tissue. The computational objective is to leverage a coordinated set of image analysis, feature extraction and machine learning methods to integrate multi-scale biomedical information, predict disease progression and target therapies.

I will describe tools and methods our group has developed for extraction, management, and analysis of multi-scale image features along with the systems software methods for optimizing execution on high end CPU/GPU platforms. I will also describe how the tools and methods we develop generalize to other application scenarios that involve characterization, analysis, data mining and machine learning applied to large spatio-temporal datasets. Finally, I will briefly touch on how this work relates to broader exascale computing research efforts.

Kacper Surdy

Although the Higgs boson has been found, research at CERN continues and intensifies. The search for dark matter and supersymmetry could not be done without significant storage capacity, computing power and cutting-edge analytics methods.

In particular, control, logging and monitoring systems inject high-frequency streams into our databases. The logging system of the recently restarted Large Hadron Collider (LHC) alone is expected to generate about 100 TB of data per year. Studies on upgrades and future accelerator programs are going to push the demands on our database systems even further.

This presentation will describe complementing traditional relational databases with scalable, large scale, open source solutions. The combination not only improves performance for current loads but also allows new computing approaches to be applied. Furthermore, the benefits can be achieved at an unprecedented cost/performance ratio without sacrificing reliability.

Eric Tschetter

Sketches provide many ways to improve the performance of large scale data processing systems by introducing a bit of error into the result. This trade-off is often well worth making, but, because of our innate irrationalities as human beings, it doesn't get the attention and respect it deserves. This talk will quickly provide a glimpse into why exact answers are overrated and why approximation algorithms, like sketches, are well worth it.
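To make the trade-off concrete, here is a minimal sketch of one well-known approximation technique, a K-minimum-values (KMV) sketch for counting distinct items. It is a generic textbook construction chosen for illustration, not necessarily one of the sketches covered in the talk; the class name, the use of SHA-1 as the hash, and the default `k` are assumptions.

```python
import hashlib
import heapq

class KMVSketch:
    """K-minimum-values sketch: remembers only the k smallest hash values seen."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # max-heap of kept hashes, stored negated
        self._members = set()  # kept hashes, used to ignore duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-uniform value in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))    # fewer than k distinct items: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest   # standard KMV estimator

sketch = KMVSketch(k=1024)
for i in range(50000):
    sketch.add(f"user-{i}")
    sketch.add(f"user-{i}")   # duplicates do not change the sketch
approx = sketch.estimate()
```

The sketch keeps at most 1,024 hash values however many items stream through it, and the relative standard error of `approx` is roughly 1/sqrt(k), about 3% here. An exact distinct count would instead need memory proportional to the number of distinct items.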
