XLDB - Extremely Large Databases

Accepted Lightning Talk Abstracts



Increasingly, users of big-data analytics systems are experts in some domain area (e.g., the domain sciences) but are not necessarily experts in large-scale data processing. Today's big-data analytics systems, however, remain difficult to use, even for experts. In the CQMS project, we address important barriers to making big-data analytics more seamless. In this talk, we will present a quick overview of some of the key challenges and our recent findings in the area of big-data analytics usability. First, we will present the SnipSuggest system, an autocompletion tool for SQL queries aimed at easing query formulation; it provides on-the-go, context-aware assistance during query composition. Second, we will present our Sample-based Interactive Querying (SIQ) system, which automatically selects a ‘good’ small sample of an underlying large input database to facilitate debugging SQL queries over massive datasets. Finally, we will describe PerfXplain, a tool for explaining the performance of a MapReduce job running on a shared-nothing cluster. PerfXplain aims at helping users understand the performance they are getting from their analytics engine.
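
The following minimal Python sketch (ours, not the CQMS code; the query log and field names are invented) illustrates the popularity-based, context-aware suggestion idea behind SnipSuggest: candidate snippets are ranked by how often they appear in past queries that share the tables already present in the partial query.

    # Toy sketch of context-aware SQL snippet suggestion; not SnipSuggest itself.
    from collections import Counter

    # Hypothetical query log: each entry records the tables and predicates used.
    QUERY_LOG = [
        {"tables": {"runs", "events"}, "predicates": {"events.run_id = runs.id"}},
        {"tables": {"runs", "events"}, "predicates": {"events.energy > 100"}},
        {"tables": {"runs"},           "predicates": {"runs.year = 2012"}},
    ]

    def suggest_predicates(partial_tables, top_k=3):
        """Suggest predicates popular among past queries that used the same tables."""
        counts = Counter()
        for q in QUERY_LOG:
            if partial_tables <= q["tables"]:          # context match
                counts.update(q["predicates"])
        return [p for p, _ in counts.most_common(top_k)]

    print(suggest_predicates({"runs", "events"}))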


Stephen Brobst / Teradata Corporation

XLDB implementations require both extreme performance and extreme cost effectiveness. While it may be feasible for small-scale data problems to store content 100% in memory, this approach is cost-prohibitive for XLDB deployments. This talk discusses the design techniques and underlying technologies necessary to simultaneously optimize for the lowest cost per I/O as well as the lowest cost per terabyte of data stored. Measured performance results will be reported.


Ying Zhang / CWI
Bart Scheers / CWI and UvA
Martin Kersten / CWI

Data-intensive scientific research, such as in astronomy, calls for functional enhancements to DBMS technologies. To bridge the gap between the needs of data-intensive research fields and current DBMS technologies, we have introduced SciQL (pronounced ‘cycle’), a novel SQL-based array query language for scientific applications with both tables and arrays as first-class citizens. SciQL lowers the barrier to adopting relational DBMSs in scientific domains, because it includes functionality often found only in mathematics software packages.

In this talk, we demonstrate SciQL using examples taken from a real-life astronomical data processing system, the Transient Key Project (TKP) of the LOFAR radio telescope. In particular, we show how a full Stokes spectral light-curve database of all detected sources can be constructed by cross-correlation over multiple catalogues. By exposing the properties of array data to the relational DBMS, SciQL also opens up many opportunities to enhance the data mining possibilities for real-time transient and variability searches.
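
The NumPy sketch below is our illustration, not SciQL syntax, of the kind of catalogue cross-matching and light-curve assembly that SciQL is designed to express declaratively over arrays inside the DBMS; a flat-sky small-angle approximation is used for brevity, and all data are invented.

    # Conceptual sketch: match a new catalogue against the running source list
    # and append flux measurements to per-source light curves.
    import numpy as np

    def cross_match(running_ra, running_dec, cat_ra, cat_dec, radius_deg=0.003):
        """Index of the nearest running-catalogue source within radius_deg, else -1."""
        matches = np.full(len(cat_ra), -1, dtype=int)
        for i, (ra, dec) in enumerate(zip(cat_ra, cat_dec)):
            d2 = (running_ra - ra) ** 2 + (running_dec - dec) ** 2
            j = int(np.argmin(d2))
            if d2[j] < radius_deg ** 2:
                matches[i] = j
        return matches

    # light_curves[source_index] -> list of (timestamp, Stokes I flux) pairs
    light_curves = {0: [(0.0, 1.2)], 1: [(0.0, 3.4)]}
    m = cross_match(np.array([10.0, 20.0]), np.array([45.0, 46.0]),
                    np.array([10.001]), np.array([45.001]))
    for cat_idx, src_idx in enumerate(m):
        if src_idx >= 0:
            light_curves[src_idx].append((1.0, 1.3))   # new epoch's measurement
    print(light_curves[0])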


Peter Baumann / Jacobs University

The EarthServer initiative, a transatlantic consortium, has set out to establish comprehensive ad-hoc analytics support for massive spatio-temporal Earth science data. Examples include 1D timeseries, 2D satellite imagery, 3D x/y/t satellite image timeseries and x/y/z exploration data, and 4D x/y/z/t climate and ocean data. Based on a declarative, standardized array query language with extensive server-side optimization, six Lighthouse Applications covering all Earth sciences plus planetary geology are being established to allow distributed cross-dimensional data fusion on 600+ TB data archives.

We present the project strategy, the array DBMS platform, and the technical challenges addressed and results achieved. We also discuss how this technology has been applied in the space and life sciences.
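
As a rough illustration of the kind of server-side array operation meant here, the NumPy sketch below (our own, not the project's standardized query language, and with toy array sizes) averages a 4D x/y/z/t variable over a spatio-temporal box:

    import numpy as np

    temperature = np.random.rand(360, 180, 40, 365)       # x, y, z, t (toy sizes)
    box = temperature[100:120, 40:60, 0:10, 200:230]       # spatio-temporal subset
    print("mean over subset:", float(box.mean()))          # server-side aggregation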


Timothy M. Shead, David H. Rogers, Patricia J. Crossno / SNL

Scientists generate and organize their data in ways that are unruly, poorly defined and constantly changing - a riot of structures and formats of varying organization, quality, and correctness. Despite this, we should expect to explore results interactively, examining features, trends, and anomalies in ways that support intuitive engagement with the data.

Taking inspiration from search engine spiders that must handle similar complexity within the World Wide Web, we are developing araXne: a tool to catalog scientific data, which scans, indexes, queries, and retrieves it in storage- and schema-agnostic ways, adapting to the way scientists store their data instead of dictating formats or structure.

Cataloging is necessary to deal with increased data size and complexity, time and energy constraints forcing reductions in data movement, and specialized scientific data formats (HDF, NetCDF) limiting query capabilities. We advocate for a dynamic external data catalog, which separates data storage functionality from indexing and retrieval.

Just as cards in a library’s card catalog offer multiple representations (title, author, subject) of underlying documents, metadata extracted by araXne present different perspectives on units of scientific data, such as tables, time series, images, animations, and meshes. Each perspective contains metadata and aggregated or summarized data appropriate to its type. Once perspectives have been extracted in-situ from a collection of data, they are sent to a separate host for indexing and storage. Users and tools interact with this "virtual card catalog," scanning and querying collections of araXne perspectives to generate user interfaces, preview underlying data, and specify analysis inputs.

Currently, we are using a Hadoop + HBase stack to store araXne perspectives, exploring how we can best utilize the sparse table structure for storage and high-throughput map-reduce functionality for indexing. Further, we are researching domain specific languages for higher-level querying, grouping, and subsetting of perspectives.
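
The sketch below is a hypothetical illustration, not araXne's actual code or schema, of extracting a perspective from an HDF5 file with h5py; the resulting summary dictionary is the sort of record that would be shipped to the external catalog for indexing.

    import h5py

    def extract_perspective(path):
        """Summarize every dataset in an HDF5 file (hypothetical perspective format)."""
        perspective = {"path": path, "datasets": []}
        with h5py.File(path, "r") as f:
            def visit(name, obj):
                if isinstance(obj, h5py.Dataset):
                    perspective["datasets"].append({
                        "name": name,
                        "shape": list(obj.shape),
                        "dtype": str(obj.dtype),
                        "attrs": {k: str(v) for k, v in obj.attrs.items()},
                    })
            f.visititems(visit)
        return perspective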


There has been a substantial increase in the amount of digital data collected over the last several decades. Together with this data comes the desire to transform it, through analysis, into useful information. Traditional analytic tools fail when the size of the data grows large, and general-purpose database systems used to manage large collections of data cannot perform the sophisticated analyses required. The analyst who wants to work with large datasets thus faces a dilemma when it comes to tool choice.
Our work resolves this dilemma with a hybrid strategy that integrates R and SciDB. R is a powerful data analysis tool, and SciDB is an array database management system. Our integration focuses on the automated movement of data between the two systems, in an effort to improve performance. Contributions include semantic mappings between the two languages, a cost-based interaction model, a start-to-finish system implementation, and test results quantifying the performance of the hybrid approach.
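
A toy sketch of a cost-based interaction model is shown below; this is our illustration only, and the cost constants are invented rather than the paper's calibrated values.

    # Decide whether to ship data from SciDB to R or push the computation into SciDB.
    def choose_execution_site(n_cells,
                              transfer_cost_per_cell=1e-7,   # SciDB -> R data movement
                              r_cost_per_cell=2e-8,          # per-cell work in R
                              scidb_cost_per_cell=5e-8,      # per-cell work in SciDB
                              scidb_fixed_overhead=1.0):     # e.g. query setup time
        ship_to_r = n_cells * (transfer_cost_per_cell + r_cost_per_cell)
        run_in_scidb = scidb_fixed_overhead + n_cells * scidb_cost_per_cell
        return "R" if ship_to_r < run_in_scidb else "SciDB"

    print(choose_execution_site(10_000))           # small array: cheap to move to R
    print(choose_execution_site(10_000_000_000))   # large array: keep in SciDB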


Chaitan Baru / SDSC
Milind Bhandarkar / Greenplum
Raghunath Nambiar / Cisco
Meikel Poess / Oracle
Tilmann Rabl / University of Toronto

The Workshop on Big Data Benchmarking (WBDB2012), held on May 8-9, 2012 in San Jose, CA, marked the first of a series of workshops aimed at developing industry-standard Big Data benchmarks. The workshop was attended by 60 invitees representing 45 different organizations both from industry and academia. Attendees were chosen based on their experience and expertise in one or more areas of Big Data, database systems, performance benchmarking, and big data applications. They agreed that there was both a pressing need and an opportunity for defining benchmarks to capture the end-to-end aspects of big data applications. In presentations and working sessions the workshop participants laid the foundation for future workshops by agreeing on key aspects of future Big Data benchmarks, e.g. the need to include metrics for performance as well as price/performance; the need to consider several costs, including total system cost, setup cost, and energy costs; and the need for an end-to-end benchmark serving the purposes of competitive marketing as well as product improvement.

As a result of this meeting, an informal “Big Data benchmarking community” has been formed, hosted by the San Diego Supercomputer Center, UC San Diego. Biweekly phone conferences are being used to keep this group engaged and to share information among members. Several benchmarking activities have been started by members of this community. These range from simple examples of end-to-end benchmarks to complex component setups. In our lightning talk and poster, we will introduce the Big Data Benchmarking Community and give an overview of current developments in Big Data benchmarking.

The next Big Data Benchmarking workshops are currently planned for December 17-18, 2012 in Pune, India and June 2013 in Xian, China.


Douglas J. Slotta / NCBI

With the advent of large-scale sequencing projects, such as the 1000 Genomes Project, it has become increasingly difficult to store this data in traditional relational form. Genotype data may be visualized as a sparse matrix, unbounded in each direction, where the rows represent the variations and the columns are the individuals. Each element therefore contains the genotype information for an individual at a given variation location. In addition to the genotype, there are also quality scores, read depths, likelihoods, and other measures of the genotype call. There are approximately 39.6 million known variation sites on the human genome, and the 1000 Genomes Project has sampled over 1200 people, with more to come. This is potentially trillions of data points, which will need to be accessed either by variation, by sample, or some combination thereof. In addition to storing and accessing the data, there is a need to compute aggregate values, such as frequency counts, over all of the data. This needs to be done for predetermined population subsets, as well as subsets chosen on the fly. SciDB is a new open-source array-based database, which its developers refer to as a Data Management and Analytics Software System (DMAS). A prototype genotype server, developed using SciDB, was successfully used as the back-end for the genotype portion of the 1000 Genomes Browser. Currently, the Genotype Archive is moving beyond the prototype stage, and is intended to be the central storage location of the underlying genotype data for all variation resources at the National Center for Biotechnology Information (NCBI).
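
The following toy sketch (ours, not NCBI or SciDB code) shows the sparse genotype matrix described above and one of the aggregates in question, the alternate-allele frequency over a population subset chosen on the fly:

    import numpy as np
    from scipy.sparse import csr_matrix

    # Rows are variation sites, columns are individuals, values are allele counts 0/1/2.
    genotypes = csr_matrix(np.array([
        [0, 1, 2, 0],
        [0, 0, 1, 1],
        [2, 2, 2, 1],
    ]))                                # 3 variants x 4 samples (toy data)

    subset = [0, 2, 3]                 # indices of a population chosen on the fly
    sub = genotypes[:, subset]
    alt_allele_freq = np.asarray(sub.sum(axis=1)).ravel() / (2 * len(subset))
    print(alt_allele_freq)             # per-variant alternate allele frequency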


Benoit Sigoure / StumbleUpon

OpenTSDB is a distributed, scalable Time Series Database (TSDB) that runs on top of HBase. It was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable.

This lightning talk will explain how OpenTSDB is used and deployed at StumbleUpon, and why it's so easy for us to add well over a billion data points per day to a database that currently stores around half a trillion data points. We'll show how we separated the read queries from the writes, and how load balancing and caching work in practice.
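
For reference, writing a single data point into OpenTSDB over its telnet-style put protocol (default port 4242) looks roughly like the sketch below; the server name and metric are made up for illustration.

    import socket
    import time

    def put(metric, value, tags, host="tsdb.example.com", port=4242):
        # OpenTSDB line format: put <metric> <timestamp> <value> <tagk=tagv> ...
        tag_str = " ".join(f"{k}={v}" for k, v in tags.items())
        line = f"put {metric} {int(time.time())} {value} {tag_str}\n"
        with socket.create_connection((host, port)) as sock:
            sock.sendall(line.encode("ascii"))

    put("proc.loadavg.1min", 0.42, {"host": "web01"})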


Justin Simonds and TC Janes, HP Master Technologists

Although most discussion today revolves around Big Data, social media, and unstructured data, most of the world is still using, and will continue to use, structured analytics - Enterprise Data Warehouses (EDW) and Operational Data Stores (ODS). The Analytic Cloud is a discussion on incorporating all of these components through an intelligent query router that would provide several functions: first, to determine where queries should be routed (that is, which platform is best suited to receive and process the query); second, to provide real-time datastream analysis for urgent analytics; and finally, to provide an in-memory database for redundant queries (for speed). Queries would enter the analytic cloud and either be answered by the intelligent query router or be directed to the EDW, ODS, Big Data analytic engine, or unstructured analysis (Hadoop/Autonomy). Submitters would not know or care which backend system responds to the query.
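
A toy routing decision along these lines might look like the following sketch; all rules, thresholds, and field names are invented for illustration and are not HP's design.

    def route(query, cache):
        if query.get("sql") in cache:                  # redundant query: serve from the in-memory DB
            return "in-memory cache"
        if query.get("unstructured"):                  # free text, documents, logs
            return "Hadoop/Autonomy"
        if query.get("max_latency_ms", 10_000) < 100:  # urgent, streaming analytics
            return "real-time datastream engine"
        return "EDW/ODS"

    cache = {"SELECT revenue FROM daily_summary"}
    print(route({"sql": "SELECT revenue FROM daily_summary"}, cache))   # in-memory cache
    print(route({"sql": "SELECT ...", "max_latency_ms": 50}, cache))    # datastream engine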


Arup De / UCSD and LLNL
Maya Gokhale / LLNL
Steve Swanson / UCSD
Rajesh Gupta / UCSD

Sequence alignment is becoming a cardinal operation of many bioinformatics applications such as DNA or protein sequencing, multiple sequence alignment, and genome assembly. With increasing computational power, these applications are increasingly I/O-limited due to the large size and complexity of biological databases (e.g., GenBank, EMBL, DDBJ, and UniProt) and the poor performance of existing storage technologies such as disk and flash. Emerging fast, byte-addressable non-volatile memory (NVM) technologies such as phase-change memory (PCM) and spin-torque transfer memory (STTM) are very promising and are approaching DRAM-like performance with lower power consumption and higher density as process technology scales. However, the existing solid-state drive (SSD) architecture is optimized for flash characteristics and is not adequate to exploit the full potential of NVMs, due to architectural and I/O interface (e.g., PCIe, SATA) limitations. To eliminate I/O bottlenecks and better utilize NVM technologies, we propose a novel active storage architecture called Maverick. It extends the conventional SSD architecture by offloading bioinformatics operations to exploit the low latency and high bandwidth of NVMs, significantly reducing the data traffic between the host and the storage. We provide a runtime library that enables the programmer to dispatch computations to the active storage system without dealing with the complications of the underlying architecture and communication management. This library ensures thread safety and a consistent view of data both from applications running on the host and from within the storage device.

We built a prototype of Maverick on the BEE3 FPGA platform and use protein sequences from the UniProt Knowledgebase Release 2012/02 database for our evaluation. We compare the performance and energy efficiency with a state-of-the-art PCIe-attached PCM-based SSD. Maverick outperforms the PCM-based SSD by up to 4x on performance and up to 6.5x on energy efficiency. We also compare Maverick's performance with existing storage arrays such as RAID disk and flash-based SSD, achieving speedups of more than 19x and 6.8x over the RAID disk and the flash-based SSD, respectively. This improvement can be attributed to the reduction of data transfer from the storage to the host, the elimination of I/O overhead, and the efficient data processing within the active storage system.


Subha Madhavan, Michael Harris, Krithika Bhuvaneshwar, Andrew Shinohara, Kevin Rosso, Yuriy Gusev / Georgetown University

The new and emerging field of Systems Medicine, an application of Systems Biology approaches to biomedical problems in the clinical setting, leverages complex computational tools and high dimensional large data sets to derive personalized assessments of disease risk. The Georgetown Clinical & Omics Development Engine (G-CODE) is a generic and flexible web-based platform that serves to enable basic, translational, and clinical research activities by integrating patient characteristics and clinical outcome data with a variety of large-scale high throughput research data in a unified environment to enable Systems Medicine. Through G-CODE, we hope to help enable the creation of new disease-centric portals and the widespread use of biomedical informatics tools by basic, clinical, and translational researchers for large ‘omics’ data sets.

This infrastructure was first deployed in the form of the Georgetown Database of Cancer (G-DOC®; http://gdoc.georgetown.edu), which includes a broad collection of bioinformatics and systems biology tools for analysis and visualization of four major “omics” types: DNA, mRNA, microRNA, and metabolites. One obvious area for expansion of the G-CODE/G-DOC platform infrastructure is support for next-generation sequencing (NGS), a highly enabling and transformative emerging technology for the biomedical sciences. Nonetheless, effectively utilizing this data is impeded by the substantial handling, manipulation, and analysis requirements that it entails. We have concluded that these are gaps that cloud computing is well positioned to fill, as this type of infrastructure permits rapid scaling with low input costs. As such, the Georgetown University team is utilizing the Amazon EC2 cloud to process whole-exome, whole-genome, RNA-seq, and ChIP-seq NGS data. The processed NGS data is being integrated within the web portal to ensure that it can be analyzed in the full context of other “omics” data. Through technology re-use, the G-CODE infrastructure will accelerate progress for a variety of ongoing programs in need of integrative multi-omics analysis, and advance our opportunities to practice effective systems medicine in the near future.


Ashutosh Chauhan / Apache Software Foundation & Hortonworks

In this talk, I will compare and contrast Pig and Hive. From an overall architecture perspective, the two systems exhibit remarkable similarity: both generate map-reduce jobs from a query written in a higher-level, domain-specific language. However, the two systems have made a number of different choices.
The first is the language itself. Pig exposes Pig Latin, a dataflow language, while Hive exposes HiveQL, a more traditional SQL-like declarative language. This choice of language has resulted in the adoption of the two systems by different user communities. Hive's user base is analysts who are primarily interested in generating reports that summarize data in interesting ways, while Pig has struck a chord with programmers writing complex scripts to garner deep insights from data, typically employing machine learning techniques.
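
To make the contrast concrete, here is the same aggregation, the usual word count, written in both languages (shown as strings for reference; the relation and table names are invented):

    PIG_LATIN_WORD_COUNT = """
    lines  = LOAD 'docs' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT(words) AS n;
    STORE counts INTO 'word_counts';
    """

    HIVEQL_WORD_COUNT = """
    SELECT word, COUNT(*) AS n
    FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
    GROUP BY word;
    """
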
Since these projects are open source, this difference in communities is driving them in different directions. The first is the feature set. Pig is moving towards a language of its own; consequently, the primary consumer of the software is a programmer writing Pig Latin scripts. A successful language requires an ecosystem of tools such as a debugger, editor, linter, and development environment, so in the future we may see Pig harboring these tool chains. Hive, on the other hand, thanks to its SQL-like language, can integrate with existing business intelligence tools through its ODBC/JDBC interface, making it accessible to business users. Consequently, the primary consumers of the software are other tools that interact with Hive programmatically, and the focus in Hive is on smooth interfaces for easier integration with other tools.

Another difference between the two systems is how they tackle physical and execution optimizations. While Hive tries to determine the applicability of such optimizations itself, Pig assumes the programmer knows best and offers a variety of optimizations from which the programmer can pick and tell Pig.


Dan Maltbie / Annai Systems
Chris Wilks / UCSC

UC Santa Cruz is under contract with NIH and the National Cancer Institute to construct and operate CGHub, a nation-scale library and user portal for genomic data. This contract covers growth of the library to 5 petabytes. The three NCI programs (TCGA, TARGET, and CGCI) that feed into the library currently submit about 20 terabytes of new data each month. We will review our handling of large (tens of GBs) files and metadata, and our performance experience with a fast, novel transfer mechanism designed specifically for moving large files across "long fat" networks, based on the Genome Network Operating System, or GNOS(tm), technology provided by Annai Systems, Inc. We will also explain how we constructed the system to meet FISMA security level requirements.


Paul Pedersen / 10gen

This talk describes the application of MongoDB at the CERN CMS experiment. The CMS experiment uses a general-purpose detector to investigate a wide range of physics, including the search for the Higgs boson, extra dimensions, and particles that could make up dark matter. MongoDB is used at CMS as a data aggregation system that includes mapping, caching, merging, and analytics.
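
As a hypothetical example of the kind of aggregation such a layer performs (the collection and field names below are invented, not the CMS schema), a pymongo pipeline might total events per site:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    datasets = client["cms_agg"]["datasets"]

    # Total number of events per site, most loaded sites first.
    pipeline = [
        {"$match": {"status": "valid"}},
        {"$group": {"_id": "$site", "events": {"$sum": "$nevents"}}},
        {"$sort": {"events": -1}},
    ]
    for doc in datasets.aggregate(pipeline):
        print(doc["_id"], doc["events"])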


Surendra Byna / Lawrence Berkeley National Laboratory

Petascale plasma physics simulations have recently entered the regime of simulating trillions of particles. These unprecedented simulations generate massive amounts of data, posing significant challenges in storage, analysis, and visualization. In this talk and poster, we will present parallel I/O, analysis, and visualization results from a VPIC trillion-particle simulation running on 120,000 cores of the NERSC Cray XE6 system “Hopper”, which produces ~32 TB of data for a single timestep. We will explain the successful application of H5Part, a particle data extension of parallel HDF5, for writing the dataset at a significant fraction of the system's peak I/O rate. To enable efficient analysis, we developed a hybrid-parallel FastQuery to index and query data using multi-core CPUs on distributed-memory hardware. We show good scalability results for the FastQuery implementation using up to 10,000 cores. Finally, we will present the application of this indexing/query-driven approach to facilitate the first-ever analysis and visualization of the trillion-particle dataset. The newly developed query-based visualization techniques enabled scientists to gain insights into the relationship between the structure of the magnetic field and energetic particles, and to investigate the agyrotropic distribution of particles near the magnetic reconnection hot spot in a 3D trillion-particle dataset.
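
Below is a small serial sketch of an H5Part-style layout (one group per time step, one 1-D dataset per particle field, as we understand the convention) written with h5py; the production runs described above use parallel HDF5 across 120,000 cores, which this toy example does not attempt to reproduce, and the field names and sizes are illustrative.

    import h5py
    import numpy as np

    n = 1_000_000                       # toy particle count (the real runs store ~1e12)
    with h5py.File("particles.h5part", "w") as f:
        step = f.create_group("Step#0")
        for field in ("x", "y", "z", "ux", "uy", "uz", "energy"):
            step.create_dataset(field, data=np.random.rand(n).astype(np.float32))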


Michael Fan and Rushan Chen / Zynga

Zynga's analytics started with core concepts like open access and massive scale. Zynga gambled on a newer product at the time called Vertica for its warehouse. Vertica uses MPP compressed column-store technology. We chose Vertica because it offered ANSI SQL support, ease of integration with common reporting and ETL systems, a common SQL skillset among employees, fast loading, data compression, and fast query speeds. In order for this approach to work, Zynga centralized the data infrastructure team, business intelligence database team, and analysts as one organization. The control of data flow was pushed out to an API tier so that semi-structure could be applied to the data as it was logged, thereby skipping much of the pre-processing required to add structure. Data is logged in a semi-structured state so that schema changes aren't generally required as tracking needs change. The central group also maintains the data models in a way that allows network- and platform-level questions to be asked easily and quickly. Instead of each group hiring its own analysts, analysts are embedded throughout the company but report into a central group. This was key to spreading effective use and knowledge of the analytical systems. The analyst group also funnels insights back to the rest of the company as a whole. Today, Zynga loads over 70 billion rows a day in real time into its Vertica-based system, and over 80 billion rows a day into its custom streaming event tier. These systems form the backbone of other analytical systems like an experiment A/B testing platform and real-time data services for games. Today, Zynga's big data systems and an analytically driven culture are viewed as a core piece of its competitive advantage.
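
As a generic illustration (not Zynga's code; event names and fields are invented), logging an event in semi-structured form at an API tier might look like this, so that new tracking attributes need no downstream schema change:

    import json
    import sys
    import time

    def log_event(stream, event_type, **attrs):
        # One JSON object per line; extra attributes ride along without schema changes.
        record = {"ts": time.time(), "type": event_type, **attrs}
        stream.write(json.dumps(record) + "\n")

    log_event(sys.stdout, "level_complete", game="toy_game", level=3, coins=120)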


Hiroyuki Makino / NTT Software Innovation Center

Jubatus is a large-scale, distributed, real-time analysis and machine learning framework with the objective of continuous, high-speed, deep analysis of high-volume data. Jubatus achieves continuous, high-speed processing of high-volume data by dividing the large volume of data among multiple servers and processing it sequentially and in parallel. Deep analysis requires the use of sophisticated statistical processing and machine learning, and implementation in a distributed environment requires a framework allowing multiple servers to share intermediate results. Such sharing requires frequent communication between servers, and this can become a bottleneck to overall performance if a suitable communication method is not devised.

There are many similarities between Hadoop/Mahout and Jubatus: both are scalable and run on commodity hardware. However, Hadoop is not equipped with sophisticated machine learning algorithms, since most such algorithms do not fit its MapReduce paradigm. Though Apache Mahout is also a Hadoop-based machine learning platform, online processing of data streams is still out of its scope.

Accordingly, Jubatus not only ensures the real-timeliness and accuracy of data analysis, but also increases robustness by exchanging intermediate results among multiple servers in a loose manner, thereby reducing the communication overhead between servers. Jubatus uses a loose model sharing architecture for efficient training and sharing of machine learning models by defining three fundamental operations: Update, Mix, and Analyze, analogous to the Map and Reduce operations in Hadoop.
Our development team combines the latest advances in online machine learning, distributed computing, and randomized algorithms to provide efficient machine learning features in Jubatus. Jubatus currently supports classification, regression, graph mining, and recommendation.
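
The sketch below is our own illustration, not Jubatus internals, of the loose model sharing idea: each server Updates its local linear-model weights on its share of the stream, and a periodic Mix averages the weights so servers stay roughly consistent without per-example communication.

    def mix(local_models):
        """Average per-feature weights across servers (toy Mix operation)."""
        keys = set().union(*(m.keys() for m in local_models))
        return {k: sum(m.get(k, 0.0) for m in local_models) / len(local_models)
                for k in keys}

    server_a = {"w_price": 0.8, "w_clicks": 0.1}   # weights after local Updates
    server_b = {"w_price": 0.6, "w_views": 0.2}
    shared = mix([server_a, server_b])
    print(shared)    # each server continues Updating from this shared model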

We will demonstrate an example Jubatus application and show an architecture that performs real-time SNS data analysis, including real-time filtering, fuzzy search, relevancy ranking, and company reputation categorization.

