XLDB - Extremely Large Databases

Accepted Poster-only Abstracts

Emad Soroush & Magdalena Balazinska / University of Washington

We present TimeArr, a new storage manager for an array database. TimeArr enables users to create versions of their arrays and to explore these versions through time travel operations that include the selection of a specific version of a (sub)array and a more general extraction of a (sub)array's provenance, in the form of a series of (sub)array versions. To speed up array exploration, TimeArr also introduces two techniques. The first is the notion of approximate time travel, with three types of operations: approximate version selection, approximate provenance, and version metadata browsing. For these operations, users can tune the degree of approximation tolerable and thus trade off accuracy and performance in a principled manner. The second is to lazily create short connections, called skip links, between the same (sub)arrays at different versions with similar data patterns, to speed up the selection of a specific version. TimeArr includes array-specific storage techniques to efficiently support both precise and approximate operations.
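The skip-link idea can be illustrated with a small sketch (hypothetical Python, not TimeArr's actual storage layout): each version stores only a delta over its predecessor, and per-cell links point back to the previous version that wrote the cell, so version selection jumps over versions in which the cell did not change.

```python
class VersionedArray:
    """Toy delta-based versioned array with per-cell skip links.

    Hypothetical sketch only: version 0 is the base array, each later
    version is a delta (cell -> value), and links[v][cell] names the
    previous version that wrote the same cell.
    """

    def __init__(self, base):
        self.base = list(base)
        self.deltas = [{}]   # deltas[v]: cells written at version v
        self.links = [{}]    # links[v][cell]: prior version writing cell
        self.latest = {}     # cell -> newest version that wrote it

    def commit(self, delta):
        v = len(self.deltas)
        self.deltas.append(dict(delta))
        self.links.append({c: self.latest.get(c, 0) for c in delta})
        for c in delta:
            self.latest[c] = v
        return v

    def get(self, version, cell):
        """Value of `cell` as of `version`, following skip links backward."""
        v = self.latest.get(cell, 0)
        while v > version:
            v = self.links[v][cell]   # skip over unchanged versions
        return self.deltas[v][cell] if v else self.base[cell]
```

A lookup never scans versions that kept the cell unchanged, which is the intuition behind using skip links to accelerate version selection.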

Yu Cheng and Florin Rusu / University of California at Merced

Data generated through scientific experiments and measurements has unique characteristics that require special processing techniques. The size of the data is extremely large, with terabytes of new data generated daily. The structure of the data is diverse: often not relational, but rather multi-dimensional arrays with different degrees of sparsity. Analyzing scientific data requires complex processing that can seldom be expressed in a declarative language such as SQL.

In this paper, we introduce EXTASCID, a parallel system targeted at efficient analysis of large-scale scientific data. EXTASCID provides a flexible storage manager that natively supports both relational data and high-dimensional arrays. Any processing task is specified through a generic interface that extends the standard user-defined aggregate (UDA) mechanism. Given a task and the corresponding input data, EXTASCID manages the execution in a highly parallel environment, with data partitioned across multiple processing nodes, completely transparently to the user. We evaluate the expressiveness of the task specification interface and the performance of our system by implementing SS-DB, a standard benchmark for scientific data analysis. In terms of expressiveness, EXTASCID allows the scientific user to focus on the core logic of the processing, which is executed entirely by the system, near the data, without the need to handle complicated operations at the application level. Our experimental results on the SS-DB benchmark confirm the strong performance of EXTASCID when handling large amounts of data.
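The standard UDA mechanism that EXTASCID extends can be sketched as follows (hypothetical names, not EXTASCID's actual API): an aggregate is defined by init/accumulate/merge/terminate, and the system runs accumulate over each partition independently before merging the partial states.

```python
from functools import reduce

class UDA:
    """Generic user-defined aggregate: the classic four-method contract."""
    def init(self): raise NotImplementedError
    def accumulate(self, state, item): raise NotImplementedError
    def merge(self, s1, s2): raise NotImplementedError
    def terminate(self, state): raise NotImplementedError

class Mean(UDA):
    """(sum, count) states merge associatively, so partitions can be
    processed independently and combined in any order."""
    def init(self): return (0.0, 0)
    def accumulate(self, state, item): return (state[0] + item, state[1] + 1)
    def merge(self, s1, s2): return (s1[0] + s2[0], s1[1] + s2[1])
    def terminate(self, state): return state[0] / state[1]

def run(uda, partitions):
    # In a parallel engine each partition would live on a different
    # node; here we iterate serially just to show the dataflow.
    partials = [reduce(uda.accumulate, part, uda.init()) for part in partitions]
    return uda.terminate(reduce(uda.merge, partials, uda.init()))
```

Because merge is associative, the engine is free to partition data across nodes and combine partial states in any order, which is what makes the execution transparent to the user.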

Claudio Bisegni, Matteo Mara and Antonello Paoletti / INFN-LNF

!CHAOS is a new concept of control system and data acquisition framework under development at INFN in Italy.
It provides, with a high level of abstraction, all the services needed for controlling and managing a large scientific infrastructure.
!CHAOS includes a history data service (HDS) for continuous acquisition of operating data pushed by device controllers.
The history engine (HE) implements the abstraction layer for the underlying storage technology and the logic for indexing and querying data. The engine's drivers are designed to serve specific purposes (indexing, caching, storing), linking the standard !CHAOS calls to the chosen third-party software. This HE makes it possible to route the data flows of the different !CHAOS services to independent channels, improving the overall efficiency of the whole storage system.
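A minimal sketch of the engine/driver split (hypothetical Python, assuming nothing about the real !CHAOS API): the history engine exposes a single push call and routes each record to purpose-specific drivers, each of which could wrap a different third-party backend.

```python
class HistoryEngine:
    """Routes each pushed record to purpose-specific drivers
    (indexing, caching, storing) registered behind one interface."""

    def __init__(self):
        self.drivers = {}

    def register(self, purpose, driver):
        self.drivers[purpose] = driver

    def push(self, record):
        for driver in self.drivers.values():
            driver.handle(record)

class StoringDriver:
    """Stand-in for a persistent store: appends every record."""
    def __init__(self): self.log = []
    def handle(self, record): self.log.append(record)

class CachingDriver:
    """Stand-in for a cache: keeps only the latest record per device."""
    def __init__(self): self.latest = {}
    def handle(self, record): self.latest[record["device"]] = record

class IndexingDriver:
    """Stand-in for an index: timestamps grouped by device."""
    def __init__(self): self.by_device = {}
    def handle(self, record):
        self.by_device.setdefault(record["device"], []).append(record["ts"])
```

Swapping a driver implementation changes the backing technology without touching the engine, which is the point of the abstraction layer.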

Iran Hutchinson / InterSystems

We have seen data systems grow to tackle big data use cases, both by extending mature data systems and by building new ones. What lessons have we learned, and which lessons are we not applying? Why even talk about transactions?

James Hughes / Huawei

First, a trend in storage systems away from tape and POSIX toward alternative systems that favor scale, latency tolerance, and the integration of storage backup processes, including device and system trends. Second, how the features of the OpenStack Swift and Huawei storage systems meet these trends. Finally, a little about the testing that has been completed in CERN OpenLab.

Hideyuki Kawashima / University of Tsukuba

With the advances in sensing technologies, massive amounts of scientific data are being collected and analyzed. LSST generates 20 TB per night, the LHC generates 15 PB per year, and earth observation satellite sensor data related to GEO-Grid [1] have exceeded 170 TB so far. ASTER sensor data are high-resolution image data provided through GEO-Grid, and come in three kinds; one of them is thermal infrared radiometer (TIR) data. Originally, TIR data were used to discover mineral resources and to observe the state of the atmosphere and of the ground and sea surfaces. Each image is roughly 800 × 800 pixels, and each pixel covers a 90 m × 90 m area. Beyond these uses, we believe that TIR data can be used to detect hot spots all over the world. By “hot spots” we mean places that generate heat, such as steel plants or fires. This paper proposes a threshold-based method and a statistics-based method to discover hot spots in TIR data. We implement our methods in SciDB [2]. All procedures in our methods are implemented with array manipulation operators natively supported by SciDB and with UDFs. Experiments that detect steel plants show that the statistics-based method outperforms the threshold-based method in terms of recall.

[1] GEO-Grid. http://www.geogrid.org
[2] SciDB. http://www.scidb.org
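The two detection methods can be sketched in a few lines of plain Python on a toy grid (the paper's actual implementation uses SciDB array operators and UDFs): the threshold method flags pixels above a fixed value, while the statistics method flags pixels whose z-score against the whole image exceeds k.

```python
from statistics import mean, stdev

def threshold_hotspots(img, t):
    """Pixels whose value exceeds a fixed threshold t."""
    return {(i, j) for i, row in enumerate(img)
                   for j, v in enumerate(row) if v > t}

def zscore_hotspots(img, k=3.0):
    """Pixels more than k standard deviations above the image mean."""
    vals = [v for row in img for v in row]
    mu, sd = mean(vals), stdev(vals)
    return {(i, j) for i, row in enumerate(img)
                   for j, v in enumerate(row) if (v - mu) / sd > k}
```

The statistics-based variant adapts to each image's overall brightness, which is one plausible reason it could achieve better recall than a single global threshold.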

George Ostrouchov and Wei-Chen Chen / Oak Ridge National Laboratory, Drew Schmidt and Pragnesh Patel / University of Tennessee

Current practice in R focuses mostly on multicore desktop platforms and on small clusters with a manager/workers parallel programming style. Much of this focus comes from R's emphasis on interactive data analysis, which is one of R's great strengths and a source of its popularity. HPC, on the other hand, focuses on batch processing and the Single Program Multiple Data (SPMD) parallel programming style. We take note of this and conduct research to enable SPMD-style programming in R for truly big data distributed across a large platform. Our initial results show that we can use R syntax that is largely unchanged from serial programming, with distributed parallel support provided through classes and methods. The R developer can access and control parallel readers, the data distribution, and data redistribution. A portion of the distributed functionality is provided by tightly coupling R with PBLAS, ScaLAPACK, and the BLACS. In many cases, we are able to perform sequences of big-data matrix operations without data redistribution. Our intent is to provide tools to the R developer community so that the number of SPMD analysis codes in R for truly big data is greatly expanded. Currently, we have three packages that are pending release: pbdMPI, a more intuitive and faster R interface to MPI; pbdSLAP, connecting scalable linear algebra libraries to R; and pbdDMAT, classes and methods for distributed matrix algebra in R. Our plans include a client/server interface to preserve the interactive nature of R, where the server runs SPMD analysis codes on big data and the client operates with reduced data representations.
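The SPMD pattern itself is language-agnostic; the sketch below mimics it in Python, with threads standing in for MPI ranks (pbdMPI would use real MPI processes): every rank runs the same function on its own data shard, and an allreduce-style merge combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def rank_program(rank, shard):
    # Every rank executes this same code on its own piece of the data
    # (Single Program, Multiple Data); only `shard` differs per rank.
    return (sum(shard), len(shard))

def spmd_mean(data, nranks=4):
    """Global mean computed SPMD-style over nranks data shards."""
    shards = [data[r::nranks] for r in range(nranks)]   # cyclic split
    with ThreadPoolExecutor(max_workers=nranks) as pool:
        partials = list(pool.map(lambda r: rank_program(r, shards[r]),
                                 range(nranks)))
    total = sum(s for s, _ in partials)                 # "allreduce" step
    count = sum(n for _, n in partials)
    return total / count
```

The rank code never sees the global dataset, mirroring how SPMD analysis codes scale: each process holds only its local block and communicates reduced summaries.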

Djellel Eddine Difallah and Philippe Cudre-Mauroux / eXascale Infolab
Stratos Idreos and Ying Zhang / CWI

SSDB is a science-oriented benchmark that has been proposed and discussed at XLDB on several occasions. This poster will review the SSDB data and queries, and will present recent performance results of SSDB running on platforms ranging from a classical relational engine to a column store, an array processing layer, and a native array store. We will discuss the various implementations and results in detail, and highlight the strengths and weaknesses of each platform on the different queries in terms of ease of implementation, scalability, and efficiency.

Lavanya Ramakrishnan, Yushu Yao and Shane Canon / LBNL
Elif Dede / SUNY

Scientific user facilities such as the Advanced Light Source and the Joint Genome Institute are facing a deluge of data. Traditional data management and analysis methods are insufficient due to the data volumes, data sizes, rate of arrival of data, and required response times. Big data technologies such as NoSQL databases and Hadoop have been used widely for internet data, but very little is known about their use for scientific data storage and analysis. In this poster, we will present early results from running both synthetic benchmarks and applications on a number of different big data tools. Specifically, we will present (a) a performance and feature comparison of different data stores for scientific data, (b) the use of MongoDB, a document store, as a metadata store for light source data, and (c) a performance evaluation of using a NoSQL store as a backend data store for Hadoop analysis jobs.

Jee Vang and Yohan Lee / Booz Allen Hamilton

Previous approaches to computing pairwise mutual information for the purpose of reconstructing gene regulatory networks have been reported. The most time-consuming and computationally expensive step in these approaches is computing the mutual information (MI) matrix. We develop a new method to compute the MI matrix. Our method differs from previous ones by using a Massively Parallel Processing (MPP) approach based on MapReduce (MR). Furthermore, our MR method differs from other similar MR methods in that we require only one MR step to compute the MI matrix (as opposed to two). In benchmarking our approach against previous ones on the same data, we reduce the computational time from several hours to under one hour. Our method was tested on a cluster of 10 c1.medium Elastic MapReduce (EMR) nodes using Amazon Web Services (AWS). Implementing MR with AWS represents a cost-effective option for bioinformaticians to pursue analyses that were previously intractable due to “big O” complexity. We anticipate that similar methods will enable diverse and mature transcriptome analyses.
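The single-step idea can be sketched with a toy in-memory MapReduce (plain Python, not the authors' EMR implementation): the map phase emits one record per gene pair and sample, and a single reduce per pair builds the joint histogram and computes MI directly, so no second MR pass is needed.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import log2

def map_phase(expr):
    """expr: gene -> discretized expression vector. Emits one record
    per gene pair and sample: key (g1, g2), value (x, y)."""
    for (g1, xs), (g2, ys) in combinations(sorted(expr.items()), 2):
        for x, y in zip(xs, ys):
            yield (g1, g2), (x, y)

def reduce_phase(values):
    """One reduce per gene pair: joint and marginal histograms yield
    MI (in bits) in the same pass, so no second MR step is needed."""
    n = len(values)
    joint = Counter(values)
    px = Counter(x for x, _ in values)
    py = Counter(y for _, y in values)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def mi_matrix(expr):
    groups = defaultdict(list)                 # the shuffle step
    for key, value in map_phase(expr):
        groups[key].append(value)
    return {key: reduce_phase(vals) for key, vals in groups.items()}
```

Because the reducer for a pair sees all of that pair's samples, the marginals can be recovered from the same value stream as the joint distribution, which is what collapses the computation into one MR step.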

Ari Weil / Akiban Technologies

Big Data is defined as “a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time”. In the world of operational systems, the ‘tolerable’ Time-To-Interaction (TTI) has gone from 8 seconds in 2006 to 2 seconds in 2009 to a mere 250 milliseconds in 2011 - the blink of an eye. The tolerable window for data access is 150 milliseconds or less.

Meanwhile, value creation is moving closer to real-time and closer to the transaction itself: purchasing decisions are moving to deferred platforms where buyers compare offers in real-time; logistics are optimized in real-time based on wholesale inventory movement; content is rendered based on instant feedback of user behavior and its social context. Not all big data is schemaless: operational big data is complex, structured (“normalized”), and resides in heavily utilized transactional systems.

Practitioners need to augment popular tools such as Hadoop and NoSQL. David McFarlane, CEO at Akiban Technologies, will present a new innovation for real-time processing of operational big data without the expense and latency of an Extract-Transform-Load (ETL) process; without any change to code, architecture, or infrastructure; and without the need for new programming paradigms or data scientists to wield them.

Peter Haas and Yannis Sismanis / IBM Research - Almaden

An important source of big data is deep predictive simulation models used in e-science and, increasingly, in guiding investment policy decisions around highly complex issues such as population health and safety. IBM’s Splash project provides a platform for combining heterogeneous simulation models and datasets across a broad range of disciplines to capture the behavior of complex systems of systems. Splash loosely couples models via data exchange, where each submodel, e.g., an agent-based simulation model of regional traffic patterns or of disease epidemics, often produces or expects massive time series having huge numbers of time points and many data values per time point. If the time-series output of one “source” submodel is used as input for another “target” submodel and the time granularity of the source is coarser than that of the target, an interpolation operation is required. Cubic-spline interpolation is the most widely used method because of its smoothness properties. Scalable methods are needed for such data transformations, because the amount of data produced by a simulation program can be massive when simulating large, complex systems, especially when the time dimension is modeled at high resolution.

We examine whether such large-scale data transformations can be executed on MapReduce environments rather than the traditional reliance on supercomputers to interconnect HPC applications. We demonstrate that we can efficiently perform cubic spline interpolation over a massive time series in a MapReduce environment using novel algorithms based on adapting the distributed stochastic gradient descent (DSGD) method of Gemulla et al., originally developed for low-rank matrix factorization. Specifically, we adapt DSGD to calculate the spline coefficients that appear in the interpolation formula by solving a massive tridiagonal system of linear equations. Our techniques are potentially applicable to both spline interpolation and parallel solution of diagonal linear systems in other massively parallel data-analysis environments.
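For reference, the sequential baseline that such distributed methods replace is the Thomas algorithm: natural cubic splines on uniformly spaced knots reduce to a tridiagonal system in the spline's second derivatives (moments). A sketch under that uniform-spacing assumption (the paper's DSGD approach solves the same kind of system in parallel):

```python
def thomas(sub, diag, sup, rhs):
    """Solve a tridiagonal system in O(n): forward elimination, then
    back substitution. sub/sup have length n-1, diag/rhs length n."""
    n = len(diag)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = sup[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - sub[i - 1] * cp[i - 1]
        cp[i] = sup[i] / m if i < n - 1 else 0.0
        dp[i] = (rhs[i] - sub[i - 1] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def natural_spline_moments(y, h=1.0):
    """Second derivatives M[0..n] of the natural cubic spline through
    uniformly spaced points y: M[i-1] + 4 M[i] + M[i+1] =
    6 (y[i-1] - 2 y[i] + y[i+1]) / h^2, with M[0] = M[n] = 0."""
    n = len(y) - 1
    rhs = [6.0 * (y[i - 1] - 2.0 * y[i] + y[i + 1]) / (h * h)
           for i in range(1, n)]
    m = thomas([1.0] * (n - 2), [4.0] * (n - 1), [1.0] * (n - 2), rhs)
    return [0.0] + m + [0.0]
```

The recurrence in the Thomas algorithm is inherently sequential, which is exactly why solving the moment system at massive scale motivates reformulations like DSGD that admit data-parallel execution.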
