We present TimeArr, a new storage manager for an array database. TimeArr enables users to create versions of their arrays and to explore these versions through time-travel operations, which include the selection of a specific version of a (sub)-array and the more general extraction of a (sub)-array's provenance, in the form of a series of (sub)-array versions. To speed up array exploration, TimeArr also introduces two techniques. The first is the notion of approximate time travel, with three types of operations: approximate version selection, approximate provenance, and version metadata browsing. For these operations, users can tune the degree of approximation they tolerate and thus trade off accuracy and performance in a principled manner. The second is to lazily create short connections, called skip links, between the same (sub)-arrays at different versions with similar data patterns, to speed up the selection of a specific version. TimeArr includes array-specific storage techniques to efficiently support both precise and approximate operations.
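The abstract leaves the storage layout unspecified; as a minimal illustrative sketch of the skip-link idea (the `Version` chain, field names, and hop counting below are our own assumptions, not TimeArr's actual design), a version-selection walk can prefer a skip link whenever following it does not overshoot the target version:

```python
class Version:
    """One version of a (sub)-array tile in a backward version chain.

    This is a toy model: `prev` links to the immediately preceding
    version, while `skip` (when present) links to a farther-back
    version, analogous to TimeArr's skip links between versions
    with similar data patterns.
    """
    def __init__(self, vid, payload, prev=None, skip=None):
        self.vid = vid        # version number
        self.payload = payload
        self.prev = prev      # link to version vid - 1
        self.skip = skip      # optional link to an earlier version

def select_version(head, target_vid):
    """Walk back from the newest version to `target_vid`.

    Follow a skip link when it does not jump past the target;
    otherwise fall back to the one-step predecessor link.
    Returns the node and the number of hops taken.
    """
    node, hops = head, 0
    while node.vid != target_vid:
        if node.skip is not None and node.skip.vid >= target_vid:
            node = node.skip
        else:
            node = node.prev
        hops += 1
    return node, hops
```

With skip links from version 9 to 4 and from 4 to 0, selecting version 0 from the head takes 2 hops instead of 9, which is the speed-up the abstract describes.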
Data generated through scientific experiments and measurements
has unique characteristics that require special processing
techniques. The data is extremely large, with terabytes of new
data generated daily. Its structure is diverse: not only
relational tables, but also multi-dimensional arrays with varying
degrees of sparsity. Analyzing scientific data requires complex
processing that can seldom be expressed in a declarative language
such as SQL.
In this paper, we introduce EXTASCID, a parallel system targeted at
efficient analysis of large-scale scientific data. EXTASCID provides
a flexible storage manager that supports natively both relational
data as well as high-dimensional arrays. Any processing task is
specified using a generic interface that extends the standard UDA
mechanism. Given a task and the corresponding input data, EXTASCID
manages the execution in a highly parallel environment, with data
partitioned across multiple processing nodes, in a manner completely
transparent to the user. We evaluate the expressiveness of the task
specification interface and the performance of our system by
implementing the SS-DB benchmark, a standard benchmark for
scientific data analysis. In terms of expressiveness, EXTASCID
allows the scientific user to focus on the core logic of the
processing, executed entirely by the system, near the data, without
the need to handle complicated operations at the application level.
Our experimental results on the SS-DB benchmark confirm the
remarkable performance of EXTASCID when handling large amounts of
data.
!CHAOS is a new control system and data acquisition framework
under development at INFN in Italy.
It provides, with a high level of abstraction, all the services
needed for controlling and managing a large scientific
infrastructure.
!CHAOS includes a history data service (HDS) for continuous
acquisition of operating data pushed by device controllers.
The history engine (HE) implements the abstraction layer over the
underlying storage technology, along with the logic for indexing and
querying data. The engine's drivers are designed for specific
purposes (indexing, caching, storing), linking the standard !CHAOS
calls to the chosen third-party software. This HE thus allows the
data flows of the different !CHAOS services to be routed to
independent channels, improving the overall efficiency of the
storage system.
We have seen data systems grow to tackle big-data use cases, both by extending mature data systems and by building new ones. What lessons have we learned, and which lessons are we not applying? Why even talk about transactions?
First, a trend in storage systems away from tape and POSIX toward alternative systems that favor scale, latency tolerance, and the integration of storage and backup processes; this will include device and system trends. Second, how the features of the OpenStack Swift and Huawei storage systems meet these trends. Finally, a little about the testing that has been completed at CERN openlab.
With the advances in sensing technologies, massive amounts of
scientific data are being collected and analyzed. LSST generates
20 TB per night, the LHC generates 15 PB per year, and
earth-observation satellite sensors related to GEO-Grid [1] have
generated more than 170 TB so far. ASTER sensor data are
high-resolution image data provided through GEO-Grid. ASTER
provides three kinds of data; one of them is the thermal infrared
radiometer data, denoted TIR for short. Originally, TIR data were
used to discover mineral resources and to observe the status of the
atmosphere and of the ground and sea surfaces. Each image is roughly
800 × 800 pixels, and each pixel covers a 90 m × 90 m area. Beyond
these uses, we believe that TIR can be used to detect hot spots all
over the world. By “hot spots” we mean places that generate heat,
such as steel plants or fires. This paper proposes a threshold-based
method and a statistics-based method to discover hot spots from TIR
data. We implement our methods with SciDB [2]. All procedures in our
methods are implemented using array manipulation operators natively
supported by SciDB, together with UDFs. Experiments that detect
steel plants show that the statistics-based method outperforms the
threshold-based method in terms of recall.
[1] GeoGRID. http://www.geogrid.org
[2] SciDB. http://www.scidb.org
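The two detection rules can be illustrated outside SciDB as well. The sketch below, in Python with NumPy rather than SciDB's array operators, shows one plausible reading of the threshold-based and statistics-based methods; the `k = 3.0` cutoff and the toy scene are our own assumptions, not the paper's actual parameters:

```python
import numpy as np

def threshold_hot_spots(tir, threshold):
    """Threshold-based rule: flag pixels whose TIR value exceeds a
    fixed cutoff. `tir` is a 2D array (e.g. ~800x800 per scene)."""
    return tir > threshold

def statistical_hot_spots(tir, k=3.0):
    """Statistics-based rule: flag pixels that deviate from the
    scene's own statistics, i.e. value > mean + k * std."""
    mu, sigma = tir.mean(), tir.std()
    return tir > mu + k * sigma

# Toy scene: uniform background with two artificial hot pixels.
scene = np.full((8, 8), 300.0)
scene[2, 3] = 380.0
scene[5, 6] = 360.0
print(np.argwhere(statistical_hot_spots(scene)))  # indices of flagged pixels
```

Both rules reduce to elementwise comparisons and whole-array aggregates, which is why they map naturally onto SciDB's native array operators.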
Current practice in R focuses mostly on multicore desktop platforms and on small clusters with a manager/workers parallel programming style. Much of this focus comes from R's emphasis on interactive data analysis, which is also one of R's great strengths and a source of its popularity. HPC, on the other hand, focuses on batch processing and the Single Program Multiple Data (SPMD) parallel programming style. We take note of this and conduct research to enable SPMD-style programming in R for truly big data distributed across a large platform. Our initial results show that we can use R syntax that is largely unchanged from serial programming, with distributed parallel support provided through classes and methods. The R developer can access and control parallel readers, the data distribution, and data redistribution. A portion of the distributed functionality is provided by tightly coupling R with PBLAS, ScaLAPACK, and the BLACS. In many cases, we are able to perform sequences of big-data matrix operations without data redistribution. Our intent is to provide tools to the R developer community so that the number of SPMD analysis codes in R for truly big data is greatly expanded. Currently, we have three packages pending release: pbdMPI, a more intuitive and faster R interface to MPI; pbdSLAP, connecting scalable linear algebra libraries to R; and pbdDMAC, classes and methods for distributed matrix algebra in R. Our plans include a client/server interface to preserve the interactive nature of R, where the server runs SPMD analysis codes on big data and the client operates with reduced data representations.
SSDB is a science-oriented benchmark that has been proposed and discussed at XLDB on several occasions. This poster will review the SSDB data and queries, and will present recent performance results of SSDB running on different platforms, ranging from a classical relational engine to a column store, an array processing layer, and a native array store. We will discuss the various implementations and results in detail, and highlight the strengths and weaknesses of each platform running the different queries in terms of ease of implementation, scalability, and efficiency.
Scientific user facilities such as the Advanced Light Source and the Joint Genome Institute are facing a deluge of data. Traditional data management and analysis methods are insufficient due to the data volumes, data sizes, rates of data arrival, and required response times. Big data technologies such as NoSQL databases and Hadoop have been used widely for internet data, but very little is known about their use for scientific data storage and analysis. In this poster, we will present early results from running both synthetic benchmarks and applications on a number of different big data tools. Specifically, we will present: a) a performance and feature comparison of different data stores for scientific data; b) the use of MongoDB, a document store, as a metadata store for light source data; and c) a performance evaluation of a NoSQL store as a backend data store for Hadoop analysis jobs.
Previous approaches to computing pairwise mutual information for the purpose of reconstructing gene regulatory networks have been reported. The most time-consuming and computationally expensive step in these approaches is computing the mutual information (MI) matrix. We develop a new method to compute the MI matrix. Our method differs from previous ones in using Massively Parallel Programming (MPP) based on MapReduce (MR). Furthermore, our MR method differs from other similar MR methods in that we require only one MR step to compute the MI matrix (as opposed to two). Benchmarking our approach against previous ones on the same data, we reduce the computational time from several hours to under one hour. Our method was tested on a cluster of 10 c1.medium Elastic MapReduce (EMR) nodes using Amazon Web Services (AWS). Implementing MR with AWS represents a cost-effective option for bioinformaticians to pursue analyses that were previously intractable due to “big O” complexity. We anticipate similar methods will enable diverse and mature transcriptome analyses.
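For context, the step being parallelized can be sketched serially. Histogram-based MI estimation is one common choice for expression data; the bin count and estimator below are our own assumptions, not necessarily those used in the paper's MR implementation:

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram-based MI estimate between two expression profiles:
    sum over cells of p(x,y) * log( p(x,y) / (p(x) p(y)) )."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells (0 * log 0 = 0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def mi_matrix(expr, bins=8):
    """Full symmetric MI matrix over all gene pairs. This O(g^2)
    pairwise loop is the step the abstract folds into a single
    MapReduce pass."""
    g = expr.shape[0]
    M = np.zeros((g, g))
    for i in range(g):
        for j in range(i, g):
            M[i, j] = M[j, i] = mutual_information(expr[i], expr[j], bins)
    return M
```

In the MR setting, each map task would emit MI values for a block of gene pairs and the reduce step would assemble the matrix, which is how the pairwise loop distributes across nodes.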
Big Data is defined as “a term applied to data sets whose size is
beyond the ability of commonly used software tools to capture,
manage, and process the data within a tolerable elapsed time”. In
the world of operational systems, the ‘tolerable’
Time-To-Interaction (TTI) has gone from 8 seconds in 2006 to 2
seconds in 2009 to a mere 250 milliseconds in 2011 - the blink of an
eye. The tolerable window for data access is 150 milliseconds or
less.
Meanwhile, value creation is moving closer to real-time and closer
to the transaction itself: purchasing decisions are moving to
deferred platforms where buyers compare offers in real-time;
logistics are optimized in real-time based on wholesale inventory
movement; content is rendered based on instant feedback of user
behavior and its social context. Not all big data is schemaless:
operational big data is complex, structured (“normalized”), and
resides in heavily utilized transactional systems.
Practitioners need to augment their popular tools of Hadoop and
NoSQL. David McFarlane, CEO at Akiban Technologies, will present a
new innovation for real-time processing of operational big data
without the expense and latency of an Extract-Transform-Load (ETL)
process; without any change to code, architecture or infrastructure;
and without the need for new programming paradigms or data
scientists to wield them.
An important source of big data is deep predictive simulation
models used in e-science and, increasingly, in guiding investment
policy decisions around highly complex issues such as population
health and safety. IBM’s Splash project provides a platform for
combining heterogeneous simulation models and datasets across a
broad range of disciplines to capture the behavior of complex
systems of systems. Splash loosely couples models via data exchange,
where each submodel, e.g., an agent-based simulation model of
regional traffic patterns or of disease epidemics, often produces or
expects massive time series having huge numbers of time points and
many data values per time point. If the time-series output of one
"source" submodel is used as input for another "target" submodel
and the time granularity of the source is coarser than that of the
target, an interpolation operation is required. Cubic-spline
interpolation is the most widely-used method because of its
smoothness properties. Scalable methods are needed for such data
transformations, because the amount of data produced by a simulation
program can be massive when simulating large, complex systems,
especially when the time dimension is modeled at high resolution.
We examine whether such large-scale data transformations can be
executed in MapReduce environments rather than relying, as is
traditional, on supercomputers to interconnect HPC applications. We
demonstrate that we can efficiently perform cubic spline
interpolation over a massive time series in a MapReduce environment
using novel algorithms based on adapting the distributed stochastic
gradient descent (DSGD) method of Gemulla et al., originally
developed for low-rank matrix factorization. Specifically, we adapt
DSGD to calculate the spline coefficients that appear in the
interpolation formula by solving a massive tridiagonal system of
linear equations. Our techniques are potentially applicable both to
spline interpolation and to the parallel solution of tridiagonal
linear systems in other massively parallel data-analysis environments.
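For reference, the tridiagonal system underlying natural cubic-spline interpolation can be written down and solved serially with the Thomas algorithm. The sketch below is this serial baseline under the assumption of uniformly spaced knots; it is not the DSGD adaptation described in the abstract:

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system by the Thomas algorithm.
    a = sub-diagonal (a[0] unused), b = main diagonal,
    c = super-diagonal (c[-1] unused), d = right-hand side."""
    n = len(b)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                     # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):            # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def natural_spline_second_derivs(y, h):
    """Second derivatives m_i of the natural cubic spline through
    equally spaced samples y (knot spacing h). The interior points
    satisfy the tridiagonal system
        m_{i-1} + 4 m_i + m_{i+1} = 6 (y_{i-1} - 2 y_i + y_{i+1}) / h^2,
    with the natural boundary conditions m_0 = m_{n-1} = 0."""
    n = len(y)
    k = n - 2                                  # interior unknowns
    a, b, c = np.ones(k), np.full(k, 4.0), np.ones(k)
    d = 6.0 * (y[:-2] - 2.0 * y[1:-1] + y[2:]) / h**2
    m = np.zeros(n)
    m[1:-1] = thomas(a, b, c, d)
    return m
```

The sequential dependence between the forward and backward sweeps is exactly what makes this solver hard to parallelize directly, motivating the reformulation as a factorization-style problem amenable to DSGD.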