Increasingly, users of big-data analytics systems are experts in some domain area (e.g., domain sciences) but are not necessarily experts in large-scale data processing. Today's big-data analytics systems, however, remain difficult to use even by experts. In the CQMS project, we address important barriers to making big-data analytics more seamless. In this talk, we will present a quick overview of some of the key challenges and our recent findings in the area of big-data analytics usability. First, we will present the SnipSuggest system, which is an autocompletion tool for SQL queries, aimed to ease the query formulation process. It provides on-the-go, context-aware assistance in the query composition process. Second, we will present our Sample-based Interactive Querying (SIQ) system, which automatically selects a ‘good’ small sample of an underlying large input database to facilitate the process of debugging SQL queries over massive datasets. Finally, we will describe, PerfXplain, a tool for explaining the performance of a MapReduce job running on a shared-nothing cluster. PerfXplain aims at helping users understand the performance they are getting from their analytics engine.
XLDB implementations require both extreme performance and extreme
cost effectiveness. While it may be feasible for small scale data
problems to store content 100% in-memory, this approach is cost
prohibitive for XLDB deployment. This talk discusses the design
techniques and underlying technologies necessary to simultaneously
optimize for lowest cost per I/O as well as lowest cost per terabyte
of data stored. Measured performance results will be reported.
Data-intensive scientific research, such as in astronomy, calls
for functional enhancements to DBMS technologies. To brige the gap
between the needs of the Data-Intensive Research fields and the
current DBMS technologies, we have introduced SciQL (pronounced as
‘cycle’), a novel SQL-based array query language for scientific
applications with both tables and arrays as first class citizens.
SciQL lowers the entrance fee of adopting relational DBMS in
scientific domains, because it includes functionality often only
found in mathematics software packages.
In this talk, we demonstrate SciQL using examples taken from a
real-life astronomical data processing system, e.g. the Transient
Key Project (TKP) of the LOFAR radio telescope. In particular, we
show how a a full Stokes spectral light-curve database of all
detected sources can be constructed, by cross-correlation over
multiple catalogues can be constructed. By exposing the properties
of array data to the relational DBMS, SciQL also opens up many
opportunities to enhance the data mining possibilities for real-time
transient and variability searches.
The EarthServer initiative, comprised by a transatlantic
consortium, has set out to establish comprehensive ad-hoc analytics
support for massive spatio-temporal Earth science data. Examples
include 1D timeseries, 2D satellite imagery, 3D x/y/t satellite
image timeseries and x/y/z exploration data, and 4D x/y/z/t climate
and ocean data. Based on a declarative, standardized array query
language with extenive server-side optimization, six Lighthouse
Applications covering all Earth Sciences plus Planetary geology are
established to allow distributed cross-dimensional data fusion on
600+ TB data archives.
We present the project strategy, the array DBMS platform, and the
technical challenges addressed and results achieved. Further, we
discuss how this technology has been applied in the Space and Life
sciences as well.
Scientists generate and organize their data in ways that are
unruly, poorly-defined and constantly changing - a riot of
structures and formats of varying organization, quality, and
correctness. Despite this, we should expect to explore results
interactively, exploring features, trends, and anomalies in ways
that support intuitive engagement with the data.
Taking inspiration from search engine spiders that must handle
similar complexity within the World Wide Web, we are developing
araXne: a tool to catalog scientific data, which scans, indexes,
queries, and retrieves it in storage- and schema-agnostic ways,
adapting to the way scientists store their data instead of
dictating formats or structure.
Cataloging is necessary to deal with increased data size and
complexity, time and energy constraints forcing reductions in
data movement, and specialized scientific data formats (HDF,
NetCDF) limiting query capabilities. We advocate for a dynamic
external data catalog, which separates data storage
functionality from indexing and retrieval.
Just as cards in a library’s card catalog offer multiple
representations (title, author, subject) of underlying
documents, metadata extracted by araXne present different
perspectives on units of scientific data, such as tables, time
series, images, animations, and meshes. Each perspective
contains metadata and aggregated or summarized data appropriate
to its type. Once perspectives have been extracted in-situ from
a collection of data, they are sent to a separate host for
indexing and storage. Users and tools interact with this
""virtual card catalog,"" scanning and querying collections of
araXne perspectives to generate user interfaces, preview
underlying data, and specify analysis inputs.
Currently, we are using a Hadoop + HBase stack to store araXne
perspectives, exploring how we can best utilize the sparse table
structure for storage and high-throughput map-reduce
functionality for indexing. Further, we are researching domain
specific languages for higher-level querying, grouping, and
subsetting of perspectives.
There has been a substantial increase in the amount of
digital data collected over the last several decades. Together
with this data comes the desire to transform it, through
analysis, into useful information. Traditional analytic tools
used for analysis fail when the size of the data grows large,
and general-purpose database systems used to manage large
collections of data cannot perform the sophisticated analyses
required. The analyst who wants to work with large datasets
faces a dilemma when it comes to tool choice.
Our work resolves this tension with a hybrid strategy that
integrates R and SciDB. R is a powerful data analysis tool, and
SciDB is an array database management system. Our integration
focuses on the automated movement of data between the two
systems, in an effort to improve performance. Contributions
include semantic mappings between the two languages, a
cost-based interaction model, a start-to-finish system
implementation, and test results quantifying the performance of
the hybrid approach.
The Workshop on Big Data Benchmarking (WBDB2012), held on May
8-9, 2012 in San Jose, CA, marked the first of a series of workshops
aimed at developing industry-standard Big Data benchmarks. The
workshop was attended by 60 invitees representing 45 different
organizations both from industry and academia. Attendees were chosen
based on their experience and expertise in one or more areas of Big
Data, database systems, performance benchmarking, and big data
applications. They agreed that there was both a pressing need and an
opportunity for defining benchmarks to capture the end-to-end
aspects of big data applications. In presentations and working
sessions the workshop participants laid the foundation for future
workshops by agreeing on key aspects of future Big Data benchmarks,
e.g. the need to include metrics for performance as well as
price/performance; the need to consider several costs, including
total system cost, setup cost, and energy costs; and the need for an
end-to-end benchmark serving the purposes of competitive marketing
as well as product improvement.
As a result of this meeting an informal “Big Data benchmarking
community” has been formed, hosted by the San Diego Supercomputer
Center, UC San Diego. Biweekly phone conferences are being used to
keep this group engaged and to share information among members.
Several benchmarking activities were started by members of this
community. These range from simple examples of end-to-end benchmarks
to complex components setups. Within our lightning talk and poster,
we will introduce the Big Data Benchmarking Community and give an
overview of current developments in Big Data benchmarking.
The next Big Data Benchmarking workshops are currently planned for
December 17-18, 2012 in Pune, India and June 2013 in Xian, China.
With the advent of large scale sequencing projects, such as the 1000 Genomes Project, it has become increasingly difficult to store this data in traditional relational form. Genotype data may be visualized as a sparse matrix, unbounded in each direction, where the rows represent the variations and the columns are the individuals. Each element therefore contains the genotype information for an individual at a given variation location. In addition to the genotype, there are also quality scores, read depths, likelihoods, and other measures of the genotype call. There are approximately 39.6 million known variation sites on the human genome, and the 1000 Genomes project has sampled over 1200 people, with more to come. This is potentially trillions of data points, which will need to be accessed either by variation, by sample, or some combination thereof. In addition to storing and accessing the data, there is a need to compute aggregate values, such as frequency counts, over all of the data. This needs to be done for predetermined population subsets, as well as those subsets chosen on the fly. SciDB is a new open source array-based database, which they refer to as Data Management and Analytics Software System (DMAS). A prototype genotype server, developed using SciDB, was successfully used as the back-end for the genotype portion of the 1000 Genome Browser. Currently, the Genotype Archive is moving beyond the prototype stage, and is intended to be the central storage location of the underlying genotype data for all variation resources at the National Center for Biotechnology Information (NCBI).
OpenTSDB is a distributed, scalable Time Series Database (TSDB)
that runs on top of HBase. It was written to address a common need:
store, index and serve metrics collected from computer systems
(network gear, operating systems, applications) at a large scale,
and make this data easily accessible and graphable.
This lightening talk will explain how OpenTSDB is used and deployed
at StumbleUpon, and why it's so easy for us to add well over a
billion data point per day in a database that currently stores
around half a trillion data points. We'll show how we separated the
read queries from the writes, and how load balancing and caching
works in practice.
Although most discussion today revolve around Big Data, Social
Media and unstructured data most of the world is still using and
will continue to use structured analytics - Enterprise Data
Warehouse and Operational Data Stores. The Analytic Cloud is a
discussion on incorporating all these components by having an
intelligent query router that would provide several functions. First
to determine where queries should be routed (that is which platform
is best suited to recieve and process the query). Second to provide
real-time datastream analysis for urgent analytics. Finally to
provide an in-memory database for redundant queries (for speed).
Queries would enter the analytic cloud and either be responded to by
the intelligent query router or direct the query to the EDW, ODS,
Big Data analytic engine or unstructured analysis (Hadoop/Autonomy).
Submitters would not know or care which backend system responds to
the query.
Sequence alignment is becoming a cardinal operation of many
bioinformatics applications such as DNA or protein sequencing,
multiple sequence alignment and genome assembly. With increasing
computation power, those applications are increasingly I/O limited
due to the large size of biological databases (e.g. GenBank, EMBL,
DDBJ and UniProt), complexity, and poor performance of existing
storage technologies such as disk and flash. Emerging fast,
byte-addressable non-volatile memory (NVM) technologies such as
phase-change memory (PCM) and spin-torque transfer memory (STTM) are
very promising and are approaching DRAM-like performance with lower
power consumption and higher density as process technology scales.
However, the existing solid-state drive (SSD) architecture optimizes
for flash characteristics and is not adequate to exploit the full
potential of NVMs due to architectural and I/O interface (e.g. PCIe,
SATA) limitations. To eliminate I/O and better utilize NVM
technologies, we propose a novel active storage architecture called
Maverick. It extends conventional SSD architecture by offloading
bioinformatics operations to exploit the low latency and high
bandwidth of NVMs and significantly reduce the data traffic between
the host and the storage. We provide a runtime library that enables
the programmer to dispatch computations to the active storage system
without dealing with the complications of the underlying
architecture and communication management. This library ensures
thread-safety and a consistent view of data both from application
running on the host and in the storage device.
We built a prototype of Maverick on the BEE3 FPGA platform and use
UniProt Knowledgebase Release 2012/02 protein database sequences for
our evaluation. We compare the performance and energy efficiency
with the state-of-art PCIe-attached PCM-based SSD. Maverick
outperforms PCM-based SSD by up to 4 X on performance and up to 6.5
X on energy efficiency. We also compare Maverick's performance with
existing storage arrays such as RAID-disk and flash-based SSD, and
achieve more than 19 X and 6.8 X speedup as compared to the
RAID-disk and the flash-based SSD respectively. This huge
improvement can be attributed to the reduction of data transfer from
the storage to the host, the elimination of I/O and the efficient
data processing within the active storage system.
The new and emerging field of Systems Medicine, an application of
Systems Biology approaches to biomedical problems in the clinical
setting, leverages complex computational tools and high dimensional
large data sets to derive personalized assessments of disease risk.
The Georgetown Clinical & Omics Development Engine (G-CODE) is a
generic and flexible web-based platform that serves to enable basic,
translational, and clinical research activities by integrating
patient characteristics and clinical outcome data with a variety of
large-scale high throughput research data in a unified environment
to enable Systems Medicine. Through G-CODE, we hope to help enable
the creation of new disease-centric portals and the widespread use
of biomedical informatics tools by basic, clinical, and
translational researchers for large ‘omics’ data sets.
This infrastructure was first deployed in the form of the Georgetown
Database of Cancer (G-DOC®; http://gdoc.georgetown.edu), which
includes a broad collection of bioinformatics and systems biology
tools for analysis and visualization of four major “omics” types:
DNA, mRNA, microRNA, and metabolites. One obvious area for expansion
of the G-CODE/G-DOC platform infrastructure is for the support of
next generation sequencing (NGS), a highly enabling and
transformative emerging technology for the biomedical sciences.
Nonetheless, effectively utilizing this data is impeded by the
substantial handling, manipulation, and analysis requirements that
it entails. We have concluded that these are gaps that cloud
computing is well positioned to fill, as this type of infrastructure
permits rapid scaling with low input costs. As such, the Georgetown
University team is utilizing the Amazon EC2 Cloud to process whole
exome, whole genome, RNA-seq, and ChIP-seq NGS data. The processed
NGS data is being integrated within web portal to ensure that it can
be analyzed in the full context of other “omics” data. Through
technology re-use, the G-CODE infrastructure will accelerate
progress for a variety of ongoing programs in need of integrative
multi-omics analysis, and advance our opportunities to practice
effective systems medicine in the near future.
In this talk, I will compare and contrast Pig and Hive. From
overall architeture perspective both systems exhibits remarkable
similiarity. Both generate map-reduce jobs from a query written
in higher level domain specific language. However, two systems
have made number of different choices.
First of which is language. Pig exposes PigLatin, a dataflow
language while Hive exposes more traditional SQL like
declarative language HiveQL. This choice of language has
resulted in adoption of two systems by different user
communities. Hive's user base is analysts who are primarily
interested in generating reports which summarizes data in
interesting ways while Pig has stuck chord with programmers who
are writing complex scripts to garner deep insights from data
typically employing machine learning techniques.
Since these projects are open source this difference in
communities is driving the two projects in different directions.
First of it is feature set. Pig is moving towards a language of
its own. Consequently, primary consumer of software is
programmer writing pig-latin scripts. A successful language
requires an ecosystem of tools like debugger, editor, linter,
development environment etc. So, in future we may see Pig
harboring these tool chains. Hive on the other hand due to its
SQL like language can integrate with existing business
intelligence tools through its odbc/jdbc interface thereby
making it accessible to business users. Consequently, primary
consumers of software are other tools which interact with Hive
programatically. As a result focus in Hive is smooth interfaces
for easier integration with other tools.
Other difference between two system is how they tackle physical
and execution optimizations. While Hive tries to determine the
applicability of such optimizations itself, Pig assumes
programmer knows best and offers variety of optimizations from
which programmer can pick and tell Pig.
UC Santa Cruz is under contract with NIH and the National Cancer Institute to construct and operate CGHub, a nation-scale library and user portal for genomic data. This contract covers growth of the library to 5 Petabytes. The three NCI programs (TGGA, TARGET, and CGCI) that feed into the library currently submit about 20 terabytes of new data each month. We will review our handling of large (10's of GBs) files, metadata, and our performance experience with a fast novel transfer mechanism designed specifically for transferring large files across "long fat" networks based on the Genome Network Operating System, or GNOS(tm), technology provided by Annai Systems, Inc. We will also explain how we constructed the system to meet FISMA security level requirements.
This talk describes the application of mongodb at the CERN CMS
experiment. The CMS experiment uses a general-purpose detector to
investigate a wide range of physics, including the search for the
Higgs boson, extra dimensions, and particles that could make up dark
matter. Mongodb is used at CMS as a data aggregation system that
includes mapping, caching, merging, and analytics.
Petascale plasma physics simulations have recently entered the regime of simulating trillions of particles. These unprecedented simulations generate massive amounts of data, posing significant challenges in storage, analysis, and visualization. In this talk and poster, we will present parallel I/O, analysis, and visualization results from a VPIC trillion particle simulation running on 120,000 cores of the NERSC Cray XE6 system “Hopper”, which produces ~32TB of data for a single timestep. We will explain the successful application of H5Part, a particle data extension of parallel HDF5, for writing the dataset at a significant fraction of system peak I/O rates. To enable efficient analysis, we develop hybrid parallel FastQuery to index and query data using multi-core CPUs on distributed memory hardware. We show good scalability results for the FastQuery implementation using up to 10,000 cores. Finally, we will present application of this indexing/query-driven approach to facilitate the first-ever analysis and visualization of the trillion-particle dataset. The newly developed query-based visualization techniques enabled scientists to gain insights into the relationship between the structure of the magnetic field and energetic particles and to investigate the agyrotropic distribution of particles near the magnetic reconnection hot-spot in a 3D trillion particle dataset.
Zynga's analytics started with core concepts like open access and massive scale. Zynga gambled with a newer product at the time called Vertica for its warehouse. Vertica uses MPP compressed column store technology. We choose Vertica because it offered ANSI SQL support, ease of integration with common reporting and ETL systems, common SQL skillset among employees, fast loading, data compression and fast query speeds. In order for this approach to work, Zynga centralized the data infrastructure team, business intelligence database team and analysts as one organization. The control of data flow was pushed out to an API tier so that semi-structure could be applied to the data as it was logged, thereby skipping much of the pre-processing required to add structure. Data is logged in a semi-structured state so that schema changes aren't required in general as tracking needs change. The central group also maintains the data models in a way that network and platform level questions can easily and quickly be asked. Instead of each group hiring its own analysts, analysts are embedded throughout the company but report into a central group. This was key to spreading effective use and knowledge of the analytical systems. The analyst group also funnels insights back to the rest of the company as a whole. Today, Zynga loads over 70 billion rows a day in real-time into its Vertica based system, and over 80 billion rows a day to its custom streaming event tier. These systems create the backbone of other analytical systems like an experiment A/B testing platform and real-time data services for games. Today, Zynga's big data systems and an analytically driven culture are viewed as a core piece of its competitive advantage.
Jubatus is a large-scale, distributed, real-time analysis and
machine learning framework with the objective of continuous,
high-speed, deep analysis of high-volume data. Jubatus achieves
continuous, high-speed processing of high-volume data by dividing
the large-volume of data among multiple servers and processing it
sequentially and in parallel. Deep analysis requires use of
sophisticated statistical processing and machine learning, and
implementation in a distributed environment requires a framework
allowing multiple servers to share intermediate results. Such
sharing requires frequent communication between servers, and this
can become a bottleneck to overall performance if a suitable
communication method is not devised.
There are many similar points between Hadoop/Mahout and Jubatus.
These are scalable and run on commodity hardware. However, Hadoop is
not equipped with sophisticated machine learning algorithms since
most of the algorithms do not fit its MapReduce paradigm. Though
Apache Mahout is also a Hadoop-based machine learning platform,
online processing of data streams is still out of the scope.
Accordingly, Jubatus not only ensures the real-timeliness and
accuracy of data analysis, but also increases robustness by
exchanging the intermediate results among multiple servers in a
loose manner and thereby reducing the communication overhead between
servers. Jubatus uses a loose model sharing architecture for
efficient training and sharing of machine learning models by
defining three fundamental operations; Update, Mix, and Analyze, in
a similar way with the Map and Reduce operations in Hadoop.
Our development team achieves combining the latest advances in
online machine learning, distributed computing, and randomized
algorithms to provide efficient machine learning features for
Jubatus. Now Jubatus supports classification, regression, graph
mining, and recommendation.
We demonstrate Jubatus example application and show the architecture
that performs real-time SNS data analysis such as real-time
filtering, fuzzy search, relevancy ranking and company reputation
categorization.