XLDB - Extremely Large Databases

Accepted Poster-only Abstracts

Poster authors: please see Poster Preparation Guidelines.

Tue-10: Fail-Proofing Hadoop Clusters with Automatic Service Failover
Michael Dalton / Zettaset

With the increased use of Hadoop comes an increased need for critical safeguards, especially in the financial industry, where any data loss could mean millions in penalties. What happens when part of a Hadoop cluster goes down? How do Hadoop-based solutions for the financial industry cope with NameNode failures?

We will share the failover issues we have encountered and best practices for continuous health monitoring suited to the financial industry. In addition, we will cover ZooKeeper-based failover for the NameNode and other single-point-of-failure (SPOF) services (e.g., JobTracker, Oozie, Kerberos).
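
As a rough illustration of the ZooKeeper-based approach, the sketch below (Python with the kazoo client) shows how a standby host might contend in a leader election and promote itself when the active node's session disappears. The ensemble address, znode path, node identifier, and promote/health hooks are hypothetical, not the actual Zettaset implementation.

    # Minimal sketch of ZooKeeper-based failover using the kazoo client.
    # Ensemble address, znode path, node id, and promote/health hooks are
    # illustrative only.
    import time
    from kazoo.client import KazooClient

    ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"       # assumed ZooKeeper ensemble
    ELECTION_PATH = "/cluster/namenode-election"  # hypothetical znode path

    def namenode_is_healthy():
        # Placeholder health check (e.g., RPC ping, JMX metrics).
        return True

    def promote_to_active():
        # Placeholder: bring the local NameNode into active mode here.
        print("this node is now the active NameNode")
        # Hold leadership while healthy; returning releases it so another
        # standby can win the election.
        while namenode_is_healthy():
            time.sleep(5)

    zk = KazooClient(hosts=ZK_HOSTS)
    zk.start()

    # kazoo's Election recipe is built on ephemeral sequential znodes, so a
    # crashed active node loses leadership automatically when its session ends.
    election = zk.Election(ELECTION_PATH, identifier="namenode-host-1")
    election.run(promote_to_active)   # blocks until elected, then calls the hook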


Tue-11: Data is Dead -- Without What-If Models
Pat Selinger / IBM

In the beginning, there were data transactions and simple reports. Then came the relational model, SQL, and high-performance relational DBMSs to run transactions and generate simple reports in an elegant manner. This primeval form of descriptive analytics was enhanced with OLAP, data mining, and other business-intelligence technologies as enterprises realized that there was valuable information to be extracted from transactional data. Since then, data technologies have been created to handle semi-structured data, unstructured text, web-based data, semantic data, uncertain data, and streaming data at scales approaching the Exabyte range. Data management functionality has expanded to include simple programming within the data storage system, various sorts of statistical analyses, and even more recently, machine learning techniques. Now we are in an era of XLDB and must extend and re-invent technologies to dramatically expand the range of data we can store, query, manage, protect, and analyze.

While the data community can be justifiably proud of what it has accomplished so far, the authors have come to a set of realizations, leading us to the conclusion that our focus on data is much too narrow and must be expanded dramatically. These realizations further complicate the challenges XLDB creates:
Realization #1. Data centrism is wrong because data is dead, a record of the past.
Realization #2. Data alone, even with very powerful descriptive analytics, tells us about the world as it is and was, but cannot tell us much about the world as it might be.
Realization #3. The database community needs to put what-if models and data on equal footing, developing systems that use both data and models to make sense of rich, real-world complexity and to support real-world decision-making.

This model-and-data orientation requires significant extensions of many traditional database technologies, such as data integration, query optimization and processing, and collaborative analytics. In this short presentation, we argue that data without what-if modeling may be the database community's past, but data with what-if modeling must be its future. We will briefly describe a research prototype that is aiming to do just that.


Tue-12: Backup and Restore Strategies for a Giant
Ruben Gaspar / CERN

At 80 TB of data today, the accelerator logging database (ACCLOG), which runs as an Oracle cluster database, will grow at a pace of 100 TB per year, reaching about 300 TB in two years' time. In this poster we show the different strategies we use to back up and restore this database, giving a short overview of its importance for the LHC experiment together with the technological evolution behind it (snapshots, a 10 Gbps IBM Tivoli server, storage and server evolution, high availability, NFS, etc.).
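
As a back-of-envelope illustration of why the backup strategy matters at this scale, the Python sketch below uses the figures quoted above (roughly 80 TB today, about 100 TB/year growth, a 10 Gbps link to the Tivoli backup server) to estimate how long a naive full backup would take; the streaming-efficiency factor is an assumption for illustration only.

    # Back-of-envelope estimate based on the figures in the abstract:
    # ~80 TB today, growing ~100 TB/year, backed up over a 10 Gbps link.
    # The 70% effective-throughput factor is an assumption.
    current_tb = 80.0
    growth_tb_per_year = 100.0
    link_gbps = 10.0
    efficiency = 0.7  # assumed fraction of line rate actually achieved

    def full_backup_days(size_tb):
        size_bits = size_tb * 1e12 * 8
        seconds = size_bits / (link_gbps * 1e9 * efficiency)
        return seconds / 86400

    for years in (0, 1, 2):
        size = current_tb + years * growth_tb_per_year
        print(f"year +{years}: {size:.0f} TB, naive full backup ~ "
              f"{full_backup_days(size):.1f} days")
    # The growing window is one reason incremental backups and storage
    # snapshots are preferred over repeated full backups at this scale.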


Tue-13: Implementing the SSDB Benchmark on SciDB
DIFALLAH Djellel Eddine / UNIFR

In this poster, we describe how we implemented the SSDB benchmark on the latest version of SciDB. SSDB is a new science-oriented benchmark modeled around an astronomy use-case with different levels of difficulty, mostly bound to the data size in order to test the scalability of the target systems. The input data is a collection of raw images randomly generated in a large space. The benchmark has a pre-processing phase, called "Cooking", that extracts notable features in the images, and a workload composed of nine different queries that manipulate both the extracted features and the raw imagery. The queries in SSDB reproduce common operations in science applications: spatial queries, aggregates, data processing, and geometry operations. We have implemented the processing phases and queries using the latest version of the SciDB Scientific Data Management System (http://www.scidb.org/). In this poster, we describe the various operators and User Defined Functions (UDFs) we used to implement the complete benchmark. We also report on the performance of our implementation running the different variants of the benchmark on a cluster of commodity machines.
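
To give a flavor of the workload, the Python sketch below composes a few AFL-style query strings of the general kind used in such an implementation. The array names, attribute names, and coordinates are hypothetical, and the operators shown (subarray, aggregate, regrid, cross_join) are representative SciDB operators rather than the exact benchmark queries.

    # Illustrative AFL-style queries of the kind used for SSDB on SciDB.
    # Array names, attributes, and coordinates are hypothetical; the operators
    # are representative SciDB operators, not the exact benchmark code.
    RAW = "raw_images"        # assumed 2-D array: <pixel:int32>[x, y]
    OBS = "observations"      # assumed array of "cooked" features

    queries = {
        # Spatial selection: cut a rectangular tile out of the raw imagery.
        "spatial_slice": f"subarray({RAW}, 1000, 1000, 1999, 1999)",

        # Aggregate: average pixel value along one dimension.
        "aggregate": f"aggregate({RAW}, avg(pixel), x)",

        # Regrid: down-sample the imagery into 100x100 superpixels.
        "regrid": f"regrid({RAW}, 100, 100, avg(pixel))",

        # Relate cooked features back to the raw pixels they came from.
        "cross_check": f"cross_join({OBS}, {RAW})",
    }

    for name, afl in queries.items():
        print(f"{name:>14}: {afl}")
        # In actual runs such strings would be submitted through SciDB's
        # query interface (e.g., the iquery client or a language binding).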


Tue-14: Big Data Challenges in Application Performance Management
Tilmann Rabl / Univ of Toronto

See the talk abstract of the same title, "Big Data Challenges in Application Performance Management."


Tue-15: Large-Scale Databases in Gaia
Pilar de Teodoro / ESA

The ESA Gaia satellite will fly for five years, harvesting more than 100 TB of astronomical data. The data will be transmitted to Earth, where it will be exploited, with the total quantity of data reaching around 1 PB. The resulting accurate astrometric, spectroscopic, and photometric data will be the state of the art for years to come. A final public catalogue will be published.

Sophisticated data processing is necessary to distill this information, which is stored in different systems and databases. This poster presents how these extremely large databases are used by Gaia.


Wed-10: From High-throughput Single Molecule Measurement Data to Actionable Bioinformatics Knowledge
Jason Chin / Pacific Biosciences

It is paramount to utilize a modern paradigm for processing large amounts of data in the development of new technologies. The challenge of processing large amounts of data efficiently is no longer unique to large data warehouses or large scientific labs involved in big-science projects. It has become common for small labs and small companies to need to process data generated by high-throughput scientific instrumentation efficiently in order to meet their research goals. I will go over how a small start-up developing a high-throughput DNA sequencer solves its own large-data problem: compressing highly parallel real-time observations of chemical reactions at the single-molecule level into DNA basecalls. Moreover, I will discuss the challenges of processing the real-time kinetic signals from the underlying chemical reactions to reveal novel information about the bio-molecules. Finally, I will comment on the importance of adopting advances in informatics to properly interpret large-scale, high-dimensional data sets, enabling success in life-sciences research and leading to new discoveries.


Wed-11: Federal Market Information Technology in the Post Flash Crash Era: Roles for Supercomputing
John Wu / LBL

This brief presentation describes the collaborative work between active traders, regulators, economists, and supercomputing researchers to replicate and extend investigations of the Flash Crash and other market anomalies in a National Laboratory HPC environment. This work suggests that supercomputing tools and methods will be valuable to market regulators in achieving the goal of market safety, stability, and security. Research results using high frequency data and analytics are described, and directions for future development are discussed.

Currently the key mechanism for preventing catastrophic market action is the circuit breaker. We believe a more graduated approach, similar to the “yellow light” approach in motorsports to slow down traffic, might be a better way to achieve the same goal. To enable this, we study a number of indicators that could foresee hazards in market conditions and explore options to confirm such predictions. This is a preliminary step toward a full-fledged early-warning system for unusual market conditions.
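
As a hedged sketch of the kind of indicator we mean, the Python fragment below flags a "yellow light" when short-horizon realized volatility of a price series exceeds a multiple of its longer-horizon baseline. The window lengths and threshold are illustrative assumptions, not the indicators used in the study.

    # Illustrative "yellow light" hazard indicator: flag when short-horizon
    # realized volatility jumps well above its longer-horizon baseline.
    # Window sizes and the threshold multiple are assumptions for illustration.
    import math

    def realized_vol(prices):
        rets = [math.log(b / a) for a, b in zip(prices, prices[1:])]
        mean = sum(rets) / len(rets)
        return math.sqrt(sum((r - mean) ** 2 for r in rets) / len(rets))

    def yellow_light(prices, short_win=30, long_win=600, threshold=3.0):
        """True if the last `short_win` ticks are more than `threshold` times
        as volatile as the trailing `long_win` baseline."""
        if len(prices) < long_win + 1:
            return False
        short_vol = realized_vol(prices[-short_win - 1:])
        long_vol = realized_vol(prices[-long_win - 1:])
        return long_vol > 0 and short_vol > threshold * long_vol

    # Example with a synthetic series that turns turbulent at the very end:
    calm = [100 + 0.01 * i for i in range(600)]
    turbulent = [106, 99, 107, 98] * 8
    print(yellow_light(calm + turbulent))   # True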


Wed-12: Genome Annotation with Full-text Biomedical Research Articles
Maximilian Haeussler / CBSE, UC Santa Cruz

Medical and biological research articles are a complex and very heterogeneous source of information, even for human readers. In addition, they are locked behind paywalls guarded by dozens of publishers. Making the knowledge in these publications more accessible to biomedical researchers is a challenge for big-data projects like the human genome and the various cancer sequencing efforts, both hosted on our servers at UCSC. Existing NLP-based text-mining software is usually too slow to process big datasets, as it was developed for short abstracts or only a selected set of articles.

To remedy this situation, we are building a database of published biomedical fulltext articles and supplementary information and are developing algorithms that are fast enough to run on all content. In collaboration with Elsevier, we have already assembled around two million of the ten million indexed biomedical research articles. The data are analysed with our algorithms, which are targeted towards automatic genome annotation and extract DNA and protein sequences, gene identifiers, and species names from the text. DNA and protein sequences found in the text are mapped to sequenced genomes with approximate (fuzzy) string matching using a Hadoop-like infrastructure on a 1000-CPU cluster, then presented on our successful genome browser interface (http://genome.ucsc.edu). We are looking for collaborations and contacts to extend the number of documents and to continue working on more information-extraction algorithms.
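
As a simplified illustration of the sequence-extraction step, the Python sketch below pulls candidate DNA-like strings out of article text with a regular expression before they would be handed to the fuzzy-matching stage. The minimum length and the exact pattern are assumptions, not the production rules.

    # Simplified sketch of the DNA-extraction step: pull candidate nucleotide
    # strings out of article text before fuzzy-matching them against genomes.
    # The 17-bp minimum length and the tolerated separators are assumptions.
    import re

    DNA_PATTERN = re.compile(r"[ACGTacgt][ACGTacgt\s\-]{15,}[ACGTacgt]")

    def extract_dna_candidates(text, min_len=17):
        candidates = []
        for match in DNA_PATTERN.finditer(text):
            # Strip whitespace/hyphens introduced by line wrapping in the PDF.
            seq = re.sub(r"[\s\-]", "", match.group(0)).upper()
            if len(seq) >= min_len:
                candidates.append(seq)
        return candidates

    sample = "The primer GATTACAGATTACAGATTACA was used for amplification."
    print(extract_dna_candidates(sample))   # ['GATTACAGATTACAGATTACA']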

We believe that this unique data resource will allow big-data researchers in biology and medical genetics to spend less time manually querying internet search engines with keywords like gene names and instead use our software to connect their genomic results directly to the URLs of the relevant literature.


Wed-13: Modeling Dengue Fever: An extra large data challenge
Kun Hu / IBM

In recent decades, dengue has become a major international public health concern. Dengue Fever (DF) is now endemic in more than 100 countries and impacts more than one-third (around 2.5 billion) of the world's population.

Dengue is a vector-borne disease transmitted by the bite of an infectious Aedes mosquito. There are four serotypes of dengue virus (DENV): DENV 1-4. Infection with one serotype affords life-long immunity to that serotype but only short-lived partial immunity to the other serotypes. The partial immunity increases the risk of developing dengue hemorrhagic fever (DHF) and dengue shock syndrome (DSS) upon re-infection, mainly because of the effect of antibody-dependent enhancement (ADE). This makes the viral infection much more acute and makes it a leading cause of hospitalization and death. No specific vaccine or treatment is available to protect humans from infection.

We report a computational model of DF based on the open-source Spatio-Temporal Epidemiological Modeler, STEM (http://www.eclipse.org/stem/). The STEM framework is designed to support modeling of complex zoonotic diseases. STEM includes reference data for the entire planet, including human population, transportation, and ten years of historic climate data. A compartment model (based on differential equations) for this complex multi-serotype virus requires modeling 51 independent compartments or states, and it must also be built upon human (host) and Aedes mosquito (vector) population models, as sketched in simplified form below. The Aedes vector-capacity model is based on earth-science data including temperature, rainfall, vegetation, and elevation. A better understanding of how the dynamics of DF dissemination depend on vector capacity may suggest strategies for the control of the disease.
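
To make the compartment-model idea concrete, the Python sketch below integrates a drastically simplified host-vector system (SIR host, SI vector) with a forward-Euler step. The real STEM dengue model has 51 compartments across four serotypes; all rate parameters and initial populations here are placeholders, not calibrated values.

    # Drastically simplified host-vector compartment model (SIR host, SI vector)
    # integrated with forward Euler. The real STEM dengue model uses 51
    # compartments over four serotypes; all rates below are placeholder values.
    beta_hv = 0.30   # assumed vector-to-host transmission rate
    beta_vh = 0.25   # assumed host-to-vector transmission rate
    gamma   = 0.10   # assumed host recovery rate
    mu_v    = 0.05   # assumed vector birth/death rate

    def step(state, dt):
        S_h, I_h, R_h, S_v, I_v = state
        N_h, N_v = S_h + I_h + R_h, S_v + I_v
        dS_h = -beta_hv * S_h * I_v / N_v
        dI_h =  beta_hv * S_h * I_v / N_v - gamma * I_h
        dR_h =  gamma * I_h
        dS_v =  mu_v * N_v - beta_vh * S_v * I_h / N_h - mu_v * S_v
        dI_v =  beta_vh * S_v * I_h / N_h - mu_v * I_v
        return [x + dt * dx
                for x, dx in zip(state, (dS_h, dI_h, dR_h, dS_v, dI_v))]

    state = [9999.0, 1.0, 0.0, 50000.0, 10.0]   # placeholder initial populations
    for day in range(365):
        state = step(state, dt=1.0)
    print("infectious hosts after one year: %.1f" % state[1])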

The inherent complexity of a DF model also makes it an excellent test case for exploring the extra-large data requirements involved in global pandemic simulation. The input or denominator data requirements are moderate: today the model is built using historic earth-science data that is only available monthly, so the raw input data amounts to only ~4 GB/year. Once calibrated, integrating the DF model with a global climate model would consume <~600 GB of reference data for a 10-year simulation. Managing the data produced by the simulation is more of a challenge. A single year of a global-scale dengue simulation produces ~50 GB of raw data per run (sampling daily), and exploring the space of ~80 model parameters using STEM's Nelder-Mead simplex algorithm could require thousands of simulations and several years of simulated time.
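
These figures translate into a simple back-of-envelope estimate, sketched below; the assumed run count and simulated years per run are only indicative of the parameter-search workload described, not measured values.

    # Back-of-envelope storage estimate from the figures in the abstract:
    # ~50 GB of raw output per simulated year per run, sampled daily.
    # Run count and simulated years per run are assumptions about the
    # Nelder-Mead parameter search.
    gb_per_sim_year = 50
    runs = 2000              # assumed "thousands of simulations"
    sim_years_per_run = 5    # assumed "several years of simulated time"

    total_tb = gb_per_sim_year * runs * sim_years_per_run / 1000
    print(f"raw simulation output: ~{total_tb:.0f} TB")   # ~500 TB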

Today STEM, an Eclipse Rich Client application, runs multi-threaded. For global-scale simulation it makes sense to distribute the STEM computation. In this case, a distributed storage and database scheme would be a logical candidate for consuming simulation data and serving the Eclipse data plugins. Distributed STEM would benefit from a distributed data-storage paradigm, such as BigTable/HBase, combined with an analytics infrastructure built on Hadoop. As an Eclipse application, STEM is designed to support collaboration, and users would also greatly benefit from a scalable scheme for storing the outputs and from tooling that makes this data repeatedly available, not just for analysis within STEM but to third parties as well.
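
A minimal sketch of what the distributed-storage idea might look like is shown below (Python with the happybase HBase client); the table name, column family, and row-key scheme are hypothetical, not part of STEM.

    # Minimal sketch of writing STEM-style simulation output to HBase via
    # happybase. Table name, column family, and row-key layout are hypothetical.
    import happybase

    connection = happybase.Connection("hbase-master")     # assumed HBase host
    table = connection.table("stem_simulation_output")    # assumed table

    def write_daily_state(run_id, day, region, compartments):
        """Store one region's compartment values for one simulated day."""
        row_key = f"{run_id}:{day:05d}:{region}".encode()
        table.put(row_key, {
            f"c:{name}".encode(): str(value).encode()
            for name, value in compartments.items()
        })

    # Example: one day of output for one region of a hypothetical run.
    write_daily_state("run-0001", 42, "BR-SP",
                      {"S_h": 1.2e7, "I_h": 3.4e4, "R_h": 5.6e5})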


Wed-14: GLADE: A Scalable Framework for Efficient Analytics
Florin Rusu / UC Merced

In this paper we introduce GLADE, a scalable distributed framework for large-scale data analytics. GLADE consists of a simple user interface for defining Generalized Linear Aggregates (GLAs), the fundamental abstraction at the core of GLADE, and a distributed runtime environment that executes GLAs using extensive parallelism.

GLAs are derived from User-Defined Aggregates (UDAs), a relational database extension that allows the user to add specialized aggregates to be executed inside the query processor. GLAs extend the UDA interface with methods to Serialize/Deserialize the aggregate state, as required for distributed computation. In a significant departure from UDAs, which can be invoked only through SQL, GLAs give the user direct access to the state of the aggregate, thus allowing the computation of significantly more complex aggregate functions.
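
A minimal sketch of what a GLA might look like to the user is shown below, in Python for readability; the actual GLADE interface is not reproduced here. The average-style aggregate and the method names other than Serialize/Deserialize follow the usual UDA conventions and are assumptions for illustration.

    # Sketch of a GLA computing a mean, following the usual UDA pattern
    # (init/accumulate/merge/terminate) plus the serialize/deserialize methods
    # the abstract describes for distributed execution. Names other than
    # serialize/deserialize are conventional UDA names, not the GLADE API.
    import pickle

    class AverageGLA:
        def __init__(self):                 # empty aggregate state
            self.count = 0
            self.total = 0.0

        def accumulate(self, value):        # called once per input tuple
            self.count += 1
            self.total += value

        def merge(self, other):             # combine partial states from workers
            self.count += other.count
            self.total += other.total

        def terminate(self):                # produce the final result
            return self.total / self.count if self.count else None

        def serialize(self):                # ship partial state between nodes
            return pickle.dumps((self.count, self.total))

        @classmethod
        def deserialize(cls, blob):
            gla = cls()
            gla.count, gla.total = pickle.loads(blob)
            return gla

    # Local partials merged the way a distributed runtime would do it:
    a, b = AverageGLA(), AverageGLA()
    for v in (1.0, 2.0, 3.0): a.accumulate(v)
    for v in (4.0, 5.0):      b.accumulate(v)
    a.merge(AverageGLA.deserialize(b.serialize()))
    print(a.terminate())   # 3.0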

The GLADE runtime is an execution engine optimized for GLA computation. It takes the user-defined GLA code, compiles it inside the engine, and executes it right near the data, taking advantage of parallelism both inside a single machine and across a cluster of computers. This results in remarkable performance for a variety of analytical tasks -- billions of tuples are processed in seconds using only a dozen commodity nodes -- and linear scaleup.


Wed-15: The CGHub Cancer Genomics Data Repository
Brian Craft / UCSC

The UCSC Cancer Genomics Project is deploying CGHub: a multi-petabyte-scale, highly secure repository of cancer genomic data that provides reliable, very high-speed upload and download services for genomic sequencing and analysis centers. The repository hosts data for The Cancer Genome Atlas (TCGA), which is producing a comprehensive characterization of the genomic changes in all major human cancers. This data set will be of incalculable value in cancer research and ultimately will lead to a revolution in cancer care. TCGA's data characterization centers currently produce approximately 10 terabytes of data each month. Over the next two years this output is likely to increase tenfold or more, resulting in a data set exceeding a petabyte in size and growing fivefold, to 5 petabytes, for the finished 10,000-patient TCGA project over the next 2.5 years.
Users access the repository over a high-performance 10 gigabit/second transport network. A novel client, GeneTorrent, is provided; it scales to this bandwidth while providing end-to-end security in transit. Users also have lower-speed access via the public Internet and Internet2. To maximize transfer efficiency, pre-upload data validation tools are provided to sequencing centers, and a workflow monitoring system ensures that data is sanity-checked prior to release.

While high-level TCGA data is open access, potentially personally identifiable data is only available via controlled access, with approval by the National Institutes of Health.
