Biology is experiencing a deluge of sequencing data from a variety of sequencing technologies and biological sources. This data is quickly transforming biology into a data-intensive science -- a transformation that most biologists are not trained for. Integrating sequencing data into experimental biology projects requires new tools, approaches, and training. Our lab develops and applies Big Data approaches for robust, sensitive, and efficient analysis of sequencing data. We have developed several novel streaming and compression approaches that provide substantial leverage on the problem of scaling sequence data analysis, and we are also working on trans-disciplinary training and expectation management for biologists new to computation.
Bionimbus is an open-source, multi-petabyte cloud computing platform for managing, analyzing, archiving, and sharing large genomic datasets. Bionimbus hosts a variety of genomics datasets, including the 275+ TB 1000 Genomes dataset. In this talk, we discuss some of the lessons we learned operating Bionimbus over the past three years and some of the differences between science clouds, such as Bionimbus, and commercial clouds.
NASA deals with some of the largest and most complex data sets on the planet. When dealing with large amounts of data in a traditional IT environment, it can take several months and hundreds of hours of labor by several different people to procure, set up, configure, and maintain new IT infrastructure. Moreover, NASA must comply with a host of data security and privacy policies, which makes it challenging to find a collaborative environment in which its scientists and researchers can share data with outside partners and the public.
Hear how NASA, via an open-source cloud computing project, developed an alternative to the costly construction of additional data centers, provided a simplified avenue for sharing large, complex data sets with external partners and the public, and ultimately saved hundreds of staff hours, allowing its scientists to focus on mission-critical activities instead of IT infrastructure requirements.
Google BigQuery is a data analysis tool born from Google's internal Dremel technology. It enables developers to build applications that run ad-hoc, iterative analyses on billions of rows of data in seconds using a RESTful API. This talk will give an overview of BigQuery, briefly cover the underlying architecture, and demonstrate its power by running live queries against public datasets.
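To make the RESTful access concrete, here is a minimal sketch of the URL and JSON body for a synchronous query against the public `jobs.query` endpoint (REST API v2). The project id and SQL are placeholders, and actually sending the request would require an OAuth2 bearer token, which is omitted here.

```python
import json

def build_query_request(project_id, sql, timeout_ms=10000):
    """Build the URL and JSON body for a synchronous BigQuery
    jobs.query call (REST API v2). Authentication is omitted."""
    url = ("https://bigquery.googleapis.com/bigquery/v2/"
           f"projects/{project_id}/queries")
    body = {
        "query": sql,
        "useLegacySql": False,
        "timeoutMs": timeout_ms,
    }
    return url, json.dumps(body)

# Example: an ad-hoc aggregate over a public sample dataset.
url, body = build_query_request(
    "my-project",  # hypothetical project id
    "SELECT word, SUM(word_count) AS n "
    "FROM `bigquery-public-data.samples.shakespeare` "
    "GROUP BY word ORDER BY n DESC LIMIT 10",
)
```

POSTing that body to the returned URL yields the query results as JSON rows, which is what lets an application run ad-hoc analyses without managing any infrastructure.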
A big data project on the horizon? Here are some things to consider as you plan. Already living with big data? Here are some tips and techniques for dealing with the unique challenges that come with it. Based on several years of living above the trillion-record line.
Virginia Tech's Naturalistic Driving Studies use instrumented vehicles to study behavior related to transportation safety. The most notable of these, the SHRP2 NDS, covers over 3,000 vehicles and is the largest such study. This national study will gather over a petabyte of compressed video and over 150 terabytes of sensor data. The dataset is designed to be a cornerstone for safety researchers to analyze and mine over the next twenty years or more.
Large-scale naturalistic driving studies blend many data types and are a good example of scientific research data. Virginia Tech's Scientific Data Warehouse is an infrastructure blending cluster file systems and compute clusters with parallel database systems. We offer this design as a reference architecture for data-intensive science. We will discuss the infrastructure design of this data center in detail as well as the workflows and data structures used to maintain and process the data.
In today's highly data-driven economy, the amount of data logged by the CMS Experiment on the Large Hadron Collider at CERN (the European Organization for Nuclear Research) may not be exceptional. However, the organisation, dissemination, and subsequent analysis of this data entail many unique approaches. The main elements of this evolved strategy will be discussed, together with some indications of future directions.
This talk will cover data management aspects of the Shot Data Systems at the National Ignition Facility.
Efficiency and agility in extreme data analytics do not fit the standard processes that data warehousing has followed for many years. The larger and more complex the environment, the more important guiding principles and architecture are to building infrastructure that can accelerate future, not-yet-known business needs with faster time to market. A simple set of principles such as Simplicity, Time to Market, Agility, Designing for the Unknown, Analytics as a Service, and Pattern-based Development can guide successful implementation of an infrastructure and platform that will sustain the business far into the future. The presentation will show how these principles apply, with examples such as Data Engine Templates, Analytics as a Service, and reusable infrastructure components. Building productive, efficient processes and infrastructure gives your business agility and competitiveness in a world where the pace of knowledge acquisition defines competitive ability.
At Facebook, we use various types of databases and storage systems to satisfy the needs of different applications. The solutions built around these data stores share a common set of requirements: they have to be highly scalable, maintenance costs should be low, and they have to perform efficiently. We use a sharded MySQL+memcache solution to support real-time access to tens of petabytes of data, and we use TAO to provide consistency for this web-scale database across geographical distances. We use the Haystack datastore to store the 3 billion new photos we host every week. We use Apache Hadoop to mine intelligence from 100 petabytes of click logs and combine it with the power of Apache HBase to store all Facebook Messages.
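The sharded MySQL+memcache read path can be sketched roughly as below. This is a toy illustration, not Facebook's implementation: the hash choice, shard count, and cache policy are assumptions, with plain Python dicts standing in for the MySQL shards and memcache.

```python
import hashlib

N_SHARDS = 4
shards = [{} for _ in range(N_SHARDS)]   # stand-ins for MySQL shards
cache = {}                               # stand-in for memcache

def shard_for(user_id):
    # Stable hash, so a given user always routes to the same shard.
    h = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(h, 16) % N_SHARDS

def write_user(user_id, row):
    shards[shard_for(user_id)][user_id] = row
    cache.pop(user_id, None)             # invalidate cache on write

def read_user(user_id):
    # Cache-aside: try memcache first, fall back to the owning shard.
    if user_id in cache:
        return cache[user_id]
    row = shards[shard_for(user_id)].get(user_id)
    if row is not None:
        cache[user_id] = row             # populate cache on miss
    return row
```

The point of the pattern is that hot reads are absorbed by the cache tier while writes go to exactly one shard, which is what lets the fleet scale horizontally to tens of petabytes.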
This talk describes the reasons why each of these databases is appropriate for its workload and the design decisions and tradeoffs that were made while implementing these solutions. We touch upon the consistency, availability, and partition tolerance of each of these solutions, and upon the reasons why some of these systems need ACID semantics and others do not. We also describe the evolution of our MySQL databases to a pure SSD-based deployment.
Zynga's analytics started with the core concepts of open access and massive scale. Zynga gambled on a newer technology, the MPP compressed column store, because it potentially offered full ANSI SQL, ease of integration with common reporting and ETL systems, reduced hardware needs, a common SQL skillset among employees, fast loading, fast data availability, and fast query speeds. Vertica was the database engine chosen.

For this approach to work, Zynga centralized the data infrastructure team, the business intelligence database team, and the analysts into one group. Control of data flow was pushed out to an API tier so that semi-structure could be applied to the data as it was logged, skipping much of the usual pre-processing. Because data is logged in a semi-structured state, schema changes are generally not required as tracking needs change. The central group also maintains the data models so that network- and platform-level questions can easily be asked. Instead of each group hiring its own analysts, analysts are embedded throughout the company but report into a central group. This was key to spreading effective use and knowledge of the analytical systems, and the analyst group also funnels insights back to the rest of the company as a whole.

Today, Zynga loads over 70 billion rows a day in real time into its Vertica-based system, and over 80 billion rows a day into its custom streaming event tier. These systems form the backbone of other analytical systems, such as an A/B testing platform and real-time data services for games. Zynga's big data systems and its analytically driven culture are viewed as a core piece of its competitive advantage.
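The semi-structured logging idea, where an API tier wraps arbitrary tracking fields in a thin envelope so downstream schemas never change, can be sketched as follows. The envelope fields and function names here are hypothetical, not Zynga's actual API tier.

```python
import json
import time

def log_event(sink, app, event, **payload):
    """Wrap an arbitrary key/value payload in a minimal envelope at
    log time, so new tracking fields need no downstream schema change."""
    record = {
        "app": app,
        "event": event,
        "ts": time.time(),
        "payload": payload,   # semi-structured: any keys allowed
    }
    sink.append(json.dumps(record))

events = []
log_event(events, "farm_game", "plant_crop", crop="corn", plot=7)
log_event(events, "farm_game", "purchase", item="tractor",
          price_usd=2.5, ab_group="B")   # new field, no schema change
```

Only the envelope (`app`, `event`, `ts`) is fixed; everything game-specific rides in `payload`, which is what lets tracking evolve without coordinating schema migrations across teams.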
When Dr. Stonebraker attacked the elephants, declaring that the era of One Size Fits All was over, he landed a body blow. In smashing the behemoths, Mike inspired a new generation of specialized systems for stream processing, analytics, transaction processing, and document stores. Today, most production data management systems follow this splintered, specialized approach. The mass adoption of ETL, data migration, and MDM has been driven by the need to access data in different formats stored in disparate systems. While these specialized systems are not going away, they are being permanently overwhelmed by the phenomenon that has become known as Big Data.
In the shadow of the old guard elephants, a new beast also emerged: a fuzzier, kinder elephant called Hadoop. Hadoop was built specifically to address Big Data. Inspired by the problems encountered in Google's mission to organize all the world's information and make it accessible, Hadoop is now being used to tackle Big Data challenges in nearly every industry.
As Stonebraker has been keen to point out, the principal programming paradigm of Hadoop is not a panacea. Indeed, despite the flexibility of MapReduce, it is a very low-level primitive for the average data pioneer, and in many cases a less than ideal paradigm for analyzing data. This combination of flexibility and inaccessibility has inspired higher-level languages such as the Hive Query Language and Pig Latin. Hadoop is also used in conjunction with classical programming languages such as Python and Perl, or with statistical languages such as R.
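To see why MapReduce is a low-level primitive, consider word count phrased as bare map and reduce functions. This is a plain-Python sketch of the paradigm, not the Hadoop API; the sorted grouping stands in for the framework's shuffle phase.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Emit one (key, 1) pair per word, as a Hadoop mapper would.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # The framework's shuffle/sort, then a per-key sum in the reducer.
    pairs = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(n for _, n in group))

counts = dict(reduce_phase(map_phase(["big data big", "data"])))
# In HiveQL the same job collapses to roughly one line:
#   SELECT word, COUNT(*) FROM words GROUP BY word;
```

Even this trivial analysis requires the programmer to think in emitted key/value pairs and grouping, which is exactly the gap Hive and Pig were built to close.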
In this talk we will describe the current capabilities and future direction of the Hadoop ecosystem, and how this approach compares to the One Size Fits All philosophy that has become the dominant approach to tackling relational data management challenges.
Extremely large databases are growing in importance worldwide as the era of Internet connectivity solidifies its reach across all industries. Faster and more energy-efficient memory is key to the growth of XLDBs and will continue to play a critical role over the next decade. Samsung, the world leader in memory technology and production, will provide perspective on where DRAM and solid state drives (SSDs) are headed over the next few years, while noting technological challenges that could impede their advancement. This presentation will also cover longer-term technology trends for DRAM and flash memory, as well as other future memory opportunities.
In this talk, I will enumerate some of the many tradeoffs among the analytical systems available for processing the extremely large data sets of interest to the XLDB community. We will compare Vertica and other parallel database systems, Hadoop/MapReduce, HBase/Cassandra, and Pig/Hive.
“Big Data” means different things to different people. To me, it means one of three totally different problems:
Big volume. The traditional data warehouse vendors support SQL analytics on very large volumes of data. In my opinion, they are doing a good job on “small analytics”. In contrast, “big analytics” means data clustering, regressions, machine learning, and other much more complex analytics. These are generally not expressible in SQL and are best served by array-oriented DBMSs. I briefly discuss SciDB, as an example of this new category of DBMSs, as well as other alternatives. I also explain why I believe the market will quickly move from small analytics to big analytics.
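As a concrete instance of an analytic that is iterative rather than set-oriented, here is a toy one-dimensional k-means in plain Python. Clustering like this maps naturally onto array operations but only awkwardly onto SQL; the initialization and iteration count are arbitrary choices for illustration, not SciDB's algorithm.

```python
def kmeans_1d(points, k, iters=10):
    """Toy 1-D k-means: repeatedly assign points to the nearest
    center, then recompute each center as its cluster's mean."""
    centers = sorted(points)[:k]       # naive init: k smallest points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Recompute centers; keep the old center if a cluster emptied.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers
```

Each pass touches every point and every center, and the number of passes is data-dependent, which is why a declarative, single-statement SQL formulation is a poor fit.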
Big velocity. By this I mean being able to absorb and process a firehose of incoming data for applications like electronic trading. In this market, the traditional SQL vendors are a non-starter. I will discuss alternatives including complex event processing (CEP), NoSQL, and NewSQL systems.
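A minimal sketch of the kind of sliding-window aggregate a CEP engine evaluates continuously over a tick firehose, here a moving average of prices over the last few seconds, written in plain Python with hypothetical names.

```python
from collections import deque

class WindowAverage:
    """Running average over the last `window` seconds of ticks,
    the basic building block of CEP-style streaming queries."""
    def __init__(self, window):
        self.window = window
        self.ticks = deque()          # (timestamp, price) pairs
        self.total = 0.0

    def on_tick(self, ts, price):
        self.ticks.append((ts, price))
        self.total += price
        # Evict ticks that have fallen out of the time window.
        while self.ticks and self.ticks[0][0] <= ts - self.window:
            _, old = self.ticks.popleft()
            self.total -= old
        return self.total / len(self.ticks)

w = WindowAverage(window=5.0)
w.on_tick(0.0, 100.0)   # average is 100.0
w.on_tick(2.0, 102.0)   # average is 101.0
w.on_tick(6.0, 110.0)   # first tick evicted; average is 106.0
```

The contrast with a traditional SQL store is that the query is standing and the data flows past it, rather than the data sitting still while queries arrive.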
Big variety. Many enterprises are faced with integrating a larger and larger number of data sources with diverse data (spreadsheets, web sources, XML, traditional DBMSs). The traditional ETL products do not appear up to the challenges of this new world, and I talk about an alternate way to go.
I then briefly summarize the goals of a new Intel Science and Technology Center focused on big data, which is being started at M.I.T.