Biology is experiencing a deluge of sequencing data from a variety of sequencing technologies and biological sources. This data is quickly transforming biology into a data-intensive science -- a transformation that most biologists are not trained for. Integrating sequencing data into experimental biology projects requires new tools, approaches, and training. Our lab develops and applies Big Data approaches for robust, sensitive, and efficient analysis of sequencing data. We have developed several novel streaming and compression approaches that provide substantial leverage on the problem of scaling sequence data analysis, and we are also working on trans-disciplinary training and expectation management for biologists new to computation.
Bionimbus is an open-source, multi-petabyte cloud computing platform for managing, analyzing, archiving, and sharing large genomic datasets. Bionimbus hosts a variety of genomics datasets, including the 275+ TB 1000 Genomes dataset. In this talk, we discuss some of the lessons we learned operating Bionimbus over the past three years and some of the differences between science clouds, such as Bionimbus, and commercial clouds.
NASA deals with some of the largest and most complex data sets on the planet. When dealing with large amounts of data in a traditional IT environment, it can take several months and hundreds of hours of labor by several different people to procure, set up, configure, and maintain new IT infrastructure. Moreover, NASA must comply with a host of data security and privacy policies, which makes it challenging to find a collaborative environment in which its scientists and researchers can share data with outside partners and the public.
Hear how NASA, via an open-source cloud computing project, developed an alternative to the costly construction of additional data centers, provided a simplified avenue for sharing large, complex data sets with external partners and the public, and ultimately saved hundreds of staff hours, allowing its scientists to focus on mission-critical activities instead of IT infrastructure requirements.
Google BigQuery is a data analysis tool born from Google's internal Dremel technology. It enables developers to build applications that run ad-hoc, iterative analyses on billions of rows of data in seconds using a RESTful API. This talk will give an overview of BigQuery, briefly cover the underlying architecture, and demonstrate its power by running live queries against public datasets.
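To make the RESTful access concrete, here is a minimal sketch of the URL and JSON body for a synchronous query against the public `jobs.query` endpoint (REST API v2). The project id and SQL are placeholders, and actually sending the request would require an OAuth2 bearer token, which is omitted here.

```python
import json

def build_query_request(project_id, sql, timeout_ms=10000):
    """Build the URL and JSON body for a synchronous BigQuery
    jobs.query call (REST API v2). Authentication is omitted."""
    url = ("https://bigquery.googleapis.com/bigquery/v2/"
           f"projects/{project_id}/queries")
    body = {
        "query": sql,
        "useLegacySql": False,
        "timeoutMs": timeout_ms,
    }
    return url, json.dumps(body)

# Example: an ad-hoc aggregate over a public sample dataset.
url, body = build_query_request(
    "my-project",  # hypothetical project id
    "SELECT word, SUM(word_count) AS n "
    "FROM `bigquery-public-data.samples.shakespeare` "
    "GROUP BY word ORDER BY n DESC LIMIT 10",
)
```

POSTing that body to the returned URL yields the query results as JSON rows, which is what lets an application run ad-hoc analyses without managing any infrastructure.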
A big data project on the horizon? Here are some things to consider as you plan. Already living with big data? Here are some tips and techniques for dealing with the unique challenges that come with it. Based on several years of living above the trillion-record line.
Virginia Tech's Naturalistic Driving Studies use instrumented vehicles to study behavior related to transportation safety. The most notable of these, the SHRP2 NDS, covers over 3,000 vehicles and is the largest such study. This national study will gather over a petabyte of compressed video and over 150 terabytes of sensor data. The dataset is designed to be a cornerstone for safety researchers to analyze and mine over the next twenty years or more.
Large-scale naturalistic driving studies blend many data types and are a good example of scientific research data. Virginia Tech's Scientific Data Warehouse is an infrastructure blending cluster file systems and compute clusters with parallel database systems. We offer this design as a reference architecture for data-intensive science. We will discuss the infrastructure design of this data center in detail as well as the workflows and data structures used to maintain and process the data.
In today's highly data-driven economy, the amount of data logged by the CMS Experiment on the Large Hadron Collider at CERN (the European Organization for Nuclear Research) may not be exceptional. However, the organisation, dissemination, and subsequent analysis of this data entail many unique approaches. The main elements of this evolved strategy will be discussed, together with some indications of future directions.
This talk will cover data management aspects of the Shot Data Systems at the National Ignition Facility.
Efficiency and agility in extreme data analytics do not fit the standard processes that data warehousing has followed for many years. The larger and more complex the environment, the more important guiding principles and architecture are to building infrastructure that can accelerate future, not-yet-known business needs with faster time to market. A simple set of principles such as Simplicity, Time to Market, Agility, Designing for the Unknown, Analytics as a Service, and Pattern-based Development can guide successful implementation of an infrastructure and platform that will sustain the business far into the future. The presentation will show how these principles apply, with examples such as Data Engine Templates, Analytics as a Service, and reusable infrastructure components. Building productive, efficient processes and infrastructure gives your business agility and competitiveness in a world where the pace of knowledge acquisition defines competitive ability.
At Facebook, we use various types of databases and storage systems to satisfy the needs of different applications. The solutions built around these data stores share a common set of requirements: they have to be highly scalable, maintenance costs should be low, and they have to perform efficiently. We use a sharded MySQL+memcache solution to support real-time access to tens of petabytes of data, and we use TAO to provide consistency for this web-scale database across geographical distances. We use the Haystack datastore to store the 3 billion new photos we host every week. We use Apache Hadoop to mine intelligence from 100 petabytes of click logs and combine it with the power of Apache HBase to store all Facebook Messages.
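The sharded MySQL+memcache read path can be sketched roughly as below. This is a toy illustration, not Facebook's implementation: the hash choice, shard count, and cache policy are assumptions, with plain Python dicts standing in for the MySQL shards and memcache.

```python
import hashlib

N_SHARDS = 4
shards = [{} for _ in range(N_SHARDS)]   # stand-ins for MySQL shards
cache = {}                               # stand-in for memcache

def shard_for(user_id):
    # Stable hash, so a given user always routes to the same shard.
    h = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(h, 16) % N_SHARDS

def write_user(user_id, row):
    shards[shard_for(user_id)][user_id] = row
    cache.pop(user_id, None)             # invalidate cache on write

def read_user(user_id):
    # Cache-aside: try memcache first, fall back to the owning shard.
    if user_id in cache:
        return cache[user_id]
    row = shards[shard_for(user_id)].get(user_id)
    if row is not None:
        cache[user_id] = row             # populate cache on miss
    return row
```

The point of the pattern is that hot reads are absorbed by the cache tier while writes go to exactly one shard, which is what lets the fleet scale horizontally to tens of petabytes.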
This talk describes the reasons why each of these databases is appropriate for its workload and the design decisions and tradeoffs that were made while implementing these solutions. We touch upon the consistency, availability, and partition tolerance of each of these solutions, and upon the reasons why some of these systems need ACID semantics and others do not. We also describe the evolution of our MySQL databases to a pure SSD-based deployment.
Zynga's analytics started with the core concepts of open access and massive scale. Zynga gambled on a newer technology, the MPP compressed column store, because it potentially offered full ANSI SQL, ease of integration with common reporting and ETL systems, reduced hardware needs, a common SQL skillset among employees, fast loading, fast data availability, and fast query speeds. Vertica was the database engine chosen.

For this approach to work, Zynga centralized the data infrastructure team, the business intelligence database team, and the analysts into one group. Control of data flow was pushed out to an API tier so that semi-structure could be applied to the data as it was logged, skipping much of the usual pre-processing. Because data is logged in a semi-structured state, schema changes are generally not required as tracking needs change. The central group also maintains the data models so that network- and platform-level questions can easily be asked. Instead of each group hiring its own analysts, analysts are embedded throughout the company but report into a central group. This was key to spreading effective use and knowledge of the analytical systems, and the analyst group also funnels insights back to the rest of the company as a whole.

Today, Zynga loads over 70 billion rows a day in real time into its Vertica-based system, and over 80 billion rows a day into its custom streaming event tier. These systems form the backbone of other analytical systems, such as an A/B testing platform and real-time data services for games. Zynga's big data systems and its analytically driven culture are viewed as a core piece of its competitive advantage.
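The semi-structured logging idea, where an API tier wraps arbitrary tracking fields in a thin envelope so downstream schemas never change, can be sketched as follows. The envelope fields and function names here are hypothetical, not Zynga's actual API tier.

```python
import json
import time

def log_event(sink, app, event, **payload):
    """Wrap an arbitrary key/value payload in a minimal envelope at
    log time, so new tracking fields need no downstream schema change."""
    record = {
        "app": app,
        "event": event,
        "ts": time.time(),
        "payload": payload,   # semi-structured: any keys allowed
    }
    sink.append(json.dumps(record))

events = []
log_event(events, "farm_game", "plant_crop", crop="corn", plot=7)
log_event(events, "farm_game", "purchase", item="tractor",
          price_usd=2.5, ab_group="B")   # new field, no schema change
```

Only the envelope (`app`, `event`, `ts`) is fixed; everything game-specific rides in `payload`, which is what lets tracking evolve without coordinating schema migrations across teams.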
When Dr. Stonebraker attacked the elephants, declaring that the era of One Size Fits All was over, he landed a body blow. In smashing the behemoths, Mike inspired a new generation of specialized systems for stream processing, analytics, transaction processing, and document stores. Today, most production data management systems follow this splintered, specialized approach. The mass adoption of ETL, data migration, and MDM has been driven by the need to access data in different formats stored in disparate systems. While these specialized systems are not going away, they are being permanently overwhelmed by the phenomenon that has become known as Big Data.
In the shadow of the old guard elephants, a new beast also emerged: a fuzzier, kinder elephant called Hadoop. Hadoop was built specifically to address Big Data. Inspired by the problems encountered in Google's mission to organize all the world's information and make it accessible, Hadoop is now being used to tackle Big Data challenges in nearly every industry.
As Stonebraker has been keen to point out, the principal programming paradigm of Hadoop is not a panacea. Indeed, despite the flexibility of MapReduce, it is a very low-level primitive for the average data pioneer, and in many cases a less than ideal paradigm for analyzing data. This combination of flexibility and inaccessibility has inspired higher-level languages such as the Hive Query Language and Pig Latin. Hadoop is also used in conjunction with classical programming languages such as Python and Perl, or with statistical languages such as R.
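To see why MapReduce is a low-level primitive, consider word count phrased as bare map and reduce functions. This is a plain-Python sketch of the paradigm, not the Hadoop API; the sorted grouping stands in for the framework's shuffle phase.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Emit one (key, 1) pair per word, as a Hadoop mapper would.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # The framework's shuffle/sort, then a per-key sum in the reducer.
    pairs = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(n for _, n in group))

counts = dict(reduce_phase(map_phase(["big data big", "data"])))
# In HiveQL the same job collapses to roughly one line:
#   SELECT word, COUNT(*) FROM words GROUP BY word;
```

Even this trivial analysis requires the programmer to think in emitted key/value pairs and grouping, which is exactly the gap Hive and Pig were built to close.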
In this talk we will describe the current capabilities and future direction of the Hadoop ecosystem, and how this approach compares to the One Size Fits All philosophy that has become the dominant approach to tackling relational data management challenges.
Extremely large databases are growing in importance worldwide as the era of Internet connectivity solidifies its reach across all industries. Faster and more energy-efficient memory is key to the growth of XLDBs and will continue to play a critical role over the next decade. Samsung, the world leader in memory technology and production, will provide perspective on where DRAM and solid state drives (SSDs) are headed over the next few years, while noting technological challenges that could impede their advancement. This presentation will also cover longer-term technology trends for DRAM and flash memory, as well as other future memory opportunities.
In this talk, I will enumerate some of the many tradeoffs among the analytical systems available for processing the extremely large data sets of interest to the XLDB community. We will compare Vertica and other parallel database systems, Hadoop/MapReduce, HBase/Cassandra, and Pig/Hive.
“Big Data” means different things to different people. To me, it means one of three totally different problems:
Big volume. The traditional data warehouse vendors support SQL analytics on very large volumes of data. In my opinion, they are doing a good job on “small analytics”. In contrast, “big analytics” means data clustering, regressions, machine learning, and other much more complex analytics. These are generally not expressible in SQL and are best served by array-oriented DBMSs. I briefly discuss SciDB, as an example of this new category of DBMSs, as well as other alternatives. I also explain why I believe the market will quickly move from small analytics to big analytics.
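As a concrete instance of an analytic that is iterative rather than set-oriented, here is a toy one-dimensional k-means in plain Python. Clustering like this maps naturally onto array operations but only awkwardly onto SQL; the initialization and iteration count are arbitrary choices for illustration, not SciDB's algorithm.

```python
def kmeans_1d(points, k, iters=10):
    """Toy 1-D k-means: repeatedly assign points to the nearest
    center, then recompute each center as its cluster's mean."""
    centers = sorted(points)[:k]       # naive init: k smallest points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Recompute centers; keep the old center if a cluster emptied.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers
```

Each pass touches every point and every center, and the number of passes is data-dependent, which is why a declarative, single-statement SQL formulation is a poor fit.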
Big velocity. By this I mean being able to absorb and process a firehose of incoming data for applications like electronic trading. In this market, the traditional SQL vendors are a non-starter. I will discuss alternatives including complex event processing (CEP), NoSQL, and NewSQL systems.
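A minimal sketch of the kind of sliding-window aggregate a CEP engine evaluates continuously over a tick firehose, here a moving average of prices over the last few seconds, written in plain Python with hypothetical names.

```python
from collections import deque

class WindowAverage:
    """Running average over the last `window` seconds of ticks,
    the basic building block of CEP-style streaming queries."""
    def __init__(self, window):
        self.window = window
        self.ticks = deque()          # (timestamp, price) pairs
        self.total = 0.0

    def on_tick(self, ts, price):
        self.ticks.append((ts, price))
        self.total += price
        # Evict ticks that have fallen out of the time window.
        while self.ticks and self.ticks[0][0] <= ts - self.window:
            _, old = self.ticks.popleft()
            self.total -= old
        return self.total / len(self.ticks)

w = WindowAverage(window=5.0)
w.on_tick(0.0, 100.0)   # average is 100.0
w.on_tick(2.0, 102.0)   # average is 101.0
w.on_tick(6.0, 110.0)   # first tick evicted; average is 106.0
```

The contrast with a traditional SQL store is that the query is standing and the data flows past it, rather than the data sitting still while queries arrive.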
Big variety. Many enterprises are faced with integrating a larger and larger number of data sources with diverse data (spreadsheets, web sources, XML, traditional DBMSs). The traditional ETL products do not appear up to the challenges of this new world, and I talk about an alternate way to go.
I then briefly summarize the goals of a new Intel Science and Technology Center focused on big data, which is being started at M.I.T.