XLDB - Extremely Large Databases

XLDB-2012 Tutorials


Monday, September 10, 2012
7:30 AM Continental Breakfast (registration starts, Knight Management Center)
8:30 AM Morning Session (coffee break 10:20-10:40 AM)
A The Future of Analytics (part I) Stephen Brobst / Teradata & Tom Fastner / eBay Gunn 101
B Developing Applications for Apache Hadoop Sarah Sproehnle / Cloudera Gunn 102
C Bringing Relational Databases and Hadoop Together John Hax & Tom Plunkett / Oracle McClelland 105
12:30 PM Lunch Break (take away lunch served)
1:00 PM Afternoon Session (coffee break 3:20-3:40 PM)
A The Future of Analytics (part II) Stephen Brobst / Teradata Gunn 101
B Using SciDB: an array-based analytical database for integrated complex analytics and scientific data management Paul Brown & Alex Poliakov / SciDB Gunn 102
C Data Structures and Algorithms for Big Databases Michael A. Bender / SUNY Stony Brook & Bradley C Kuszmaul / MIT McClelland 105
5:00 PM Adjourn

The Future of Analytics
Stephen Brobst / Teradata &
Tom Fastner / eBay

This full-day workshop examines trends in analytic technologies, methodologies, and use cases. You will learn about the future of:

  • big data analytics,
  • analytics in the cloud,
  • agile analytics deployment methodologies,
  • in-database analytics,
  • new analytic paradigms such as MapReduce/Hadoop,
  • text and social media analytics,
  • analytic applications architecture, and
  • extreme analytics case studies.

The implications of these developments for the deployment of analytic capabilities will be discussed, with examples of future architectures and implementations. This tutorial presents best practices for deploying next-generation analytics. We will also explore emerging trends related to extended analysis using content from Web 3.0 applications, sensor networks, and other non-traditional data sources.


Developing Applications for Apache Hadoop
Sarah Sproehnle / Cloudera

This tutorial will explain how to use a Hadoop cluster for data analysis with Java MapReduce, Hive, Pig, and HBase. Participants should have experience with at least one programming language.

Topics include:

  • Why Hadoop and MapReduce?
  • Writing a Java MapReduce program
  • Common algorithms applied to Hadoop such as indexing, classification, joining data sets and graph processing
  • Data analysis with Hive and Pig
  • Overview of writing applications that use Apache HBase
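
To give a rough feel for the MapReduce model named in the topics above, here is a minimal in-memory sketch in plain Python. It is not the Hadoop API and not material from the tutorial; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. A real Hadoop job would express the same mapper and reducer in Java and let the framework handle the shuffle across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in one input line.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reducer: combine all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    # Run map, shuffle, and reduce in sequence on a list of input lines.
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big databases", "data structures"])
print(counts)
```

The mapper and reducer never share state; everything passes through key-grouped pairs, which is what lets a framework run them on many machines at once.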

Bringing Relational Databases and Hadoop Together
John Hax & Tom Plunkett / Oracle

Effective Big Data solutions require efficient modeling, loading, and statistical analysis. To bring big data solutions into the mainstream, it is imperative that the modeling and loading stages become less developer-centric. The Oracle tutorial will cover the modeling, loading, and viewing of Hadoop data within an Oracle database. Typical processing in Hadoop includes data validation and transformations that are programmed as MapReduce jobs. Data loading utilizes the Oracle Loader for Hadoop; both direct loading and distributed HDFS will be covered. Utilizing Oracle Data Integrator and the Oracle Data Integrator Application Adapter for Hadoop, developers can abstract the implementation of MapReduce jobs, including HiveQL. Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty; it thereby provides the navigation essential for controlling the course of scientific and societal advances. To bring enterprise-level tools to the R user, Oracle has introduced a flavor of R that brings sophisticated, enterprise-class scalability to open source R. The third topic covered in the tutorial is the viewing of Hadoop and relational datasets using Oracle R.


Using SciDB: an array-based analytical database for integrated complex analytics and scientific data management
Paul Brown & Alex Poliakov / SciDB

SciDB is an open source analytical database system for use in scientific and commercial applications that involve very large multi-dimensional data sets and scalable complex analytics. It runs on commodity hardware grids or in a cloud.

SciDB is built to address a suite of requirements shared by scientists:

  • Ingest, store, access, and manage data throughout its life cycle
  • Save raw, corrected, pre-processed, and derived data, along with metadata and provenance
  • Explore, filter, and drill down using rich selection criteria
  • Do massively scalable complex math, modeling and simulations
  • Share data across work groups and with outside organizations
  • Support reproducibility of results

This tutorial presents an overview of the SciDB architecture, the array data model, the programming and query interfaces, math library, and data management capabilities.

We will talk about how to set up a SciDB cluster, load data from various standard file formats (CSV, HDF5, FITS, FASTA, BAM, etc.), and discuss schema design considerations. We will walk through two use cases, one from computational genomics and one that analyzes vegetation from satellite imagery of the earth.


Data Structures and Algorithms for Big Databases
Bradley C Kuszmaul / MIT &
Michael A. Bender / SUNY Stony Brook

This tutorial will explore data structures and algorithms for big databases. The topics include:

  • Data structures including B-trees, Log-Structured Merge Trees, and Streaming B-trees.
  • Approximate Membership Query data structures including Bloom filters and cascade filters.
  • Algorithms for joins including hash joins and Graefe’s generalized join.
  • Index design, including covering indexes.
  • Consistency (row locks, multiversion concurrency).
  • Getting good performance in memory.
  • Cache efficiency including both cache-aware and cache-oblivious data structures and algorithms.

These algorithms and data structures are used both in NoSQL implementations such as MongoDB and HBase and in SQL-oriented implementations such as MySQL and TokuDB.

This talk involves explaining and analyzing data structures, so it may not suit someone who hates seeing O(N log N). But we'll keep the content accessible so that anyone who can tolerate some math will benefit from attending.
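
As a small taste of the approximate membership structures listed among the topics, a Bloom filter can be sketched in a few lines of Python. This is an illustrative toy, not code from the tutorial: the class name `BloomFilter`, the method `might_contain`, the default sizes, and the salted SHA-256 hashing scheme are all assumptions made for the example. Production systems use compact bit arrays and faster hash functions.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes positions by salting a SHA-256 hash of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        # Set every position derived from the key.
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # True only if every derived position is set.
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
for row_key in ["alice", "bob", "carol"]:
    bf.add(row_key)

print(bf.might_contain("alice"))  # True: inserted keys are always found
print(bf.might_contain("dave"))   # usually False; a false positive is possible
```

A database can consult such a filter before going to disk: a negative answer is definitive, so the lookup can be skipped, while a positive answer still has to be confirmed against the stored data.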
