XLDB-2012 Tutorials

Monday, September 10, 2012

7:30 AM

Continental Breakfast (registration starts, Knight Management Center)

8:30 AM

Morning Session (coffee break 10:20-10:40 AM)

A	The Future of Analytics (part I)	Stephen Brobst / Teradata & Tom Fastner / eBay	Gunn 101
B	Developing Applications for Apache Hadoop	Sarah Sproehnle / Cloudera	Gunn 102
C	Bringing Relational Databases and Hadoop Together	John Hax & Tom Plunkett / Oracle	McClelland 105

12:30 PM

Lunch Break (take away lunch served)

1:00 PM

Afternoon Session (coffee break 3:20-3:40 PM)

A	The Future of Analytics (part II)	Stephen Brobst / Teradata	Gunn 101
B	Using SciDB: an array-based analytical database for integrated complex analytics and scientific data management	Paul Brown & Alex Poliakov / SciDB	Gunn 102
C	Data Structures and Algorithms for Big Databases	Michael A. Bender / SUNY Stony Brook & Bradley C Kuszmaul / MIT	McClelland 105

5:00 PM

Adjourn

The Future of Analytics
Stephen Brobst / Teradata &
Tom Fastner / eBay

This full day workshop examines the trends in analytic technologies, methodologies, and use cases. You will learn about futures in:

big data analytics,
analytics in the cloud,
agile analytics deployment methodologies,
in-database analytics,
new analytic paradigms such as MapReduce/Hadoop,
text and social media analytics,
analytic applications architecture, and
extreme analytics case studies.

The implications of these developments for deploying analytic capabilities will be discussed, with examples of future architectures and implementations. This tutorial presents best practices for deploying next-generation analytics. We will also explore emerging trends in extended analysis using content from Web 3.0 applications, sensor networks, and other non-traditional data sources.

Developing Applications for Apache Hadoop
Sarah Sproehnle / Cloudera

This tutorial will explain how to leverage a Hadoop cluster to do data analysis using Java MapReduce, Hive, Pig and HBase. It is recommended that participants have experience with some programming language.

Topics include:

Why Hadoop and MapReduce?
Writing a Java MapReduce program
Common algorithms applied to Hadoop such as indexing, classification, joining data sets and graph processing
Data analysis with Hive and Pig
Overview of writing applications that use Apache HBase
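
As a rough illustration of the MapReduce model named in the topics above, the sketch below counts words with explicit map, shuffle, and reduce steps. It is plain in-memory Python, not the Java MapReduce API taught in the tutorial, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    # Map: emit a (word, 1) pair for every word in one input line.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Shuffle: group all emitted values by key, as the framework
    # would do between the map and reduce stages.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    # Reduce: collapse the values for one key into a single count.
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # Run the three stages in sequence on an in-memory list of lines.
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling word_count(["big data", "big databases"]) yields a count of 2 for "big" and 1 each for "data" and "databases". On a real cluster the map and reduce functions run in parallel on many machines and the shuffle moves data over the network; this sketch only shows the data flow.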

Bringing Relational Databases and Hadoop Together
John Hax & Tom Plunkett / Oracle

Effective big data solutions require efficient modeling, loading, and statistical analysis. To bring big data solutions into the mainstream, the modeling and loading stages must become less developer-centric. This tutorial will cover the modeling, loading, and viewing of Hadoop data within an Oracle database. Typical processing in Hadoop includes data validation and transformations that are programmed as MapReduce jobs. Data loading uses the Oracle Loader for Hadoop; both direct loading and HDFS will be covered. Using Oracle Data Integrator and the Oracle Data Integrator Application Adapter for Hadoop, developers can abstract the implementation of MapReduce jobs, including HiveQL. Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty; it thereby provides the navigation essential for controlling the course of scientific and societal advances. To bring enterprise-level tools to the “R” user, Oracle has introduced a flavor of “R” that brings sophisticated, enterprise-class scalability to open source R. The third topic covered in the tutorial is the viewing of Hadoop and relational datasets using Oracle “R”.

Using SciDB: an array-based analytical database for integrated complex analytics and scientific data management
Paul Brown & Alex Poliakov / SciDB

SciDB is an open source analytical database system for use in scientific and commercial applications that involve very large multi-dimensional data sets and scalable complex analytics. It runs on commodity hardware grids or in a cloud.

SciDB is built to address a suite of requirements shared by scientists:

Ingest, store, access, and manage data throughout its life cycle
Save raw, corrected, pre-processed, and derived data, along with meta data and provenance
Explore, filter, and drill down using rich selection criteria
Do massively scalable complex math, modeling and simulations
Share data across work groups and with outside organizations
Support reproducibility of results

This tutorial presents an overview of the SciDB architecture, the array data model, the programming and query interfaces, math library, and data management capabilities.

We will explain how to set up a SciDB cluster, how to load data from various standard file formats (CSV, HDF5, FITS, FASTA, BAM, etc.), and what to consider when designing a schema. We will walk through two use cases, one from computational genomics and one that analyzes vegetation from satellite imagery of the Earth.

Data Structures and Algorithms for Big Databases
Bradley C Kuszmaul / MIT &
Michael A. Bender / SUNY Stony Brook

This tutorial will explore data structures and algorithms for big databases. The topics include:

Data structures including B-trees, Log Structured Merge Trees, and Streaming B-trees.
Approximate Membership Query data structures including Bloom filters and cascade filters.
Join algorithms including hash joins and Graefe’s generalized join.
Index design, including covering indexes.
Consistency (row locks, multiversion concurrency).
Getting good performance in memory.
Cache efficiency including both Cache-aware and Cache-oblivious data structures and algorithms.

These algorithms and data structures are used both in NoSQL implementations such as MongoDB and HBase and in SQL-oriented implementations such as MySQL and TokuDB.
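
To give a flavor of the approximate membership structures listed above, here is a minimal Bloom filter sketch in Python. It is an illustration under stated assumptions, not material from the tutorial: the class name BloomFilter, the method names, and the choice of double hashing over a SHA-256 digest are all picked for this example.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions from one digest by double hashing
        # (an assumption of this sketch; any independent hashes would do).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        # Set every bit position derived from the key.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # True if all derived bits are set. A True answer may be a false
        # positive; a False answer is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )
```

A database can consult such a filter before a disk lookup: if might_contain returns False, the key is definitely absent and the read can be skipped. The false-positive rate depends on the number of bits, the number of hash functions, and the number of keys inserted.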

This talk explains and analyzes data structures, so it may not suit someone who hates seeing O(N log N). But we'll keep the content accessible so that anyone who can tolerate some math will benefit from attending.