XLDB-2012 Tutorials
Monday, September 10, 2012
07:30 AM - Continental Breakfast (registration starts, Knight Management Center)
08:30 AM - Morning Session (coffee break 10:20-10:40 AM)
12:30 PM - Lunch Break (take-away lunch served)
01:00 PM - Afternoon Session (coffee break 3:20-3:40 PM)
05:00 PM - Adjourn
The Future of Analytics
Stephen Brobst / Teradata &
Tom Fastner / eBay
This full day workshop examines the trends in analytic technologies,
methodologies, and use cases. You will learn about futures in:
- big data analytics,
- analytics in the cloud,
- agile analytics deployment methodologies,
- in-database analytics,
- new analytic paradigms such as MapReduce/Hadoop,
- text and social media analytics,
- analytic applications architecture, and
- extreme analytics case studies.
The implications of these developments for the deployment of analytic capabilities
will be discussed, with examples of future architectures and implementations.
This tutorial presents best practices for deploying next-generation analytics.
We will also explore emerging trends related to extended analysis using content
from Web 3.0 applications, sensor networks, and other non-traditional data sources.
Developing Applications for Apache Hadoop
Sarah Sproehnle / Cloudera
This tutorial will explain how to leverage a Hadoop cluster to do data
analysis using Java MapReduce, Hive, Pig and HBase. It is recommended
that participants have experience with some programming language.
Topics include:
- Why Hadoop and MapReduce?
- Writing a Java MapReduce program
- Common algorithms applied to Hadoop such as indexing,
classification, joining data sets and graph processing
- Data analysis with Hive and Pig
- Overview of writing applications that use Apache HBase
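The MapReduce programming model covered above can be illustrated with a minimal word count. The sketch below is pure Python rather than Hadoop's Java API: it simulates the map, shuffle, and reduce phases in a single process, under simplified assumptions, and the function names are illustrative rather than part of any Hadoop interface.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group pairs by key, as the framework does between map and reduce."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    for word, counts in grouped:
        yield word, sum(counts)

docs = ["big data big analytics", "data analysis with hadoop"]
counts = dict(reduce_phase(shuffle_phase(map_phase(docs))))
```

On a real cluster the shuffle is performed by the Hadoop framework across machines; here it is a local sort-and-group, which is enough to show why the model parallelizes so naturally.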
Bringing Relational Databases and Hadoop Together
John Hax & Tom Plunkett / Oracle
Effective Big Data solutions require efficient modeling, loading, and
statistical analysis. To bring big data solutions into the mainstream,
the modeling and loading stages must become less developer-centric. This
Oracle tutorial covers the modeling, loading, and viewing of Hadoop data
within an Oracle database. Typical processing in Hadoop includes data
validation and transformations programmed as MapReduce jobs. Data
loading uses the Oracle Loader for Hadoop; both direct loading and
loading from the Hadoop Distributed File System (HDFS) will be covered.
Using Oracle Data Integrator and the Oracle Data Integrator Application
Adapter for Hadoop, developers can abstract the implementation of
MapReduce jobs, including HiveQL. Statistics is the science of learning
from data, and of measuring, controlling, and communicating uncertainty;
it thereby provides the navigation essential for steering the course of
scientific and societal advances. To bring enterprise-level tools to "R"
users, Oracle has introduced a version of "R" that brings sophisticated,
enterprise-class scalability to Open Source R. The third topic covered
in the tutorial is the viewing of Hadoop and relational datasets using
Oracle "R".
Using SciDB: an array-based analytical database for integrated complex analytics
and scientific data management
Paul Brown & Alex Poliakov / SciDB
SciDB is an open source analytical database system for use in scientific and commercial
applications that involve very large multi-dimensional data sets and scalable complex analytics.
It runs on commodity hardware grids or in a cloud.
SciDB is built to address a suite of requirements shared by scientists:
- Ingest, store, access, and manage data throughout its life cycle
- Save raw, corrected, pre-processed, and derived data, along with meta data and provenance
- Explore, filter, and drill down using rich selection criteria
- Do massively scalable complex math, modeling and simulations
- Share data across work groups and with outside organizations
- Support reproducibility of results
This tutorial presents an overview of the SciDB architecture, the array
data model, the programming and query interfaces, math library, and
data management capabilities.
We will talk about how to set up a SciDB cluster, load data from various
standard file formats (CSV, HDF5, FITS, FASTA, BAM, etc.), and discuss schema
design considerations. We will walk through two use cases, one from
computational genomics and one that analyzes vegetation from satellite
imagery of the earth.
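The array data model at the heart of SciDB can be sketched in a few lines. The toy code below is not SciDB's actual interface (SciDB queries are written in its own query languages), and the operation names are only loose analogues; it simply illustrates, under simplified assumptions, how cells addressed by dimension coordinates are selected by window or by attribute predicate.

```python
# A toy dense 2-D array: dimensions (row, col), one integer attribute per cell.
# Real SciDB arrays are chunked and distributed across cluster instances;
# this only illustrates the coordinate-addressed data model.
a = {(i, j): i + j for i in range(4) for j in range(4)}

def subarray(arr, r0, r1, c0, c1):
    """Select a rectangular window by dimension coordinates."""
    return {(i, j): v for (i, j), v in arr.items()
            if r0 <= i <= r1 and c0 <= j <= c1}

def filter_attr(arr, pred):
    """Keep cells whose attribute value satisfies a predicate."""
    return {cell: v for cell, v in arr.items() if pred(v)}

window = subarray(a, 1, 2, 1, 2)             # 2x2 window of the 4x4 array
high = filter_attr(window, lambda v: v > 2)  # cells with attribute > 2
```

The key difference from the relational model is that the coordinates (row, col) are not stored attributes but the array's addressing scheme, which is what makes windowed and neighborhood operations on dense scientific data efficient.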
Data Structures and Algorithms for Big Databases
Bradley C Kuszmaul / MIT &
Michael A. Bender / SUNY Stony Brook
This tutorial will explore data structures and algorithms for big
databases. The topics include:
- Data structures including B-trees, Log Structured Merge Trees, and
Streaming B-trees.
- Approximate membership query (AMQ) data structures including Bloom filters
  and cascade filters.
- Algorithms for join including hash joins and Graefe’s generalized join.
- Index design, including covering indexes.
- Consistency (row locks, multiversion concurrency).
- Getting good performance in memory.
- Cache efficiency including both Cache-aware and Cache-oblivious data
structures and algorithms.
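As a taste of the approximate-membership material in the list above, here is a minimal Bloom filter sketch. The parameters m and k and the hashing scheme (salted SHA-256) are arbitrary illustrative choices, not necessarily the construction the tutorial presents.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: k hash positions over an m-bit array.
    Membership tests may return false positives but never false negatives."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for clarity over compactness

    def _positions(self, item):
        # Derive k positions by hashing the item with k different salts.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
for word in ("btree", "lsm", "fractal"):
    bf.add(word)
```

In a database engine such a filter sits in front of an on-disk structure so that most lookups for absent keys skip the I/O entirely; the cascade filters mentioned above extend this idea to structures that also support efficient inserts and deletes.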
These algorithms and data structures are used both in NoSQL
implementations such as MongoDB and HBase and in SQL-oriented
implementations such as MySQL and TokuDB.
This talk involves explaining and analyzing data structures, so it may
not be aimed at someone who hates seeing O(N log N). But we will keep
the content accessible so that anyone who can tolerate some math will
benefit from attending.