XLDB - Extremely Large Databases

Accepted Lightning Talk Abstracts

Stephen Brobst / CTO Teradata

The concept of the "Data Lake" is becoming a widely adopted best practice in constructing an analytic ecosystem. When well executed, a data lake strategy will increase an organization's analytic agility and provide a foundation for provisioning data into both discovery and integrated data warehouse platforms. However, there are many pitfalls with this approach that can be avoided with a best-practices deployment. This talk will provide guidance on how to work toward the "data reservoir" concept rather than ending up with the all too common "data swamp".

Ozgun Erdogan / CTO and Co-Founder Citus Data

Relational databases are known to have tightly integrated components that make them hard to extend. This usually leads to new databases that are specialized for scale and certain workloads.
For its last three releases, PostgreSQL has been providing official APIs to extend the database *without forking it*. Users can read the documentation, implement their ideas according to the APIs, and then dynamically load in their shared object to override or extend any database submodule.
In this talk, we will demonstrate four examples that use these APIs to make PostgreSQL a distributed database focused on real-time workloads:
1. JSON / JSONB: This new data type enables storing and querying semi-structured data.
2. HyperLogLog: This extension creates a new data type, user-defined functions, and aggregate functions to quickly calculate count(distinct) approximations over large data sets.
3. PGStrom: This extension offloads particular CPU intensive workloads to GPU devices and enables running parallelized SQL on GPUs.
4. Citus: Citus shards, replicates, and parallelizes SQL queries. It does this by hooking into Postgres' planner and executor APIs and through transforming queries into their parallelizable form.

In summary, PostgreSQL's extensible architecture puts it in a unique place for scaling out SQL and also for adapting to evolving hardware trends. It could just be that the monolithic SQL database is dying. If so, long live Postgres!
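
The count(distinct) approximation behind the HyperLogLog extension (item 2 above) can be illustrated with a minimal pure-Python sketch. This is a simplified classroom version, not the actual postgresql-hll implementation; the register-count parameter `p` and the use of SHA-1 as the hash are assumptions for illustration:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog: estimates a distinct count with 2^p small registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                     # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        idx = h & (self.m - 1)              # low p bits choose a register
        w = h >> self.p                     # remaining bits
        rank = 1                            # position of the lowest set bit
        while w & 1 == 0 and rank <= 64:
            rank += 1
            w >>= 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        if est <= 2.5 * self.m:             # small-range correction
            zeros = self.registers.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)
        return int(est)
```

With p=10 (1024 registers, a few KB of state), the estimate for millions of distinct values is typically within a few percent, which is the trade-off the extension exploits over large data sets.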

Steve Gonzales / Principal Manager at Think Big Analytics

The Hadoop ecosystem is highly complex, with many moving parts. Even with well-thought-out road maps and architectures, there are numerous pitfalls that project teams fall into when implementing Hadoop distribution upgrades and security implementations and when developing data ingestion capabilities. This presentation will enable you to avoid these anti-patterns and focus on designs that are proven to work.

Sang-Won Lee / Professor at SKKU SICE

At Sungkyunkwan University, Korea, we are developing MyFlashSQL, an open-source dialect of MySQL that optimizes the MySQL/InnoDB engine on top of flash SSDs and non-volatile RAMs. In this talk, we will introduce its motivation and goals and several ongoing research directions, and share some preliminary results.

Fuller / Teradata

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes. It was designed from the ground up with the ability to scale to the sizes required by companies such as Facebook, Airbnb, and Dropbox. Using Presto, one can query data where it lives, including Hive, Cassandra, Kafka, and relational databases. Because of its pluggable architecture, one can write a connector to query from practically any data source (https://prestodb.io).
This talk will present the Presto architecture and pluggable connectors for a variety of data sources.

Andy Mills / President/CEO and Co-Founder of enmotus

The sheer number and diversity of storage devices being deployed in large distributed data processing systems can present a huge challenge, greatly impacting system response times with little visibility into the root cause. Furthermore, establishing ahead of time when these devices are approaching the end of their useful life is virtually impossible in large deployments without a large degree of human intervention.
In this session, we will review a powerful lightweight storage intelligence layer that operates invisibly at the block layer and can work with any block- or object-based file system that utilizes block devices, and soon with memory-class devices as well. As part of this intelligence layer, a deep learning statistical engine keeps track of all storage IO operations in real time, across both time and capacity. By utilizing behavioral device analytics, we are able to detect, both locally with machine learning and remotely via low-impact reporting, when a storage device is behaving badly and, in the best case, completely remove this device from operation until it is fit for reinsertion or is removed entirely at the next maintenance cycle. We believe having this level of automation will dramatically lower the likelihood of persistent long-tail latencies caused by storage devices in systems deploying device analytics.

Amidu O Oloso / Computational Scientist, NASA

We have implemented a flexible User Defined Operator (UDO) for labeling connected components for a binary mask expressed as an array in SciDB, which is a parallel distributed database management system based on the array data model. This UDO is able to process very large multidimensional arrays by exploiting SciDB's memory management mechanism that efficiently manipulates arrays whose memory requirements far exceed available physical memory. The UDO takes as primary inputs a binary mask array and a binary stencil array that specifies the connectivity of a given cell to its neighbors. The UDO returns an array of the same shape as the input mask array with each cell containing the label of the component it belongs to. By default, dimensions are treated as non-periodic, but the UDO also accepts optional input parameters to specify periodicity in any of the dimensions. The UDO requires four stages to completely label connected components. In the first stage, labels are computed on each "chunk" (SciDB terminology for subarrays local to individual nodes) of the mask in parallel across SciDB instances using the weighted quick union (WQU) with half-path compression algorithm. In the second stage, labels around chunk boundaries from the first stage are stored in a temporary SciDB array that is then replicated across all SciDB instances. Equivalences are resolved by again applying the WQU algorithm to these boundary labels. In the third stage, relabelling is done for each chunk using the resolved equivalences. In the fourth stage, the resolved labels, which so far are "flattened" coordinates of the original binary mask array, are renamed with sequential integers for legibility. The UDO is demonstrated on a 3-D mask of size O(10^11) elements, with O(10^8) foreground cells and O(10^6) connected components. The operator completes in 1.5 hours using 6 SciDB instances.
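
The first-stage labeling can be sketched in miniature. The following pure-Python example labels the 4-connected components of a small 2-D mask using weighted quick union with half-path compression, mirroring the per-chunk step described above; it is a single-node illustration, not the SciDB UDO, and the cross-shaped stencil and sequential relabeling are simplifications:

```python
class WeightedQuickUnion:
    """Union-find with weighting by size and half-path compression."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # half-path compression
            i = self.parent[i]
        return i

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra                # attach smaller tree under larger
        self.size[ra] += self.size[rb]

def label_components(mask):
    """Label 4-connected components of a 2-D binary mask (cross stencil)."""
    rows, cols = len(mask), len(mask[0])
    uf = WeightedQuickUnion(rows * cols)
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            if r > 0 and mask[r - 1][c]:    # merge with neighbor above
                uf.union(r * cols + c, (r - 1) * cols + c)
            if c > 0 and mask[r][c - 1]:    # merge with neighbor to the left
                uf.union(r * cols + c, r * cols + c - 1)
    labels, next_label = {}, 1
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                root = uf.find(r * cols + c)
                if root not in labels:      # rename roots to sequential integers
                    labels[root] = next_label
                    next_label += 1
                out[r][c] = labels[root]
    return out
```

The final loop corresponds to the fourth stage above: flattened root coordinates are renamed to sequential integers for legibility.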

Debabrata Sarkar / Senior Engineering Manager at Oracle

Analytics is the science of examining raw data to uncover hidden patterns and correlations, with the purpose of drawing conclusions about the data. It differs from traditional Business Intelligence in that, while traditional Business Intelligence may involve very complex queries, we still know what we are looking for. Analytics, on the other hand, answers questions that we cannot formulate precisely - for example, how many different types of customers do we have and what are their attributes, or which three medicines taken together are most likely to cause severe side effects.

To uncover hidden patterns, analytic algorithms (for example, K-Means clustering) need to scan the data many times and perform compute-heavy operations. To execute these algorithms in a reasonable time, people often use a cluster of machines, which provides both large memory and sufficient computational power, together with a parallel processing framework like Hadoop.
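
The repeated-scan pattern is easy to see in a bare-bones K-Means sketch: every iteration reads the entire data set once for assignment and once more for the centroid update. This is a textbook illustration, not Oracle's implementation; the fixed iteration count and random-sample initialization are simplifying assumptions:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: each of `iters` iterations re-scans every point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)             # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assignment: full pass over data
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):       # update: mean of each cluster
            if cl:
                centers[j] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centers
```

With billions of points, those full passes per iteration are exactly the memory-bandwidth-bound work that either a cluster or the vectorized on-chip approach described below must absorb.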

However, another approach is now emerging: using special types of CPUs that can perform certain parallel operations blazingly fast. Machines built with these CPUs can replace smaller clusters or shrink large clusters to a more manageable size.

During this presentation, I will talk about the special analytic instructions in Oracle's latest SPARC M7 processor and their ability to query compressed data, effectively increasing the memory size many times over. I will share performance results showing 5X to 20X speedups for well-known machine learning algorithms.

Edward Ma / Hewlett Packard Enterprise (HPE)

We present a Vertica data connector for Spark that integrates with Spark's Datasource API so that data from Vertica can be efficiently loaded into Spark. A simple JDBC Spark Datasource that loads all data of a Vertica table into Spark is not optimal, because it does not take advantage of the pre-processing capabilities of Vertica to reduce the amount of data to be transferred or leverage parallel processing effectively. Our solution connects the computational pipelines of Spark and Vertica in an optimized way, and not only utilizes parallel channels for data movement, but also (a) pushes computation down into Vertica when appropriate, and (b) maintains data-locality when transferring data between the two systems.
Operations on structured data (such as those operations expressed in Spark SQL) can be processed by Vertica, Spark, or a combination of both. The connector controls Vertica's table data flowing through query execution plans in parallel; the data is then transferred into Spark's pipeline. Our push-down optimizations identify opportunities to reduce the data volume transferred by allowing Vertica to pre-process the filter, project, join and group-by operators before passing the data into Spark.
When using a simple connection scheme, parallel connections to Vertica quickly saturate the network bandwidth of the database nodes, becoming the bottleneck due to inter-Vertica-node shuffling. This happens because each Spark task pulls a specific partition (range) of the input data, which is typically scattered across the Vertica cluster. To address this, we have devised an innovative solution to reduce data movement within Vertica.
Using a consistent hash ring, the connector guarantees that there is no unnecessary data shuffling inside Vertica, minimizing network bandwidth and optimizing the data flow from Vertica to Spark. Each query execution plan inside Vertica only targets the data (i.e., segment of the data) that is local to each node.
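
The locality guarantee can be illustrated with a minimal consistent hash ring. This sketch shows the general technique only; the node names, virtual-node count, and MD5 hash are illustrative assumptions, not details of the HPE connector:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys (e.g. data segments) to nodes via a consistent hash ring."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []                        # sorted list of (hash, node)
        for node in nodes:
            for v in range(vnodes):            # virtual nodes smooth the balance
                self._ring.append((self._hash(f"{node}#{v}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        i = bisect.bisect(self._hashes, h) % len(self._ring)  # wrap around
        return self._ring[i][1]
```

Because every participant computes the same segment-to-node mapping deterministically, each Spark task can be routed to the database node that already holds its segment, avoiding intra-cluster shuffling.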

Tom Plunkett, Oracle

Big Data has been defined as increasing volumes, velocities, and varieties of information. All three are relevant to cyber security use cases, which depend on analyzing massive streams of internet protocol information in order to detect and defend against a wide range of threats. Applying Big Data to cyber security yields new capabilities for defending information technology assets. This presentation covers the compelling value proposition and how Big Data can fit into a typical data center's security methodology.

Zuhair Yarub Khayyat / InfoCloud Research Group

Inequality joins, which join relational tables on inequality conditions, are used in various applications. While there is a wide range of optimization methods for joins in database systems, from algorithms such as sort-merge join and band join to indices such as B-tree, R*-tree, and Bitmap, inequality joins have received little attention. As a result, queries containing such joins are usually very slow and do not scale to big datasets.
In this talk, we introduce fast and scalable inequality join algorithms. We put the columns to be joined in sorted arrays and use permutation arrays to encode the positions of tuples in one sorted array w.r.t. the other sorted array. In contrast to sort-merge join, we use space-efficient bit-arrays that enable optimizations, such as Bloom filter indices, for fast computation of the join results. We have implemented a centralized version of these algorithms on top of PostgreSQL, and a distributed version on top of Spark SQL. We have compared against well-known optimization techniques for inequality joins and show that our solution is more scalable and several orders of magnitude faster. To the best of our knowledge, our solution is the first that can process inequality joins efficiently on datasets with billions of rows.
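
The core idea that sorting makes inequality predicates cheap can be shown with a deliberately simplified single-condition sketch (the full algorithm additionally uses permutation arrays and bit-arrays to handle two conditions at once; that machinery is omitted here):

```python
import bisect

def inequality_join(left, right):
    """Join pairs (l, r) satisfying l < r, via a sorted right side.

    A nested-loop join compares every pair, O(n*m). Sorting the right
    column lets each left value locate the start of its matches with one
    O(log m) binary search, so non-matching pairs are never touched.
    """
    right_sorted = sorted(right)
    result = []
    for l in left:
        i = bisect.bisect_right(right_sorted, l)   # first r strictly greater than l
        for r in right_sorted[i:]:
            result.append((l, r))
    return result
```

When the join is selective, the binary search skips the bulk of each scan, which is the property the sorted-array representation exploits at scale.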

Chiyoung Seo / Software Architect, Couchbase

The B+-tree has served as one of the main index structures in database systems for a long time. However, with the unprecedented amount of data being generated by modern, global-scale web, mobile, and IoT applications, typical B+-tree implementations are beginning to show scalability and performance issues. Various key-value storage engines based on variants of the B+-tree, such as the log-structured merge tree (LSM-tree), have been proposed to address these limitations. At Couchbase, we have been working on a new key-value storage engine, ForestDB, that uses a new hybrid indexing scheme called HB+-trie, a disk-based trie-like structure combined with B+-trees. In this presentation, we introduce ForestDB and briefly discuss why it is a good fit for modern big data applications.
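
The flavor of the hybrid scheme can be conveyed with a toy two-level index: a trie-style dispatch on a fixed key prefix, with a sorted list per leaf standing in for a small B+-tree. This is only a conceptual sketch under those assumptions; ForestDB's actual HB+-trie is disk-based, variable-depth, and far more sophisticated:

```python
import bisect
from collections import defaultdict

class PrefixIndex:
    """Toy hybrid index: route on a key prefix, then search a sorted leaf.

    Long shared prefixes (common in document keys like "user:...") are
    consumed once at the trie level instead of being re-compared at every
    level of one monolithic tree.
    """

    PREFIX = 4  # bytes of key consumed by the trie-style level (assumption)

    def __init__(self):
        self.leaves = defaultdict(list)  # prefix -> sorted [(suffix, value)]

    def put(self, key, value):
        leaf = self.leaves[key[:self.PREFIX]]
        suffix = key[self.PREFIX:]
        i = bisect.bisect_left(leaf, (suffix,))
        if i < len(leaf) and leaf[i][0] == suffix:
            leaf[i] = (suffix, value)        # update existing key in place
        else:
            leaf.insert(i, (suffix, value))  # keep the leaf sorted

    def get(self, key):
        leaf = self.leaves.get(key[:self.PREFIX], [])
        suffix = key[self.PREFIX:]
        i = bisect.bisect_left(leaf, (suffix,))
        if i < len(leaf) and leaf[i][0] == suffix:
            return leaf[i][1]
        return None
```

Even in this toy form, lookups touch only the one leaf selected by the prefix, which is the locality property the hybrid structure is after.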

Z. Baranowski / Researcher at CERN

Executing analytics queries at scale has never been easy to achieve with relational database systems that run on a shared storage infrastructure. However, there are tools and techniques that, in a relatively simple way and without redesigning the entire architecture of a system, can offload data and SQL queries from RDBMSs like Oracle to Hadoop. This presentation will discuss the technologies and configurations that have been evaluated at CERN for bridging Oracle and Hadoop in order to get the best of both worlds: a relational, transactional system with the ability to perform massive data processing.

Weiwei Gong / Senior Member of Technical Staff at Oracle

Since 2010, engineers from Oracle and Sun have been working together to integrate our software and hardware and wire database functionality directly into our on-die streaming co-processor. By rapidly processing the in-memory compressed columnar format, the co-processor increases the effective memory bandwidth while leaving the general-purpose processors free to execute higher-level logic. This functionality was released in the SPARC M7 last year.
In this lightning talk, we will illustrate how we decompose SQL operators into portable vector primitives, which we implement both in optimized software and on our co-processor, and show that the co-processor is several times faster and many times more energy efficient than the optimized implementation on a general-purpose core. Moreover, the decomposition of SQL operators into primitives leads to a second level of query planning and optimization, as we discover that the optimal combination of primitives varies across data distributions and hardware environments.
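
The decomposition idea can be sketched in plain Python: a SQL predicate becomes a plan over small, independently optimizable vector primitives (elementwise compare, bitmask combine, gather). The primitive names and the example query are illustrative assumptions, not Oracle's actual primitive set:

```python
def vec_gt(column, constant):
    """Primitive: elementwise greater-than, producing a selection bitmask."""
    return [1 if x > constant else 0 for x in column]

def vec_lt(column, constant):
    """Primitive: elementwise less-than, producing a selection bitmask."""
    return [1 if x < constant else 0 for x in column]

def vec_and(mask_a, mask_b):
    """Primitive: combine two selection bitmasks."""
    return [a & b for a, b in zip(mask_a, mask_b)]

def vec_select(column, mask):
    """Primitive: gather the values selected by a bitmask."""
    return [x for x, m in zip(column, mask) if m]

def run_query(a):
    """SELECT a FROM t WHERE a > 5 AND a < 9, as a plan over primitives."""
    return vec_select(a, vec_and(vec_gt(a, 5), vec_lt(a, 9)))
```

Each primitive is a tight data-parallel loop with no branches on row content, which is what makes the same plan portable between optimized software and a streaming co-processor, and what gives the planner multiple equivalent primitive orderings to choose among.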

Gerard Lemson / Research Scientist - IDIES, Johns Hopkins University

There is an increasing need to provide a portal for public, interactive analyses of large numerical simulations, often exceeding several hundred TB in size. These analyses require very localized access patterns. 3D spatial indexing with space-filling curves can easily be mapped onto B-tree-based range queries in relational databases. In particular, linking "landmark regions" extracted in post-processing to the raw underlying data is straightforward. Examples are dark matter halos extracted from large cosmological N-body simulations, or vortex tubes in simulations of isotropic turbulence. While querying such databases is very fast, loading them becomes very cumbersome once data exceeds a few hundred TB. Here we present one solution to this problem, coined FileDB, which is to leave the raw data outside the database in their original binary format and provide functions (implemented in SQLCLR/C#) to access the files efficiently. Our first target data are the Millennium suite of cosmological simulations, the largest of which has 300 billion particles. These data can be treated efficiently because they are highly structured, with particles organized according to a space-filling curve and/or according to the landmark ("dark halos") they belong to. In fact, ingesting them into a relational database would lose this organization, and would require extra columns and an increase in data size due to the page-level storage of many database systems.
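
The space-filling-curve mapping mentioned above can be shown with a minimal Morton (Z-order) key, one common choice of 3-D space-filling curve; whether the Millennium databases use this exact curve is not stated here, so treat it as a representative example:

```python
def morton3d(x, y, z, bits=10):
    """Interleave the bits of three integer coordinates into one Z-order key.

    Nearby points in 3-D tend to receive nearby keys, so a 3-D box query
    can be answered with a handful of 1-D range scans on a B-tree built
    over these keys.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key
```

Sorting particles by such a key is what gives the files their localized layout: a spatial region corresponds to a small number of contiguous key ranges, and hence to a few sequential reads.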

Clark Gaylord

The Virginia Connected Corridors (VCC) is a new initiative developed by the Virginia Tech Transportation Institute (VTTI) in partnership with the Virginia Department of Transportation. The VCC encompasses both the Virginia Smart Road and the Northern Virginia Connected-vehicle Test Bed, which is located along one of the most congested corridors in the United States. To harness data from this facility, we orchestrate analytic pipelines to ensure both rapid availability of data and ingestion into VTTI's research data repository.
