The concept of a "Data Lake" is becoming a widely adopted best practice in constructing an analytic ecosystem. When well executed, a data lake strategy will increase an organization's analytic agility and provide a foundation for provisioning data into both discovery and integrated data warehouse platforms. However, there are many pitfalls with this approach that can be avoided by following deployment best practices. This talk will provide guidance on how to deploy toward the "data reservoir" concept rather than ending up with the all too common "data swamp".
Relational databases are known to have tightly integrated
components that make them hard to extend. This usually leads to new
databases that are specialized for scale and certain workloads.
For
its last three releases, PostgreSQL has provided official APIs
to extend the database *without forking it*. Users can read the
documentation, implement their ideas according to the APIs, and then
dynamically load their shared objects to override or extend any
database submodule.
In this talk, we will demonstrate four example
extensions, built on these APIs, that turn PostgreSQL into a
distributed database focused on real-time workloads:
1. JSON / JSONB: These new data types enable
storing and querying semi-structured data.
2. HyperLogLog: This
extension creates a new data type, user-defined functions, and
aggregate functions to quickly calculate count(distinct)
approximations over large data sets.
3. PGStrom: This extension
offloads particular CPU-intensive workloads to GPU devices and
enables running parallelized SQL on GPUs.
4. Citus: Citus shards,
replicates, and parallelizes SQL queries. It does this by hooking
into Postgres' planner and executor APIs and through transforming
queries into their parallelizable form.
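To make the approximate count(distinct) idea behind item 2 concrete, here is a minimal pure-Python HyperLogLog sketch. This is an illustrative toy under our own assumptions (class name, hash choice, and register count are ours), not the extension's actual C implementation:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: estimates the number of distinct values
    using 2**p small registers instead of storing the values."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                # number of registers
        self.registers = [0] * self.m

    def _hash(self, value):
        # 64-bit hash derived from SHA-1 (any good hash works)
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def add(self, value):
        h = self._hash(value)
        idx = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:              # small-range correction
            est = self.m * math.log(self.m / zeros)
        return est
```

With p=12 (4096 registers, a few kilobytes of state), the estimate is typically within a couple of percent of the true distinct count, which is the trade-off the extension exploits over large data sets.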
In summary, PostgreSQL's extensible architecture puts it in a unique place for scaling out SQL and also for adapting to evolving hardware trends. It could just be that the monolithic SQL database is dying. If so, long live Postgres!
The Hadoop ecosystem is highly complex, with many moving parts. Even with well-thought-out road maps and architectures, there are numerous pitfalls that project teams fall into when implementing Hadoop distribution upgrades, security implementations, and data ingestion capabilities. This presentation will enable you to avoid these anti-patterns and focus on designs that are proven to work.
At Sungkyunkwan University, Korea, we are developing MyFlashSQL, an open-source dialect of MySQL that aims to optimize the MySQL/InnoDB engine on top of flash SSDs and non-volatile RAM. In this talk, we will introduce its motivation, goals, and several ongoing research directions, and share some preliminary results.
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all
sizes. It was designed from the ground up with the ability to scale to the sizes required by companies such as Facebook, Airbnb,
and Dropbox. Using Presto, one can query data where it lives, including Hive, Cassandra, Kafka, and relational databases.
Because of its pluggable architecture, one can write a connector to query from practically any data source
(https://prestodb.io).
This talk will present the Presto architecture and pluggable connectors for a variety of data sources.
The sheer number and diversity of storage devices being deployed in
large distributed data processing systems can present a huge
challenge, greatly impacting system response times with little
visibility into the root cause. Furthermore,
establishing ahead of time when these devices are approaching the
end of their useful life is virtually impossible in large
deployments without a large degree of human intervention.
In this session, we will review a powerful lightweight storage
intelligence layer that operates invisibly at the block layer and
can work with any block- or object-based file system that utilizes
block devices and, soon, memory-class devices as well. As part of
this intelligence layer, a deep learning statistical engine keeps
track of all storage IO operations in real time, across both time
and capacity.
By utilizing behavioral device analytics, we are able to detect,
both locally with machine learning and remotely via low-impact
reporting, when a storage device is behaving badly and, in the best
case, completely remove that device from operation until it is fit
for reinsertion or removed entirely at the next maintenance cycle.
We believe this level of automation will dramatically lower the
likelihood of persistent long-tail latencies caused by storage
devices in systems deploying device analytics.
We have implemented a flexible User Defined Operator (UDO) for labeling connected components for a binary mask expressed as an array in SciDB, which is a parallel distributed database management system based on the array data model. This UDO is able to process very large multidimensional arrays by exploiting SciDB's memory management mechanism that efficiently manipulates arrays whose memory requirements far exceed available physical memory. The UDO takes as primary inputs a binary mask array and a binary stencil array that specifies the connectivity of a given cell to its neighbors. The UDO returns an array of the same shape as the input mask array with each cell containing the label of the component it belongs to. By default, dimensions are treated as non-periodic, but the UDO also accepts optional input parameters to specify periodicity in any of the dimensions. The UDO requires four stages to completely label connected components. In the first stage, labels are computed on each "chunk" (SciDB terminology for subarrays local to individual nodes) of the mask in parallel across SciDB instances using the weighted quick union (WQU) with half-path compression algorithm. In the second stage, labels around chunk boundaries from the first stage are stored in a temporary SciDB array that is then replicated across all SciDB instances. Equivalences are resolved by again applying the WQU algorithm to these boundary labels. In the third stage, relabelling is done for each chunk using the resolved equivalences. In the fourth stage, the resolved labels, which so far are "flattened" coordinates of the original binary mask array, are renamed with sequential integers for legibility. The UDO is demonstrated on a 3-D mask of size O(10^11) elements, with O(10^8) foreground cells and O(10^6) connected components. The operator completes in 1.5 hours using 6 SciDB instances.
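The per-chunk labeling of the first stage, and the sequential renaming of the fourth, can be sketched in miniature. The pure-Python toy below labels 4-connected components of a small 2-D mask using weighted quick union with half-path compression; it illustrates the algorithm only and is not the SciDB UDO itself (all names are ours):

```python
class WQU:
    """Weighted quick-union with half-path compression."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # halve the path
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:   # attach smaller tree under larger
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

def label_components(mask):
    """Label 4-connected components of a 2-D binary mask (non-periodic)."""
    rows, cols = len(mask), len(mask[0])
    uf = WQU(rows * cols)
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            if r > 0 and mask[r - 1][c]:          # merge with cell above
                uf.union(r * cols + c, (r - 1) * cols + c)
            if c > 0 and mask[r][c - 1]:          # merge with cell to the left
                uf.union(r * cols + c, r * cols + c - 1)
    # rename roots to sequential labels, as in the UDO's fourth stage
    labels, next_label = {}, 1
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                root = uf.find(r * cols + c)
                if root not in labels:
                    labels[root] = next_label
                    next_label += 1
                out[r][c] = labels[root]
    return out
```

The real operator runs this per chunk in parallel and then resolves cross-chunk equivalences in its second and third stages; the toy above corresponds to labeling a single chunk.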
Analytics is the science of examining raw data to uncover hidden patterns and correlations, with the purpose of drawing conclusions about the data. It differs from traditional Business Intelligence in that, while Business Intelligence may involve very complex queries, in that case we still know what we are looking for. Analytics, on the other hand, answers questions that we cannot formulate precisely - for example, how many different types of customers we have and what their attributes are, or which three medicines taken together are most likely to cause severe side effects.
To uncover hidden patterns, analytic algorithms (for example, K-Means clustering) need to scan the data many times and perform compute-heavy operations. To execute these algorithms in a reasonable time, people often use a cluster of machines, which provides both large memory and sufficient computation power, together with a parallel processing framework like Hadoop.
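As a concrete example of such a multi-scan algorithm, here is a minimal Lloyd's-algorithm K-Means in pure Python (an illustrative sketch, not production code): note that every iteration re-reads the entire data set, which is exactly why large inputs call for a cluster or faster hardware.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm on 2-D points; each iteration
    performs one full scan of the data."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)               # initial centers from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # full data scan per iteration
            i = min(range(k),
                    key=lambda j: (p[0] - centers[j][0]) ** 2
                                + (p[1] - centers[j][1]) ** 2)
            clusters[i].append(p)
        for j, c in enumerate(clusters):          # recompute each centroid
            if c:
                centers[j] = (sum(x for x, _ in c) / len(c),
                              sum(y for _, y in c) / len(c))
    return centers
```

On well-separated data this converges in a few iterations, but each iteration's full scan over billions of points is what dominates the runtime at scale.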
Recently, however, another approach is emerging: using special types of CPUs that can perform certain parallel operations blazingly fast. Machines built with these special CPUs can replace smaller clusters or reduce large clusters to a more manageable size.
During this presentation, I will talk about the special analytic instructions in Oracle's latest SPARC M7 processor and their ability to query compressed data, effectively increasing the memory size many times over. I will share some performance results, where one gets 5X to 20X speedups for well-known machine learning algorithms.
We present a Vertica data connector for Spark that integrates with
Spark's Datasource API so that data from Vertica can be efficiently
loaded into Spark. A simple JDBC Spark Datasource that loads all
data of a Vertica table into Spark is not optimal, because it does
not take advantage of the pre-processing capabilities of Vertica to
reduce the amount of data to be transferred or leverage parallel
processing effectively. Our solution connects the computational
pipelines of Spark and Vertica in an optimized way, and not only
utilizes parallel channels for data movement, but also (a) pushes
computation down into Vertica when appropriate, and (b) maintains
data-locality when transferring data between the two systems.
Operations on structured data (such as those operations expressed in
Spark SQL) can be processed by Vertica, Spark, or a combination of
both. The connector controls Vertica's table data flowing through
query execution plans in parallel; the data is then transferred into
Spark's pipeline. Our push-down optimizations identify opportunities
to reduce the data volume transferred by allowing Vertica to
pre-process the filter, project, join and group-by operators before
passing the data into Spark.
When using a simple connection scheme,
parallel connections to Vertica quickly saturate the network
bandwidth of the database nodes, becoming the bottleneck due to
inter-Vertica-node shuffling. This happens because each Spark task
pulls a specific partition (range) of the input data, which is
typically scattered across the Vertica cluster. To address this, we
have devised an innovative solution to reduce data movement within
Vertica.
Using a consistent hash ring, the connector guarantees that
there is no unnecessary data shuffling inside Vertica, minimizing
network bandwidth and optimizing the data flow from Vertica to
Spark. Each query execution plan inside Vertica only targets the
data (i.e., segment of the data) that is local to each node.
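The consistent-hashing idea can be sketched as follows. This is a generic toy ring in pure Python; the node names, virtual-node count, and hash choice are our own illustration, not the connector's actual implementation:

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent hash ring: maps segment keys to nodes so that
    adding or removing a node only remaps a small fraction of keys."""

    def __init__(self, nodes, vnodes=64):
        # each node appears at `vnodes` pseudo-random positions on the ring
        self.ring = sorted(
            (self._hash(f"{node}:{v}"), node)
            for node in nodes for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, key):
        # the first ring position clockwise from the key's hash owns it
        i = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[i][1]
```

Because every ring built from the same node list places keys identically, both sides of a transfer can agree on which node owns which data segment without extra coordination, which is what lets each query plan target only node-local data.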
Big Data has been defined as increasing volumes, velocities, and varieties of information, all three of which are relevant to cyber security use cases, which depend on analyzing massive streams of internet protocol information in order to detect and defend against a wide range of threats. Applying Big Data to Cyber Security results in new capabilities to defend information technology assets. This presentation covers the compelling value proposition and how Big Data can fit into a typical data center's security methodology.
Inequality joins, which join relational tables on inequality
conditions, are used in various applications. While there have been
a wide range of optimization methods for joins in database systems,
from algorithms such as sort-merge join and band join, to various
indices such as B-tree, R*-tree and Bitmap, inequality joins have
received little attention. As a result, queries containing such
joins are usually very slow and do not scale to big datasets.
In
this talk, we introduce fast and scalable inequality join
algorithms. We put columns to be joined in sorted arrays and we use
permutation arrays to encode positions of tuples in one sorted array
w.r.t. the other sorted array. In contrast to sort-merge join, we
use space efficient bit-arrays that enable optimizations, such as
Bloom filter indices, for fast computation of the join results. We
have implemented a centralized version of these algorithms on top of
PostgreSQL, and a distributed version on top of Spark SQL. We have
compared against well-known optimization techniques for inequality
joins and show that our solution is more scalable and several orders
of magnitude faster. To the best of our knowledge, our solution is
the first that can process inequality joins efficiently on datasets
with billions of rows.
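The core idea can be sketched for a self-join with two inequality predicates. The pure-Python toy below is our own simplification (it assumes distinct attribute values and scans the bit-array naively, without the Bloom-filter acceleration mentioned above): it uses two sorted orders, a permutation of one order with respect to the other, and a bit-array.

```python
def ie_self_join(rows):
    """Return all pairs (i, j) with rows[i][0] < rows[j][0] and
    rows[i][1] > rows[j][1] (e.g. a.time < b.time AND a.cost > b.cost).
    Assumes distinct attribute values, for simplicity."""
    n = len(rows)
    by_a = sorted(range(n), key=lambda i: rows[i][0])     # ascending first attr
    pos_a = {i: p for p, i in enumerate(by_a)}            # permutation array
    by_b_desc = sorted(range(n), key=lambda i: -rows[i][1])
    bits = [0] * n                                        # bit-array over a-ranks
    result = []
    for j in by_b_desc:       # every already-visited i has rows[i][1] > rows[j][1]
        p = pos_a[j]
        for q in range(p):    # set bits at ranks < p mean rows[i][0] < rows[j][0]
            if bits[q]:
                result.append((by_a[q], j))
        bits[p] = 1
    return result
```

The bit-array is the key trick: after sorting, each inequality predicate becomes a simple positional comparison, so join results fall out of scanning set bits rather than comparing tuples pairwise.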
The B+-tree has long been one of the main index structures in database systems. However, with the unprecedented amount of data being generated by modern, global-scale web, mobile, and IoT applications, typical B+-tree implementations are beginning to show scalability and performance issues. Various key-value storage engines with B+-tree variants, such as the log-structured merge tree (LSM-tree), have been proposed to address these limitations. At Couchbase, we have been working on a new key-value storage engine, ForestDB, that uses a new hybrid indexing scheme called HB+-trie, a disk-based trie-like structure combined with B+-trees. In this presentation, we introduce ForestDB and briefly discuss why it is a good fit for modern big data applications.
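The layering of the HB+-trie can be caricatured as a trie over fixed-size key chunks whose leaves are sorted maps standing in for disk-resident B+-trees. The sketch below is a drastic simplification in pure Python (the real structure expands trie nodes lazily and lives on disk; the class name, chunk size, and dict-based "B+-trees" are all our own stand-ins):

```python
class HBTrieSketch:
    """Toy trie over fixed-size key chunks; each node's `leaf` dict
    stands in for a B+-tree holding short key suffixes."""
    CHUNK = 4   # bytes of key consumed per trie level (our choice)

    def __init__(self):
        self.children = {}   # key chunk -> child HBTrieSketch
        self.leaf = {}       # short key suffix -> value ("B+-tree" stand-in)

    def put(self, key, value):
        if len(key) <= self.CHUNK:
            self.leaf[key] = value
        else:
            child = self.children.setdefault(key[:self.CHUNK], HBTrieSketch())
            child.put(key[self.CHUNK:], value)

    def get(self, key):
        if len(key) <= self.CHUNK:
            return self.leaf.get(key)
        child = self.children.get(key[:self.CHUNK])
        return child.get(key[self.CHUNK:]) if child else None
```

The payoff of the hybrid layout is that long keys with shared prefixes, common in document stores, are stored once along the shared trie path instead of being repeated inside every index node.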
Executing analytics queries at scale has never been easy to achieve with relational database systems that run on a shared storage infrastructure. However, there are tools and techniques that can, in a relatively simple way and without redesigning a system's entire architecture, offload data and SQL queries from RDBMSs like Oracle to Hadoop. This presentation will discuss the technologies and configurations that have been evaluated at CERN for bridging Oracle and Hadoop in order to get the best of both worlds: a relational, transactional system with the ability to perform massive data processing.
Since 2010, engineers from Oracle and Sun have been working
together to integrate our software and hardware and wire database
functionality directly into our on-die streaming co-processor. By
rapidly processing data in an in-memory compressed columnar format,
the co-processor increases the effective memory bandwidth, while
leaving the general-purpose processors free to execute higher-level
logic. This functionality was released in the SPARC M7 last year.
In this lightning talk, we will illustrate how we decompose SQL
operators into portable vector primitives, which we implement both
in optimized software and on our co-processor, and show that the
co-processor is several times faster and many times more energy
efficient than the optimized implementation on a general-purpose
core. Moreover, the decomposition of SQL operators into primitives
leads to a second level of query planning and optimization, as we
discover that the optimal combination of primitives varies across
data distributions and hardware environments.
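The flavor of this decomposition can be sketched in pure Python (our own toy; the real primitives are vectorized in optimized native code and on the co-processor). A query like SELECT SUM(b) FROM t WHERE a > 5 splits into three primitives:

```python
def vec_cmp_gt(col, const):
    """Primitive 1: compare a column against a constant -> selection bitmap."""
    return [1 if v > const else 0 for v in col]

def vec_gather(col, bitmap):
    """Primitive 2: gather the column values whose bitmap bit is set."""
    return [v for v, bit in zip(col, bitmap) if bit]

def vec_sum(col):
    """Primitive 3: reduce a vector to its sum."""
    return sum(col)

def filtered_sum(a, b, const):
    """SELECT SUM(b) FROM t WHERE a > const, as a pipeline of primitives."""
    return vec_sum(vec_gather(b, vec_cmp_gt(a, const)))
```

Once operators are expressed this way, a second-level planner can choose, per primitive, between the software and co-processor implementations, or reorder and fuse primitives depending on selectivity and data distribution.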
There is an increasing need to provide a portal for public, interactive analyses of large numerical simulations, often exceeding several hundred TB in size. Such analyses require very localized access patterns. 3D spatial indexing with space-filling curves can easily be mapped onto B-tree-based range queries in relational databases. In particular, linking "landmark regions" extracted in post-processing to the raw underlying data is quite trivial. Examples are dark matter halos extracted from large cosmological N-Body simulations, or vortex tubes in simulations of isotropic turbulence. While querying such databases is very fast, loading them gets very cumbersome once data exceeds a few hundred TB. Here we present one solution to this problem, coined FileDB, which is to leave the raw data outside the database in their original binary format and provide functions (implemented in SQLCLR/C#) to access the files efficiently. Our first target data are the Millennium suite of cosmological simulations, the largest of which has 300 billion particles. These data can be treated efficiently because they are highly structured, with particles organized according to a space-filling curve and/or according to the landmark ("dark halos") they belong to. In fact, ingesting them in a relational database would lose this organization, which would require extra columns and an increase in data size due to the page-level storage of many database systems.
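As an illustration of the indexing idea (not FileDB's actual code), a 3-D Morton/Z-order key interleaves the bits of the coordinates so that spatially nearby points receive nearby keys, letting a box query be answered by a small set of contiguous 1-D range scans on a B-tree:

```python
def morton3d(x, y, z, bits=10):
    """Interleave the bits of non-negative integer coordinates into one
    Z-order key; nearby points get nearby keys, so 3-D box queries
    become a small set of 1-D range scans on a B-tree index."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)        # x bits at positions 0, 3, 6, ...
        key |= ((y >> i) & 1) << (3 * i + 1)    # y bits at positions 1, 4, 7, ...
        key |= ((z >> i) & 1) << (3 * i + 2)    # z bits at positions 2, 5, 8, ...
    return key
```

Storing each particle's Morton key as an indexed column (or sorting the binary files by it, as the simulations above effectively do) is what makes the "very localized access patterns" cheap.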
The Virginia Connected Corridors (VCC) is a new initiative developed by the Virginia Tech Transportation Institute (VTTI) in partnership with the Virginia Department of Transportation. The VCC encompasses both the Virginia Smart Road and the Northern Virginia Connected-vehicle Test Bed, which is located along one of the most congested corridors in the United States. To harness data from this facility, we orchestrate analytic pipelines to ensure both rapid availability of data and ingestion into VTTI's research data repository.