HydraCube - Parallel Aggregation Engine

HydraCube is open source parallel software that provides scalable Online Analytical Processing (OLAP) capabilities such as aggregation, slicing and dicing of multi dimensional data. The objective is to build parallel software that runs on a cluster of inexpensive commodity desktop machines, giving a system that is both highly scalable and economical.

What is HydraCube
What HydraCube is not
Distributed Strategy
Design Decisions
HydraCube Architecture
Deployment Architecture
Sparsity And Data Explosion

What is HydraCube

Is a multi dimensional OLAP (MOLAP) engine
Is data parallel software that can run on a cluster of commodity desktop machines
Handles sparsity very well
Data explosion is not an issue because aggregate data is neither pre computed nor stored
Aggregates data at query time and has a low memory footprint
Supports basic Multi Dimensional Expression (MDX) for queries
Supports SQL like syntax for dimension and cube management
Supports a multi user environment
Employs a client server architecture
Server is a parallel engine using message passing interface (MPI) middleware
Uses embedded BerkeleyDB for efficient data management
Has a command line utility for invoking commands
Has client APIs for integrating with applications
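
The idea of aggregating at query time over sparse data can be illustrated with a minimal sketch. It is written in Python for brevity and is not HydraCube's actual implementation (which is in C++); the dimension members, the REGION_PARENT hierarchy and the aggregate function are all hypothetical names chosen for this example. Only populated leaf cells are stored, and totals for higher hierarchy levels are computed when a query asks for them.

```python
# Hypothetical two-level hierarchy for one dimension: leaf member -> parent.
REGION_PARENT = {"Paris": "Europe", "Berlin": "Europe", "Tokyo": "Asia"}

# Sparse storage: only populated leaf cells are kept.
# Key = (city, product), value = measure. No aggregates are stored.
CELLS = {
    ("Paris", "Bikes"): 10.0,
    ("Berlin", "Bikes"): 5.0,
    ("Tokyo", "Cars"): 7.0,
}


def aggregate(cells, region=None, product=None):
    """Sum the leaf cells that fall inside the requested slice.

    A filter left as None means "all members" of that dimension.
    """
    total = 0.0
    for (city, prod), value in cells.items():
        if region is not None and REGION_PARENT[city] != region:
            continue  # cell lies outside the requested region
        if product is not None and prod != product:
            continue  # cell lies outside the requested product
        total += value
    return total
```

Because nothing above the leaf level is stored, adding dimensions or hierarchy levels does not multiply the stored data; the cost is paid as computation at query time instead.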

HydraCube applicability

What HydraCube is not

Not a relational OLAP (ROLAP) engine. It does not need an RDBMS for storing data.
Not an OLAP graphical viewer
Not a plug-in or module for any other OLAP framework

Distributed Strategy

The main challenge of multi dimensional data analysis is the sheer volume of data and computation involved. Complexity grows rapidly as dimensions are added, because the intersection space expands with every dimension. The data must be accessed, computed and aggregated along every dimension, with multi level hierarchical grouping.

This demands processing power and memory on a scale normally found in Symmetric Multi Processing (SMP) computers. But SMP has the following disadvantages:
- Cost per computation is high
- Scalability is a question

SMP has the advantage of shared memory access for the parallel processors which gives better performance due to lower latency.

HydraCube's architecture allows it to run on shared nothing desktop servers that together form a cluster. Multi dimensional cube building, data storage, querying and computation are all distributed across the cluster nodes, so capacity and performance scale by adding nodes.
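
A shared nothing aggregation of this kind can be sketched as follows. This is a Python illustration under stated assumptions, not HydraCube's code: hash partitioning is only one possible placement scheme (the text does not say which one HydraCube uses), plain function calls stand in for MPI messages, and owner_node, partition, local_aggregate and combine are hypothetical names.

```python
import zlib
from collections import defaultdict

# Sample sparse cells: key = (city, product), value = measure.
SAMPLE_CELLS = {
    ("Paris", "Bikes"): 10.0,
    ("Berlin", "Bikes"): 5.0,
    ("Tokyo", "Cars"): 7.0,
}


def owner_node(cell_key, num_nodes):
    """Map a cell key to a node deterministically (assumed hash partitioning)."""
    return zlib.crc32(repr(cell_key).encode("utf-8")) % num_nodes


def partition(cells, num_nodes):
    """Split the cells into one private dictionary per node."""
    parts = [dict() for _ in range(num_nodes)]
    for key, value in cells.items():
        parts[owner_node(key, num_nodes)][key] = value
    return parts


def local_aggregate(part, group_index):
    """Node side: sum only the local cells, grouped by one dimension."""
    sums = defaultdict(float)
    for key, value in part.items():
        sums[key[group_index]] += value
    return dict(sums)


def combine(partials):
    """Master side: merge the small per-node partial results."""
    result = defaultdict(float)
    for partial in partials:
        for group, value in partial.items():
            result[group] += value
    return dict(result)
```

Each node touches only its own partition, and only the small partial results travel to the master, which keeps traffic on the interconnect low.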

The downside is the network latency of the interconnect, which is addressed by the following approaches:
- Use high bandwidth interconnects like Myrinet / Giganet
- Have a distribution strategy which minimizes the data communication among the nodes
- Employ concurrent file access with data partitioning
- Employ asynchronous disk I/O which is concurrent with the computations to hide the latency
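
The last approach, overlapping disk I/O with computation, can be sketched as below. This is an illustrative Python sketch rather than HydraCube's implementation: a background thread stands in for true asynchronous I/O, and the chunk file layout (packed little-endian doubles) and the names write_chunk, read_chunk and sum_chunks are assumptions made for the example.

```python
import struct
from concurrent.futures import ThreadPoolExecutor


def write_chunk(path, values):
    """Store a chunk as packed little-endian doubles (assumed layout)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}d", *values))


def read_chunk(path):
    """Read a whole chunk file back into a tuple of floats."""
    with open(path, "rb") as f:
        raw = f.read()
    return struct.unpack(f"<{len(raw) // 8}d", raw)


def sum_chunks(paths):
    """Sum all chunks, reading chunk i+1 while chunk i is being summed."""
    if not paths:
        return 0.0
    total = 0.0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_chunk, paths[0])
        for next_path in list(paths[1:]) + [None]:
            values = pending.result()  # wait for the current chunk
            if next_path is not None:
                # Start the next read before computing on the current chunk.
                pending = pool.submit(read_chunk, next_path)
            total += sum(values)  # computation overlaps the pending read
    return total
```

While one chunk is being summed, the read of the next chunk is already in flight, so part of the disk latency is hidden behind computation.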

Design Decisions

Hardware – Parallel, distributed, shared nothing commodity cluster with high speed interconnect
[Ethernet (100 Mbps) / Myrinet / Gigabit Ethernet (1 Gbps)]
-Or-
Symmetric Multi Processing (SMP) shared memory systems
File system – Native operating system file system
Middleware – Industry standard message passing interface (MPI)
Storage Format – Multi dimensional proprietary data storage with efficient sparse data compression
Storage Engine – Berkeley DB embedded
Query language – Multi Dimensional Expression (MDX)
Client API – C++ / Java Client API (SDK)
Data Import / Export – Delimited text files
Implementation language – C++

HydraCube Architecture

The architecture of the HydraCube server daemon, client SDK and client command shell is as shown.

HydraCube Architecture Diagram

hydracubed is the parallel daemon software which runs across multiple shared nothing desktop servers (or on SMPs) and executes the commands.

Interconnect can be normal 100 Mbps Ethernet, or Gigabit Ethernet or Myrinet for better performance and lower latency.

MPI is the messaging middleware, and we use MPICH, which is freely available from Argonne National Laboratory (ANL). The current version of MPICH is not thread safe for concurrent execution (send / receive). We employ a comms framework to overcome this issue and also to provide us with a higher level programming abstraction.
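
One common way to work around a messaging library that is not thread safe is to funnel every call through a single dedicated thread. The sketch below shows that pattern in Python; it is an assumption about how such a comms layer could be built, not a description of HydraCube's comms framework. CommsChannel is a hypothetical name, and transport_send is a stand-in for a blocking MPI send.

```python
import queue
import threading


class CommsChannel:
    """Funnels all sends through one thread so that the underlying
    transport is never entered by two threads at the same time."""

    _STOP = object()  # sentinel that tells the comms thread to exit

    def __init__(self, transport_send):
        # transport_send(dest, payload) is a stand-in for a blocking MPI send.
        self._transport_send = transport_send
        self._outbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, dest, payload):
        """Safe to call from any thread: the caller only enqueues."""
        self._outbox.put((dest, payload))

    def close(self):
        """Flush everything queued so far, then stop the comms thread."""
        self._outbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        # Only this thread ever calls into the transport.
        while True:
            item = self._outbox.get()
            if item is self._STOP:
                break
            dest, payload = item
            self._transport_send(dest, payload)
```

Worker threads call send freely, while the transport itself sees strictly serialized calls. The same queue based interface also gives callers a higher level abstraction than raw send and receive calls.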

CMD layer has the commands which are created and sent by the master process and executed remotely on every node. It works along with the comms layer to achieve this.
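
The master/node command flow can be sketched as below. This is an illustrative Python sketch, not HydraCube's CMD layer: Command, Node and broadcast are hypothetical names, the three commands are invented for the example, and a direct method call stands in for sending the command over the comms layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """A request created by the master (hypothetical shape)."""
    name: str
    args: tuple = ()


class Node:
    """One cluster node holding its own partition of the data."""

    def __init__(self, rank, values):
        self.rank = rank
        self.values = list(values)
        # Command name -> handler executed locally on this node.
        self._handlers = {
            "count": self._count,
            "sum": self._sum,
            "scale": self._scale,
        }

    def execute(self, command):
        handler = self._handlers.get(command.name)
        if handler is None:
            raise ValueError(f"unknown command: {command.name}")
        return handler(*command.args)

    def _count(self):
        return len(self.values)

    def _sum(self):
        return sum(self.values)

    def _scale(self, factor):
        self.values = [v * factor for v in self.values]
        return len(self.values)


def broadcast(nodes, command):
    """Master side: run the same command on every node, collect replies by rank."""
    return {node.rank: node.execute(command) for node in nodes}
```

For example, broadcast(nodes, Command("sum")) asks every node for the sum of its local partition and returns the replies keyed by node rank.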

MODEL layer has all the hierarchy & cube management code along with OLAP query logic. It does all business validations and computations. It runs on every node.

DAO is an abstraction over BerkeleyDB and helps us persist hierarchies, cubes and data chunks.
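
A data access object over a key-value store might look like the following minimal Python sketch. It is not HydraCube's DAO: the ChunkDao class, its key layout and the use of pickle for serialization are assumptions for illustration, and a plain dict stands in for an embedded Berkeley DB handle so that the example stays self-contained.

```python
import pickle


class ChunkDao:
    """Persists cube data chunks in a key-value store.

    `store` is any mapping from bytes keys to bytes values; a plain dict
    is used here as a stand-in for an embedded Berkeley DB database.
    """

    def __init__(self, store):
        self._store = store

    @staticmethod
    def _key(cube, chunk_id):
        # Hypothetical key layout: "<cube>/chunk/<zero padded id>".
        return f"{cube}/chunk/{chunk_id:08d}".encode("utf-8")

    def save_chunk(self, cube, chunk_id, cells):
        """Serialize one chunk of sparse cells and store it under its key."""
        self._store[self._key(cube, chunk_id)] = pickle.dumps(cells)

    def load_chunk(self, cube, chunk_id):
        """Return the stored cells, or an empty dict for a missing chunk."""
        raw = self._store.get(self._key(cube, chunk_id))
        return {} if raw is None else pickle.loads(raw)

    def chunk_ids(self, cube):
        """List the ids of all chunks stored for a cube, in ascending order."""
        prefix = f"{cube}/chunk/".encode("utf-8")
        return sorted(
            int(key[len(prefix):].decode("utf-8"))
            for key in self._store
            if key.startswith(prefix)
        )
```

Code in the layers above works only with cubes and chunks; the key encoding and serialization stay hidden behind the DAO, so the storage engine could be swapped without touching the model code.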

ServerLib/Parser layer has the controller logic and the MDX parser and runs in the master.

serverd/CORBA layer is the CORBA interface for clients to connect. Because it supports IIOP, any client (either C++ or Java) can connect to the server. We use omniORB for CORBA support.

CORBA STUB is the client stub in C++ which client SDK can use.

client SDK is written in C++; a Java version could be written without much effort.

ClientCMD is the client command line utility to execute HydraCube commands and queries. It is an interactive batch utility which accepts commands and queries, gets them executed on the server and displays the results.

Web Service is a possible future implementation and would use the client SDK to interact with the server.

Deployment Architecture

Option 1: Deployed on shared nothing commodity clusters, also called a network of workstations (NOW).

Network of Workstations (NOW)

Option 2: Deployed on symmetric multi processing (SMP) systems.

Symmetric Multi Processing (SMP)