HydraCube is open source parallel software that provides scalable Online Analytical Processing (OLAP) capabilities such as aggregation, slicing and dicing of multidimensional data. The objective is to build parallel software that runs on a cluster of inexpensive commodity desktop machines, making it both scalable and economical.
 

 

What is HydraCube  
  • Is a multidimensional OLAP (MOLAP) engine
  • Is data parallel software that can run on a cluster of commodity desktop machines
  • Handles sparsity very well
  • Data explosion is not an issue because aggregate data is neither precomputed nor stored
  • Aggregates data at query time and has a low memory footprint
  • Supports basic Multidimensional Expressions (MDX) for queries
  • Supports SQL-like syntax for dimension and cube management
  • Supports multi-user environments
  • Employs a client-server architecture
  • Server is a parallel engine built on Message Passing Interface (MPI) middleware
  • Uses embedded Berkeley DB for efficient data management
  • Has a command line utility for invoking commands
  • Has client APIs for integrating with applications
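
To make the "aggregate at query time" point concrete, here is a minimal sketch in Python. It is not HydraCube code: the dimension names, the sample cells and the `aggregate` function are all assumptions chosen for illustration. Only populated cells are stored, and totals are computed when a query asks for them rather than being precomputed.

```python
from collections import defaultdict

# Hypothetical dimensions and sparse fact store: only populated cells are kept.
DIMENSIONS = ("product", "region", "month")

cells = {
    ("laptop", "north", "2005-01"): 120.0,
    ("laptop", "south", "2005-01"): 80.0,
    ("phone", "north", "2005-01"): 50.0,
    ("phone", "north", "2005-02"): 70.0,
}


def aggregate(cells, group_by, where=None):
    """Sum cell values at query time, grouped by the named dimensions.

    `where` maps a dimension name to the single member to keep (a slice).
    """
    where = where or {}
    group_idx = [DIMENSIONS.index(d) for d in group_by]
    filters = [(DIMENSIONS.index(d), m) for d, m in where.items()]
    totals = defaultdict(float)
    for coords, value in cells.items():
        # Skip cells outside the requested slice.
        if all(coords[i] == m for i, m in filters):
            totals[tuple(coords[i] for i in group_idx)] += value
    return dict(totals)
```

Calling `aggregate(cells, ["product"])` rolls up over region and month, while `aggregate(cells, ["region"], {"month": "2005-01"})` slices on one month first. Nothing is stored for combinations that are never queried, which is why storage grows with the populated cells only.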

 

HydraCube applicability

 

What HydraCube is not 
  • Not a relational OLAP engine; it does not need an RDBMS for storing data
  • Not an OLAP graphical viewer
  • Not a plug-in or module for any other OLAP framework

 

Distributed Strategy
The main challenge of multidimensional data analysis is the sheer volume of data and the amount of computation involved. Complexity grows rapidly as dimensions are added, because the intersection space is the cross product of the members of every dimension. The data must be accessed, computed and aggregated along all dimensions, with grouping at multiple hierarchy levels.
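
A short worked example shows how quickly the numbers grow. The cardinalities and hierarchy depths used here are hypothetical, and the two helper functions are illustrative only.

```python
from math import prod


def intersection_space(cardinalities):
    """Number of cells in the full cross product of the dimensions."""
    return prod(cardinalities)


def group_by_combinations(levels_per_dimension):
    """Number of distinct aggregation levels a cube can be queried at.

    Each dimension contributes one choice per hierarchy level, plus the
    choice of aggregating that dimension away entirely ("all").
    """
    return prod(levels + 1 for levels in levels_per_dimension)
```

With 1,000 products, 100 regions and 365 days, `intersection_space([1000, 100, 365])` gives 36,500,000 potential cells. Adding a fourth dimension of 50 sales channels raises that to 1,825,000,000. If the three original dimensions have 3, 2 and 4 hierarchy levels, `group_by_combinations([3, 2, 4])` gives 60 different groupings that a query may request.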

This demands processing power and memory capacity that are normally found in Symmetric Multi Processing (SMP) computers. But SMP systems have the following disadvantages:
- Cost per computation is high
- Scalability is questionable

On the other hand, SMP offers shared memory access across its parallel processors, which gives better performance due to lower latency.

HydraCube's architecture allows it to run on shared-nothing desktop servers that together form the cluster environment. Multidimensional cube building, data storage, querying and computation are all distributed across the cluster nodes, so capacity and performance can scale by adding nodes.
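
The idea of distributing both storage and computation can be sketched as follows. This is an illustration under stated assumptions, not HydraCube's actual distribution strategy: cells are assigned to nodes by a hash of their coordinates, each node sums only its own partition, and only the small partial totals are merged. The nodes are simulated in one process; in a real cluster the merge step would go over the interconnect.

```python
import zlib
from collections import defaultdict


def partition(cells, num_nodes):
    """Assign every cell to exactly one node (assumed hash partitioning)."""
    parts = [{} for _ in range(num_nodes)]
    for coords, value in cells.items():
        node = zlib.crc32(repr(coords).encode("utf-8")) % num_nodes
        parts[node][coords] = value
    return parts


def local_aggregate(part, dim_index):
    """Work done on one node: sum its own cells by one dimension."""
    totals = defaultdict(float)
    for coords, value in part.items():
        totals[coords[dim_index]] += value
    return totals


def distributed_aggregate(cells, num_nodes, dim_index):
    """Merge the per-node partial totals into the final answer."""
    partials = [local_aggregate(p, dim_index) for p in partition(cells, num_nodes)]
    merged = defaultdict(float)
    for partial in partials:
        for member, subtotal in partial.items():
            merged[member] += subtotal
    return dict(merged)
```

The result is the same for any number of nodes; only the partial totals cross node boundaries, never the raw cells.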

The downside is the network latency of the interconnect, which is addressed by the following approaches:
- Use high bandwidth interconnects like Myrinet / Giganet
- Have a distribution strategy which minimizes the data communication among the nodes
- Employ concurrent file access with data partitioning
- Employ asynchronous disk I/O which is concurrent with the computations to hide the latency
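
The last approach, overlapping disk I/O with computation, can be sketched as a simple prefetch loop. This is a minimal illustration rather than HydraCube's implementation: `load_chunk` is a hypothetical callable standing in for a disk read, and a background thread stands in for asynchronous I/O.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_chunks(chunk_ids, load_chunk):
    """Sum all values while the next chunk is loaded in the background."""
    total = 0.0
    if not chunk_ids:
        return total
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_chunk, chunk_ids[0])
        for next_id in chunk_ids[1:]:
            current = pending.result()                  # wait for the chunk in flight
            pending = pool.submit(load_chunk, next_id)  # start reading the next one
            total += sum(current)                       # compute while it loads
        total += sum(pending.result())                  # last chunk
    return total
```

While one chunk is being summed, the read of the following chunk is already in progress, so the time spent waiting on the disk is partly hidden behind the computation.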

 

Design Decisions
  • Hardware – Parallel, distributed, shared-nothing commodity cluster with high speed interconnect
    [Ethernet (100 Mbps) / Myrinet / Gigabit Ethernet (1 Gbps)]
    -Or-
    Symmetric Multi Processing (SMP) Shared Memory Systems
  • File system – Native operating system file system
  • Middleware – Industry standard message passing interface (MPI)
  • Storage Format – Proprietary multidimensional data storage with efficient sparse data compression
  • Storage Engine – Embedded Berkeley DB
  • Query language – Multidimensional Expressions (MDX)
  • Client API – C++ / Java client API (SDK)
  • Data Import / Export – Delimited text files
  • Implementation language – C++
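
The storage format itself is not described here, so the following is only a sketch of one common way to compress sparse multidimensional data: each populated cell inside a chunk is stored as an (offset, value) pair, where the offset is the cell's row-major position in the chunk. The function names and the chunk shape are assumptions for illustration.

```python
def to_offset(coords, shape):
    """Row-major linear offset of `coords` inside a chunk of size `shape`."""
    offset = 0
    for c, n in zip(coords, shape):
        offset = offset * n + c
    return offset


def from_offset(offset, shape):
    """Inverse of to_offset: recover the coordinates from a linear offset."""
    coords = []
    for n in reversed(shape):
        offset, c = divmod(offset, n)
        coords.append(c)
    return tuple(reversed(coords))


def compress(cells, shape):
    """Keep only populated cells, as (offset, value) pairs sorted by offset."""
    return sorted((to_offset(c, shape), v) for c, v in cells.items())
```

A 4 x 5 x 6 chunk has 120 possible cells, but a chunk with two populated cells is stored as just two pairs. One integer per cell replaces a full coordinate tuple, and empty cells take no space.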

 

HydraCube Architecture
The diagram shows the architecture of the HydraCube server daemon, client SDK and client command shell.
 



HydraCube Architecture Diagram

hydracubed is the parallel daemon that runs across multiple shared-nothing desktop servers (or on SMP systems) and executes the commands.

The interconnect can be ordinary 100 Mbps Ethernet, or Gigabit Ethernet or Myrinet for better performance and lower latency.

MPI is the messaging middleware; we use MPICH, which is freely available from Argonne National Laboratory (ANL). The current version of MPICH is not thread safe for concurrent execution (send / receive). We employ a comms framework to overcome this issue and to provide a higher-level programming abstraction.
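
How the comms framework works internally is not described here. One common way to use a transport that is not thread safe from a multi-threaded program is to funnel every call through a single communication thread, and the sketch below illustrates that idea. The `CommsChannel` class and `send_fn` are hypothetical; `send_fn` stands in for the real MPI send.

```python
import queue
import threading


class CommsChannel:
    """Funnels sends from many threads through one communication thread."""

    def __init__(self, send_fn):
        self._send_fn = send_fn            # stand-in for the real MPI send
        self._outbox = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def send(self, dest, message):
        """Thread safe: callers only enqueue and never touch the transport."""
        self._outbox.put((dest, message))

    def close(self):
        """Flush queued messages and stop the communication thread."""
        self._outbox.put(None)
        self._worker.join()

    def _run(self):
        while True:
            item = self._outbox.get()
            if item is None:               # sentinel pushed by close()
                break
            dest, message = item
            self._send_fn(dest, message)   # only this thread calls the transport
```

Worker threads call `send` freely, while the transport only ever sees calls from one thread. The same queue also gives callers a simpler interface than raw send and receive calls.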

The CMD layer holds the commands, which are created and sent by the master process and executed remotely on every node. It works with the comms layer to achieve this.
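
The master/node split can be pictured with a small command-pattern sketch. Every name in it (`CountCells`, `Node`, `broadcast`) is hypothetical, and the nodes are plain objects in one process; in the real system the command would travel to each node through the comms layer instead of being called directly.

```python
from dataclasses import dataclass


@dataclass
class CountCells:
    """Hypothetical command: count the cells a node holds for one cube."""
    cube: str

    def execute(self, node):
        return len(node.cubes.get(self.cube, {}))


class Node:
    """One cluster node with its local share of each cube."""

    def __init__(self, rank, cubes):
        self.rank = rank
        self.cubes = cubes  # cube name -> local partition of cells


def broadcast(command, nodes):
    """Master side: run the same command on every node, collect results by rank."""
    return {node.rank: command.execute(node) for node in nodes}
```

The master builds one command object and every node executes it against its own data, so adding a new operation means adding a new command class rather than changing the dispatch code.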

The MODEL layer has all the hierarchy and cube management code along with the OLAP query logic. It performs all business validations and computations, and runs on every node.

The DAO layer is an abstraction over Berkeley DB that handles persistence of hierarchies, cubes and data chunks.
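
Berkeley DB is a key-value store, so a DAO over it mostly has to decide how keys are named and how values are serialized. The sketch below illustrates that under stated assumptions: the `ChunkDao` class, the key layout and the JSON encoding are all hypothetical, and a plain dictionary stands in for the Berkeley DB handle.

```python
import json


class ChunkDao:
    """Persists cube data chunks in a bytes-to-bytes key-value store."""

    def __init__(self, store):
        self._store = store  # stand-in for a Berkeley DB handle

    @staticmethod
    def _key(cube, chunk_id):
        # Assumed key layout; the real one is not documented here.
        return f"chunk/{cube}/{chunk_id}".encode("utf-8")

    def save(self, cube, chunk_id, cells):
        """Store a chunk given as a list of (offset, value) pairs."""
        self._store[self._key(cube, chunk_id)] = json.dumps(cells).encode("utf-8")

    def load(self, cube, chunk_id):
        """Return the chunk's (offset, value) pairs, or [] if it is absent."""
        raw = self._store.get(self._key(cube, chunk_id))
        if raw is None:
            return []
        return [(int(o), float(v)) for o, v in json.loads(raw.decode("utf-8"))]
```

Code above the DAO works with cubes and chunks and never sees keys or byte strings, so the storage engine could be swapped without touching the model code.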

The ServerLib/Parser layer has the controller logic and the MDX parser, and runs on the master.

The serverd/CORBA layer is the CORBA interface through which clients connect. Since it supports IIOP, any client (C++ or Java) can connect to the server. We use omniORB for CORBA support.

CORBA STUB is the C++ client stub that the client SDK can use.

The client SDK is written in C++; a Java version could be written without much effort.

ClientCMD is the client command line utility for executing HydraCube commands and queries. It works interactively or in batch mode: it accepts commands and queries, has them executed on the server and displays the results.

A Web Service is a possible future addition and would use the client SDK to interact with the server.

 

Deployment Architecture
Option 1: Deployment on a shared-nothing commodity cluster, also called a network of workstations (NOW).

Network of Workstations (NOW)

 

Option 2: Deployment on a symmetric multi processing (SMP) system.

Symmetric Multi Processing (SMP)

 

 

Copyright © 2005, ApeSoft Technologies
www.hydracube.sourceforge.net