HydraCube - Parallel Aggregation Engine

Sparsity & Data Explosion

What is Sparsity

Read OLAP Data Scalability

In an OLAP cube, the cross product of the dimension members forms the intersections (cells) that hold measure data. In practice, most of these intersections have no data, and this is what makes a cube sparse. Input data or base data (i.e. before calculated hierarchies or levels are applied) in OLAP applications is typically sparse (not densely populated). Also, as the number of dimensions increases, the data typically becomes sparser (less dense).

Superficially, any multi-dimensional model needs to provide space for every possible combination of data points. Since most data points in a sparse model are zeros, the main issue is how to store only the non-zero values. For example, if the data density of a model is 1% and there is no sparsity handling, the resulting model will be 100 times larger than a model that has perfect sparsity handling. Sparsity handling is therefore the efficient storage of very sparse data.
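The 1% example above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the dimension sizes and the function names are hypothetical, and the ratio ignores the index overhead that any real sparse format adds per stored cell.

```python
from math import prod


def dense_cell_count(dimension_sizes):
    """Cells a dense array must allocate: the cross product of all dimension sizes."""
    return prod(dimension_sizes)


def storage_ratio(total_cells, populated_cells):
    """How many times larger dense storage is than storing only the populated cells.

    Assumption: a stored cell costs the same in both layouts (no index overhead).
    """
    if populated_cells <= 0:
        raise ValueError("populated_cells must be positive")
    return total_cells / populated_cells


# Hypothetical model: 100 products x 50 stores x 20 periods.
sizes = [100, 50, 20]
total = dense_cell_count(sizes)   # 100000 possible intersections
populated = total // 100          # 1% density gives 1000 populated cells
print(total, populated, storage_ratio(total, populated))
```

With these numbers the dense layout allocates 100,000 cells to hold 1,000 values, which is the factor of 100 mentioned in the text.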

Sparsity Handling

Read An Infrastructure for Scalable Parallel Multidimensional Analysis

"Traditional multidimensional databases store data in multidimensional arrays on which analytical operations are performed. Multidimensional arrays are good to store dense data, but most datasets are sparse in practice for which other efficient storage schemes are required. It is important to weigh the trade-offs involved in reducing the storage space versus the increase in access time for each sparse data structure, in comparison to multidimensional arrays. These trade-offs are dependent on many parameters some of which are (1) number of dimensions, (2) sizes of dimensions and (3) degree of sparsity of the data.

Complex operations such as required for OLAP can be very expensive in terms of data access time if efficient data structures are not used. Sparse data structures such as the R-tree and its variants have been used for OLAP. Range queries with a lot of unspecified dimensions are expensive because many paths have to be traversed in the tree to calculate aggregates. Chunking has been used in applications with dense and sparse chunks. Sparse chunks store an Offset-Value pair for the data present. Dimensional operations on these require materializing the sparse chunk into a multi-dimensional array and performing array operations on it. For a high number of dimensions this might not be possible since the materialized chunk may not fit in memory. The sparse dimensions use a sparse index structure to index into the dense blocks of data stored as multi-dimensional arrays. Further, none of these address parallelism and scalability to large data sets in a high number of dimensions."
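To make the Offset-Value idea in the quoted passage concrete, here is a minimal sketch in Python. It assumes a row-major linearization of cell coordinates within a chunk; the function names and the chunk shape are hypothetical and are not taken from the cited paper.

```python
from math import prod


def to_offset(coords, shape):
    """Row-major linear offset of a cell inside a chunk of the given shape."""
    offset = 0
    for c, size in zip(coords, shape):
        if not 0 <= c < size:
            raise IndexError(f"coordinate {c} out of range for size {size}")
        offset = offset * size + c
    return offset


def from_offset(offset, shape):
    """Inverse of to_offset: recover the coordinates from a linear offset."""
    coords = []
    for size in reversed(shape):
        offset, c = divmod(offset, size)
        coords.append(c)
    return tuple(reversed(coords))


def materialize(pairs, shape):
    """Expand offset-value pairs into a flat dense list.

    The dense list has one slot per possible cell, which is the step that
    may not fit in memory for a high number of dimensions.
    """
    dense = [0] * prod(shape)
    for offset, value in pairs:
        dense[offset] = value
    return dense


shape = (4, 3, 2)                                    # hypothetical chunk shape
sparse_chunk = [(to_offset((2, 1, 1), shape), 5.0)]  # one populated cell
dense = materialize(sparse_chunk, shape)             # 24 slots for 1 value
```

The sparse chunk stores a single pair, while the materialized form needs a slot for every cell in the chunk.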

Sparsity Handling in HydraCube

Sanjay Goil & Alok Choudhary [Department of Electrical & Computer Engineering, Northwestern University] proposed a different data structure for handling sparsity efficiently: the Bit-Encoded Sparse Structure (BESS), a novel data structure that uses bit encodings for dimension indices. BESS is used to store sparse data in chunks, and it supports fast OLAP query operations on sparse data using bit operations, without the need to explode the sparse data into a multidimensional array.

HydraCube is built on this methodology but deviates from it slightly, both because of implementation constraints and to keep the implementation simpler. Refer to the above-mentioned literature for more information on BESS, chunks and multidimensional queries.

HydraCube treats every chunk as a sparse chunk and stores it using BESS. We do this because we believe the input data is always sparse enough that handling a dense case separately is not worthwhile.
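As a rough illustration of the bit-encoding idea, the sketch below packs the dimension indices of a cell into one integer and aggregates with a bit mask, without building a dense array. It is written in Python under stated assumptions: the field layout, the class name BitEncoder and its methods are hypothetical, and this is neither the BESS layout from the paper nor HydraCube's actual implementation.

```python
def bits_needed(size):
    """Bits required to encode the indices 0..size-1 (at least one bit)."""
    return max(1, (size - 1).bit_length())


class BitEncoder:
    """Packs one index per dimension into a single integer key (illustrative only)."""

    def __init__(self, dimension_sizes):
        self.sizes = list(dimension_sizes)
        self.widths = [bits_needed(s) for s in self.sizes]
        # The last dimension occupies the lowest bits.
        self.shifts = []
        shift = 0
        for width in reversed(self.widths):
            self.shifts.append(shift)
            shift += width
        self.shifts.reverse()
        self.masks = [((1 << w) - 1) << s for w, s in zip(self.widths, self.shifts)]

    def encode(self, coords):
        key = 0
        for c, size, shift in zip(coords, self.sizes, self.shifts):
            if not 0 <= c < size:
                raise IndexError(f"index {c} out of range for size {size}")
            key |= c << shift
        return key

    def extract(self, key, dim):
        """Recover the index of one dimension from an encoded key."""
        return (key & self.masks[dim]) >> self.shifts[dim]

    def aggregate(self, chunk, keep_dims):
        """Sum values grouped by the kept dimensions, using one combined mask."""
        mask = 0
        for d in keep_dims:
            mask |= self.masks[d]
        totals = {}
        for key, value in chunk:
            group = key & mask  # bits of the dropped dimensions are cleared
            totals[group] = totals.get(group, 0) + value
        return totals


enc = BitEncoder([4, 3, 2])  # hypothetical dimension sizes
chunk = [
    (enc.encode((2, 1, 1)), 5.0),
    (enc.encode((2, 1, 0)), 3.0),
    (enc.encode((0, 2, 1)), 1.0),
]
totals = enc.aggregate(chunk, keep_dims=[0, 1])  # aggregate away the last dimension
```

Here the first two cells differ only in the last dimension, so masking that field out maps them to the same group and their values are summed.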

What is Data Explosion

Read OLAP Data Scalability

Data explosion is the phenomenon that occurs in multidimensional models where the number of derived or calculated values significantly exceeds the number of base values. There are three main factors that contribute to data explosion.

Sparsely populated base data increases the likelihood of data explosion
Many dimensions in a model increase the likelihood of data explosion
A high number of calculated levels in each dimension increases the likelihood of data explosion
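The first two factors can be seen in a small counting exercise. The sketch below, in Python, assumes the simplest possible hierarchy: every dimension has exactly one calculated level, a single "all members" total, written here as the hypothetical marker ALL. It counts the distinct derived cells that a set of base cells gives rise to.

```python
from itertools import product

ALL = "*"  # hypothetical marker for the single calculated level of a dimension


def derived_cells(base_cells):
    """Distinct aggregate cells implied by the base cells.

    For each base cell, every dimension is either kept as the member itself
    or rolled up to ALL; the base cell itself is not counted as derived.
    """
    derived = set()
    for cell in base_cells:
        for combo in product(*[(member, ALL) for member in cell]):
            if combo != cell:
                derived.add(combo)
    return derived


# Dense 2 x 2 model: all four intersections are populated.
dense_2d = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
# Sparse 2 x 2 model: only two of the four intersections are populated.
sparse_2d = [("a", "x"), ("b", "y")]
# Sparse model with a third dimension added.
sparse_3d = [("a", "x", "p"), ("b", "y", "q")]

for name, base in [("dense 2D", dense_2d), ("sparse 2D", sparse_2d), ("sparse 3D", sparse_3d)]:
    print(name, len(base), len(derived_cells(base)))
```

In this toy setup the dense model has 5 derived cells for 4 base cells, the sparse two-dimensional model has 5 derived cells for only 2 base cells, and adding a third dimension raises that to 13 derived cells for 2 base cells.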

Explosion Handling in HydraCube

HydraCube does not precalculate aggregate values and store them on disk or in memory; it aggregates the data on demand. As a result, data explosion does not arise, and HydraCube needs very little memory for its queries. Because each query is distributed and each processor aggregates only its own chunks, HydraCube's performance is also acceptable. We acknowledge that it will be slower than a true in-memory OLAP engine, but it is much faster than a typical ROLAP system. We believe the minimal hardware requirements and the ability to run across a cluster of workstations outweigh the performance sacrifice.
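The on-demand approach can be sketched as follows. This is a minimal illustration in Python, not HydraCube's code: worker threads stand in for the processors of a cluster, the chunk layout (a list of coordinate and value pairs) is an assumption, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def aggregate_chunk(chunk, group_dim):
    """Partial totals for one chunk, grouped by the member of one dimension."""
    partial = {}
    for coords, value in chunk:
        key = coords[group_dim]
        partial[key] = partial.get(key, 0) + value
    return partial


def aggregate_on_demand(chunks, group_dim, max_workers=4):
    """Aggregate every chunk in parallel at query time and merge the partial results.

    Nothing is precomputed or stored; only the small partial totals are held in memory.
    """
    totals = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = pool.map(lambda c: aggregate_chunk(c, group_dim), chunks)
        for partial in partials:
            for key, value in partial.items():
                totals[key] = totals.get(key, 0) + value
    return totals


# Two hypothetical chunks of base data: (coordinates, measure value).
chunks = [
    [(("a", "x"), 1), (("b", "x"), 2)],
    [(("a", "y"), 3)],
]
result = aggregate_on_demand(chunks, group_dim=0)
```

Each worker touches only its own chunk, and the merge step combines one small dictionary per chunk, so memory use grows with the size of the query result rather than with the number of possible aggregates.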