Sparsity & Data Explosion


What is Sparsity
Read: OLAP Data Scalability
In an OLAP cube, the cross product of the dimension members
forms the intersections (cells) that hold measure data. In
reality, most of these intersections contain no data, and this
leads to sparsity. Input data or base data (i.e. before
calculated hierarchies or levels) in OLAP applications is
typically sparse (not densely populated). Also, as the number
of dimensions increases, data typically becomes sparser
(less dense).
Superficially, any multidimensional model
needs to provide space for every possible combination of data
points. Since most data points in a sparse model are empty
(zero), the main issue is how to store only the values that
are actually present. For example, if the
data density of a model is 1% and there is no sparsity
handling, the resulting model will be 100 times larger than a
model that has perfect sparsity handling. Sparsity
handling is therefore the efficient storage of very sparse
data.
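To make the 1% example concrete, here is a minimal Python sketch. The dimension sizes and the function name are hypothetical, chosen only so that the density comes out at 1%. It counts cells rather than bytes and ignores the per-cell index overhead that any real sparse scheme adds.

```python
from math import prod

def cell_counts(dim_sizes, populated_cells):
    """Return (total_cells, density, dense_to_sparse_ratio) for a cube."""
    total = prod(dim_sizes)             # every possible intersection
    density = populated_cells / total   # fraction of cells that hold data
    ratio = total / populated_cells     # cells allocated per populated cell
    return total, density, ratio

# Hypothetical cube: 100 products x 50 stores x 20 periods,
# of which only 1,000 intersections actually carry a value.
total, density, ratio = cell_counts([100, 50, 20], 1_000)
print(f"total cells: {total}, density: {density:.0%}, ratio: {ratio:.0f}x")
```

With these assumed numbers a store that allocates every cell holds 100,000 cells for 1,000 values, which is the factor of 100 mentioned above.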

Sparsity Handling
Read: An Infrastructure for Scalable Parallel Multidimensional Analysis
"Traditional multidimensional databases store data in multidimensional arrays on which analytical operations are performed. Multidimensional arrays are good to store dense data, but most datasets are sparse in practice for which other efficient storage schemes are required. It is important to weigh the tradeoffs involved in reducing the storage space versus the increase in access time for each sparse data structure, in comparison to multidimensional arrays. These tradeoffs are dependent on many parameters some of which are (1) number of dimensions, (2) sizes of dimensions and (3) degree of sparsity of the data.
Complex operations such as required for OLAP can be very expensive in terms of data access time if efficient data structures are not used. Sparse data structures such as the R-tree and its variants have been used for OLAP. Range queries with a lot of unspecified dimensions are expensive because many paths have to be traversed in the tree to calculate aggregates. Chunking has been used in applications with dense and sparse chunks. Sparse chunks store an offset-value pair for the data present. Dimensional operations on these require materializing the sparse chunk into a multidimensional array and performing array operations on it. For a high number of dimensions this might not be possible since the materialized chunk may not fit in memory. The sparse dimensions use a sparse index structure to index into the dense blocks of data stored as multidimensional arrays. Further, none of these address parallelism and scalability to large data sets in a high number of dimensions."
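The offset-value representation mentioned in the quotation can be illustrated with a short Python sketch. It assumes a row-major linear offset within the chunk. The function names, chunk shape, and sample cells are hypothetical and are not taken from the cited paper.

```python
def to_offset(coords, shape):
    """Row-major linear offset of a cell inside a chunk of the given shape."""
    offset = 0
    for c, size in zip(coords, shape):
        offset = offset * size + c
    return offset

def from_offset(offset, shape):
    """Inverse of to_offset: recover the coordinates from a linear offset."""
    coords = []
    for size in reversed(shape):
        offset, c = divmod(offset, size)
        coords.append(c)
    return tuple(reversed(coords))

def materialize(pairs, shape):
    """Expand offset-value pairs into a flat dense array (0.0 for empty cells)."""
    size = 1
    for s in shape:
        size *= s
    dense = [0.0] * size
    for offset, value in pairs:
        dense[offset] = value
    return dense

# Hypothetical 4 x 3 x 5 chunk holding only two populated cells.
shape = (4, 3, 5)
cells = {(0, 2, 1): 7.5, (3, 0, 4): 2.0}
pairs = sorted((to_offset(c, shape), v) for c, v in cells.items())
dense = materialize(pairs, shape)
print(pairs, len(dense))
```

The sparse form keeps two pairs, while materializing allocates all 60 cells of the chunk. That gap grows with the number of dimensions, which is the memory problem the quotation points out.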

Sparsity Handling in HydraCube
Sanjay Goil & Alok Choudhary [Department
of Electrical & Computer Engineering, Northwestern
University] proposed using a different data
structure to handle sparsity efficiently. Their data structure, called
Bit-Encoded Sparse Structure (BESS), uses bit encodings for dimension indices to store sparse data in chunks. It supports fast OLAP query operations on sparse data using bit operations, without the need to explode the sparse data into a multidimensional
array.
HydraCube is built on this methodology but deviates
from it slightly, because of implementation constraints and to
keep it simpler. Refer to the above-mentioned literature for more
information on BESS, chunks and multidimensional queries.
HydraCube treats
every chunk as a sparse chunk and stores it using BESS.
We believe the input data
is always sparse enough that a dense scenario is not worth
considering.
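The general idea of bit-encoding dimension indices can be sketched in Python as follows. This is an illustration under assumptions, not the BESS layout from the paper or HydraCube's actual implementation: the bit widths, field order, and function names are all hypothetical.

```python
def bit_widths(shape):
    """Bits needed to encode an index 0..size-1 in each dimension."""
    return [max(1, (size - 1).bit_length()) for size in shape]

def encode(coords, widths):
    """Pack one index per dimension into a single integer key."""
    key = 0
    for c, w in zip(coords, widths):
        key = (key << w) | c
    return key

def extract(key, dim, widths):
    """Read one dimension's index back with a shift and a mask."""
    shift = sum(widths[dim + 1:])
    return (key >> shift) & ((1 << widths[dim]) - 1)

def aggregate_out(entries, dim, widths):
    """Sum away one dimension by clearing its bits in every key."""
    shift = sum(widths[dim + 1:])
    mask = ~(((1 << widths[dim]) - 1) << shift)
    totals = {}
    for key, value in entries:
        reduced = key & mask
        totals[reduced] = totals.get(reduced, 0.0) + value
    return totals

# Hypothetical 4 x 3 x 5 chunk with three populated cells.
shape = (4, 3, 5)
widths = bit_widths(shape)
cells = {(0, 2, 1): 7.5, (0, 2, 4): 1.0, (3, 0, 4): 2.0}
entries = [(encode(c, widths), v) for c, v in cells.items()]
totals = aggregate_out(entries, 2, widths)   # roll up the last dimension
print(widths, entries, totals)
```

In this sketch, aggregating along a dimension needs only a mask per stored value, so the chunk is never expanded into a dense array.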

What is Data Explosion
Read: OLAP Data Scalability
Data explosion is the phenomenon that occurs in multidimensional models where the
number of derived or calculated values significantly exceeds the number of base values.
There are three main factors that contribute to data explosion:
- Sparsely populated base data increases the likelihood of data explosion
- Many dimensions in a model increase the likelihood of data explosion
- A high number of calculated levels in each dimension increases the likelihood of
  data explosion
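The first factor can be illustrated with a small Python sketch that counts the distinct aggregate cells produced when every dimension has a single "all members" level above its leaves. The member names, the `ALL` marker, and the function name are hypothetical, and real models usually have more levels per dimension.

```python
from itertools import product

ALL = "*"  # hypothetical marker for the top ("all members") level

def derived_cells(base_cells):
    """Distinct aggregate cells when each dimension can roll up to ALL."""
    derived = set()
    for cell in base_cells:
        for rollup in product([False, True], repeat=len(cell)):
            if not any(rollup):
                continue  # nothing rolled up: this is the base cell itself
            derived.add(tuple(ALL if r else m for m, r in zip(cell, rollup)))
    return derived

products = ["p1", "p2", "p3"]
stores = ["s1", "s2", "s3"]
periods = ["t1", "t2", "t3"]

# Sparse base data: 3 of the 27 possible cells are populated.
sparse = [("p1", "s1", "t1"), ("p2", "s2", "t2"), ("p3", "s3", "t3")]
# Dense base data: all 27 cells are populated.
dense = list(product(products, stores, periods))

for name, base in (("sparse", sparse), ("dense", dense)):
    n = len(derived_cells(base))
    print(f"{name}: {len(base)} base cells -> {n} derived cells "
          f"({n / len(base):.2f} per base cell)")
```

In this toy model the sparse input yields 19 derived cells from 3 base cells, while the dense input yields 37 from 27. The ratio of derived to base values is therefore much higher for the sparse data.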

Explosion Handling in HydraCube
HydraCube does not precalculate
aggregate values or store them on disk or in memory. It
aggregates the data on demand. This means that data explosion
does not arise at all, and HydraCube needs very little
memory for its queries. Since each query is distributed and
each processor aggregates its own chunk,
HydraCube's performance is also acceptable. We acknowledge
that it will be slower than a true in-memory OLAP engine but
much faster than a typical ROLAP system. The minimal hardware
requirements and the ability to run across a cluster of
workstations outweigh the performance sacrifice.
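The on-demand, per-chunk aggregation described above can be sketched in Python as follows. This is an illustration only, not HydraCube's code: threads in one process stand in for the processors of a cluster, and the chunk layout, function names, and sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def aggregate_chunk(chunk, predicate):
    """Sum the values of one chunk's cells that match the query predicate."""
    return sum(value for coords, value in chunk if predicate(coords))

def query_sum(chunks, predicate, workers=4):
    """Aggregate every chunk independently, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: aggregate_chunk(c, predicate), chunks))
    return sum(partials)

# Hypothetical base data split into three chunks of (coordinates, value) cells.
chunks = [
    [(("p1", "s1"), 10.0), (("p2", "s1"), 5.0)],
    [(("p1", "s2"), 2.5)],
    [(("p3", "s1"), 4.0), (("p1", "s3"), 1.0)],
]

# Total for product p1 across all stores, computed only when asked for.
total_p1 = query_sum(chunks, lambda coords: coords[0] == "p1")
print(total_p1)
```

No aggregate is stored in this sketch. Each query reads only the base cells, so memory use stays proportional to the base data, at the cost of recomputing the sum on every request.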

