Home : Sparsity & Data
In an OLAP cube, the cross product of the dimension
members forms the intersections that hold measure data. In
reality, most of these intersections have no data, which
leads to sparsity. Input data or base data (i.e. before
calculated hierarchies or levels) in OLAP applications is
typically sparse (not densely populated). Also, as the number
of dimensions increases, the data typically becomes sparser.
Superficially, any multi-dimensional model
needs to provide space for every possible combination of data
points. Since most data points in sparse models are zeros, the
main issue is how to store only the values other than zero.
For example, if the
data density of a model is 1% and there is no sparsity
handling, the resulting model will be 100 times larger than a
model that has perfect sparsity handling. Sparsity
handling is therefore the efficient storage of very sparse
data.
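The 1% figure above can be checked with a little arithmetic. The sketch below is only an illustration: the cube dimensions (100 products, 50 stores, 20 periods) and the helper name `dense_cells` are assumptions, and "perfect sparsity handling" is taken to mean storing exactly the populated cells with no index overhead.

```python
from math import prod


def dense_cells(dim_sizes):
    """Cells a dense array must reserve: the full cross product of the dimensions."""
    return prod(dim_sizes)


# Hypothetical cube: 100 products x 50 stores x 20 periods.
dim_sizes = (100, 50, 20)
populated = 1_000  # cells that actually hold a value

total = dense_cells(dim_sizes)   # every possible intersection
density = populated / total      # fraction of intersections with data
size_ratio = total / populated   # dense storage vs. storing only populated cells

print(f"total cells: {total}, density: {density:.0%}, size ratio: {size_ratio:.0f}x")
```

In practice a sparse structure also has to store the coordinates of each populated cell, so the real saving is smaller than this idealised ratio.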
Read An Infrastructure for Scalable Parallel Multidimensional Analysis
"Traditional multidimensional databases store data in multidimensional arrays on which analytical operations are performed. Multidimensional arrays are good to store dense data, but most datasets are sparse in practice for which other efficient storage schemes are required. It is important to weigh the trade-offs involved in reducing the storage space versus the increase in access time for each sparse data structure, in comparison to multidimensional arrays. These trade-offs are dependent on many parameters some of which are (1) number of dimensions, (2) sizes of dimensions and (3) degree of
sparsity of the data.
Complex operations such as required for OLAP can be very expensive in terms of data access time if efficient data structures are not used. Sparse data structures such as the R-tree and its variants have been used for OLAP. Range queries with a lot of unspecified dimensions are expensive because many paths have to be traversed in the tree to calculate aggregates. Chunking has been used in
applications with dense and sparse chunks. Sparse chunks store an Offset-Value pair for the data present. Dimensional operations on these require materializing the sparse chunk into a multi-dimensional array and performing array operations on it. For a high number of dimensions this might not be possible since the materialized chunk may not fit in memory. The sparse dimensions use a sparse index structure to index into the dense blocks of data stored as multi-dimensional arrays. Further, none of these address parallelism and scalability to large data sets in a high number of dimensions."
Sparsity Handling in HydraCube
Sanjay Goil & Alok Choudhary [Department
of Electrical & Computer Engineering, Northwestern
University] proposed a different data
structure for handling sparsity efficiently. Their Bit-Encoded Sparse Structure (BESS),
a novel data structure that uses bit encodings for dimension indices, stores sparse data in chunks and supports fast OLAP query operations on sparse data using bit operations, without the need to explode the sparse data into a multidimensional array.
HydraCube is built on this methodology but deviates
from it slightly, both because of implementation constraints and to
keep it simpler. Refer to the literature mentioned above for more
information on BESS, chunks and multidimensional queries.
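To make the idea of bit-encoded dimension indices concrete, here is a minimal sketch. It is not the layout from the BESS paper or HydraCube's actual implementation; the function names, the bit layout (first dimension in the highest bits) and the sample data are all assumptions chosen for illustration. Each cell's dimension indices are packed into one integer key, and a query extracts a single dimension with a shift and a mask, so the sparse chunk never has to be expanded into an array.

```python
from collections import defaultdict


def bit_widths(dim_sizes):
    """Bits needed to hold indices 0..size-1 for each dimension (at least 1)."""
    return [max(1, (size - 1).bit_length()) for size in dim_sizes]


def encode(indices, widths):
    """Pack one index per dimension into a single integer key."""
    key = 0
    for idx, width in zip(indices, widths):
        key = (key << width) | idx
    return key


def extract(key, dim, widths):
    """Recover the index of dimension `dim` from a key using shift and mask."""
    shift = sum(widths[dim + 1:])
    mask = (1 << widths[dim]) - 1
    return (key >> shift) & mask


def sum_by_dimension(chunk, dim, widths):
    """Aggregate a sparse chunk of (key, value) pairs along one dimension."""
    totals = defaultdict(int)
    for key, value in chunk:
        totals[extract(key, dim, widths)] += value
    return dict(totals)


# Hypothetical cube with dimension sizes 4 x 8 x 3 and three populated cells.
widths = bit_widths((4, 8, 3))
chunk = [
    (encode((2, 5, 1), widths), 10),
    (encode((2, 0, 0), widths), 5),
    (encode((3, 5, 2), widths), 7),
]
print(sum_by_dimension(chunk, 0, widths))
```

Only populated cells are stored, and each costs one packed key plus one value regardless of how large the full cross product is.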
We treat every chunk as a sparse chunk and store it using BESS.
We believe the input data
is always sparse enough that it is not worth considering a
dense chunk representation.
What is Data Explosion?
Data explosion is the phenomenon that occurs in multidimensional models where the
number of derived or calculated values significantly exceeds the number of base values.
There are three main factors that contribute to data explosion.
- Sparsely populated base data increases the likelihood of data explosion
- Many dimensions in a model increase the likelihood of data explosion
- A high number of calculated levels in each dimension increases the likelihood of
data explosion
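The effect of sparsity on data explosion can be seen in a toy model. The sketch below is an illustration under simplifying assumptions: every dimension has exactly one calculated level (a single "ALL" member), and the names `ALL` and `derived_cells` as well as the sample cells are hypothetical. It counts the distinct aggregate cells that full precalculation would have to store for a given set of base cells.

```python
from itertools import product

ALL = "*"  # stands for the single rolled-up member of a dimension


def derived_cells(base_cells):
    """Distinct aggregate cells implied by the base cells (one ALL level per dimension)."""
    derived = set()
    for cell in base_cells:
        # Each coordinate is either kept or rolled up to ALL;
        # the combination that keeps every coordinate is the base cell itself.
        for rolled in product((False, True), repeat=len(cell)):
            if any(rolled):
                derived.add(tuple(ALL if r else c for c, r in zip(cell, rolled)))
    return derived


# Two base cells that share no coordinate (scattered, sparse data).
scattered = {("p1", "s1", "t1"), ("p2", "s2", "t2")}
# Two base cells that differ only in the last dimension (clustered data).
clustered = {("p1", "s1", "t1"), ("p1", "s1", "t2")}

print(len(derived_cells(scattered)), len(derived_cells(clustered)))
```

With the same number of base cells, the scattered data produces more distinct aggregates because fewer roll-ups coincide. Adding a dimension doubles the roll-up combinations per base cell in this model, and more calculated levels per dimension would multiply them further.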
Data Explosion Handling in HydraCube
HydraCube does not precalculate
aggregate values and store them on disk or in memory. It
aggregates the data on demand. This means that data
explosion does not arise at all, and HydraCube needs very little
memory for its queries. Since each query is distributed and
each processor aggregates only its own chunk,
HydraCube's performance is also acceptable. We acknowledge
that it will be slower than a true in-memory OLAP engine, but
much faster than a typical ROLAP system. The minimal hardware
requirement and the ability to run across a cluster of
workstations outweigh the performance sacrifice.
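The on-demand approach described above can be sketched as follows. This is not HydraCube's code: threads in a single process stand in for the processors of a cluster, and the function names, chunk layout and sample data are assumptions. Each worker aggregates one chunk into partial sums, and the partial results are merged at query time, so no aggregate is ever stored in advance.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def aggregate_chunk(chunk, dim):
    """Partial sums for one chunk, grouped by the coordinate at position `dim`."""
    partial = Counter()
    for coords, value in chunk:
        partial[coords[dim]] += value
    return partial


def aggregate_on_demand(chunks, dim, workers=4):
    """Aggregate every chunk in parallel and merge the partial sums."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(lambda c: aggregate_chunk(c, dim), chunks):
            total.update(partial)  # Counter.update adds values per key
    return dict(total)


# Hypothetical base data split into two chunks of ((product, store), value) cells.
chunks = [
    [(("p1", "s1"), 10), (("p2", "s1"), 5)],
    [(("p1", "s2"), 7)],
]
print(aggregate_on_demand(chunks, dim=0))
```

Memory use is bounded by the base data plus the result of the current query, which is the trade-off the section describes: more work per query in exchange for no stored aggregates.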