Approximation of Iceberg Cubes

Using Data Reduction Techniques

Written by Leonid Bukhman

Advisor: Hervé Brönnimann

Submitted in Partial Fulfillment of the Requirements

for the Degree of Master of Science (Computer Science)

June 2005

Source Code (Perl): [file1], [file2], [file3]

Data reduction techniques have often been used to alleviate the scalability problems of processing large datasets. Scalability becomes an even greater concern when managing multidimensional datasets, where data reduction plays a much more significant role in the solution. This thesis explores the use of deterministic sampling algorithms to create a data sample that can be used as a surrogate for a multidimensional dataset. The sample can then be utilized to build an iceberg cube of frequency counts that is a close approximation to one built from the original data. Techniques for approximation of frequency counts are applied to reduce the memory complexity of the algorithm. The resulting algorithms run in a single pass over the data with a small memory footprint and within the asymptotic bounds of the algorithms from which they are derived. These properties make these sampling algorithms ideal for data stream applications such as network traffic analysis.
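The single-pass frequency-approximation technique mentioned above is lossy counting, from the Manku and Motwani paper listed in the references. A minimal Python sketch of that technique (illustrative only; the thesis's actual implementation is in Perl, and the function names here are invented for this example):

```python
from math import ceil

def lossy_count(stream, epsilon):
    """Single-pass approximate frequency counts (lossy counting).
    Each reported count undercounts the true frequency by at most
    epsilon * N, where N is the stream length seen so far."""
    counts = {}                      # item -> (count, max undercount delta)
    bucket_width = ceil(1 / epsilon)
    n = 0
    for item in stream:
        n += 1
        b_current = ceil(n / bucket_width)
        if item in counts:
            f, delta = counts[item]
            counts[item] = (f + 1, delta)
        else:
            counts[item] = (1, b_current - 1)
        if n % bucket_width == 0:    # prune at bucket boundaries
            counts = {k: (f, d) for k, (f, d) in counts.items()
                      if f + d > b_current}
    return counts, n

def frequent_items(stream, support, epsilon):
    """Report items whose true frequency is at least support * N.
    No such item is missed; reported counts may be slight underestimates."""
    counts, n = lossy_count(stream, epsilon)
    return {k: f for k, (f, d) in counts.items()
            if f >= (support - epsilon) * n}
```

The key property, which carries over to the thesis's algorithms, is that memory depends on the error bound ε and the stream length only logarithmically, not on the number of distinct items.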

A performance metric is defined that tests the quality of the sample produced by comparing it to the quality of a random sample of equivalent size. An iceberg-cube generator is also used to produce large random datasets for testing the sampling algorithms under various conditions and establishing their baseline behavior. Results show that the MinSupport_Biased-L2 algorithm is most suitable for the approximation of iceberg cubes, with a running-time complexity of O(2^d · N) and a memory complexity of O((2^d/s) · log(εN)), where N is the size of the dataset, d is the number of dimensions, s is the minimum support, and ε is the error bound on the frequency approximation. Empirical testing showed that this algorithm performs significantly better than a random sample of equivalent size and meets the goals and expectations defined for this thesis.
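To make the 2^d factor in the running time concrete: an iceberg cube of frequency counts touches, for every row, every subset of the d dimensions, and then discards cells below the minimum support. A didactic Python sketch of the naive construction (not the thesis's Perl code; the cell representation is an illustrative choice):

```python
from itertools import combinations
from collections import Counter

def iceberg_cube(rows, min_support):
    """Naive iceberg cube of frequency counts: enumerate all 2^d
    group-bys per row, count each cell, and keep only cells whose
    count meets min_support * N. Cost is O(2^d * N) cell updates."""
    n = len(rows)
    d = len(rows[0])
    threshold = min_support * n
    cube = Counter()
    for row in rows:
        for k in range(d + 1):
            for dims in combinations(range(d), k):
                # a cell is identified by the (dimension, value)
                # pairs it fixes; unlisted dimensions are aggregated
                cell = tuple((i, row[i]) for i in dims)
                cube[cell] += 1
    return {cell: c for cell, c in cube.items() if c >= threshold}
```

The sampling algorithms in the thesis attack the N factor (by running the same construction on a small deterministic sample) and the memory term (by keeping only approximate counts), while the 2^d enumeration over dimension subsets is inherent to the cube itself.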

- Efficient Data Reduction with EASE. Hervé Brönnimann, Bin Chen, Manoranjan Dash, Peter Haas, and Peter Scheuermann. Proc. ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD), Washington, DC, August 2003, pp. 59-68.
- Approximate Frequency Counts over Data Streams. G. S. Manku and R. Motwani. Proc. 28th VLDB Conference, 2002.