Approximation of Iceberg Cubes
Using Data Reduction Techniques

Writen by Leonid Bukhman
Advisor: Hervé Brönnimann

Submitted in Partial Fulfillment of the Requirements
for the Degree of Master of Science (Computer Science)
June 2005

Submitted Version: [PDF]
Source Code (Perl): [file1], [file2], [file3]

Abstract:

Data reduction techniques have often been used to alleviate the scalability problem of processing large datasets. Scalability becomes an even greater concern when managing multidimensional datasets and data reduction plays a much more significant role as part of the solution to this problem. This thesis explores the use of deterministic sampling algorithms to create a data sample that can be used as a surrogate for a multidimensional dataset. The sample can then be utilized to build an iceberg-cube of frequency counts that is a close approximation to one built from the original data. Techniques for approximation of frequency counts are applied to reduce the memory complexity of the algorithm. The resulting algorithms run in a single pass over the data with a small memory footprint and within the asymptotic bounds of the algorithms from which they are derived. These properties make these sampling algorithms ideal for data stream applications such as network traffic analysis.

A performance metric is defined which tests the quality of the sample produced by comparing it to the quality of a random sample of equivalent size. An iceberg-cube generator is also used to produce large random datasets for testing the sampling algorithms under various conditions and establishing their baseline behavior. Results show that the MinSupport_Biased-L2 algorithm is most suitable for the approximation of iceberg-cubes with a running time complexity of O(2d x N) and a memory complexity of O(2d/s x log(εN)), where N is the size of the dataset, d is the number of dimensions, s is the minimum support, and ε is the error bound on the frequency approximation. Empirical testing showed that this algorithm performs significantly better than an equivalent size random sample and meets the goals and expectations defined for this thesis.

Related Publications: