by
Hazel Jane Webb
Doctor of Philosophy
August 2010
© Hazel Jane Webb, 2010
Abstract
Queries that constrain multiple dimensions simultaneously are difficult to express, in both
Structured Query Language (SQL) and multidimensional languages. Moreover, they can
be inefficient. We introduce the diamond cube operator to facilitate the expression and
computation of one such important class of multidimensional query, and we define the
resulting cube to be a diamond.
The diamond cube operator identifies an important subcube in which all remaining attribute
values are strongly related. For example, suppose a company wants to close shops and
terminate product lines whilst simultaneously meeting profitability constraints over both
products and shops. The diamond operator would identify the maximal set of products
and shops that satisfies those constraints.
We determine the complexity of computing diamonds and investigate how pre-computed
materialised views can speed execution. Views are defined by the parameter k associated
with each dimension in every data cube. We prove that there is only one k1, k2, . . . , kd-diamond
for a given cube. By varying the ki’s we get a collection of diamonds for a cube,
and these diamonds form a lattice.
We also determine the complexity of discovering the most-constrained diamond that has
a non-empty solution. By executing binary search over theoretically-derived bounds, we
compute such a diamond efficiently.
Diamonds are defined over data cubes where the measures all have the same sign. We
prove that the more general case—computing a diamond where measures include both
positive and negative values—is NP-hard.
Finding dense subcubes in large data is a difficult problem. We investigate the role that di-
amonds play in the solution of three NP-hard problems that seek dense subcubes: Largest
Perfect Cube, Densest Cube with Limited Dimensions and Heaviest Cube with Limited
Dimensions.
We are interested in processing large data sets. We validated our algorithms on a variety of
freely-available and synthetically-generated data cubes, whose dimensionality ranges from
three to twenty-seven. Most of the cubes contain more than a million facts, and the largest
has more than 986 million facts. We show that our custom implementation is more than
twenty-five times faster, on a large data set, than popular database engines.
Dedication
I dedicate this dissertation to my family, whose love and support were invaluable. Most of
all, this is for my best friend and greatest supporter, my husband Phillip. Without his faith
in me during times of self-doubt, I would never have completed this work.
Acknowledgements
Obviously, this work was not done in isolation, and it is with pleasure and gratitude that
I acknowledge the support of many individuals. I have, first and foremost, to thank my
supervisors, Dr. Owen Kaser and Dr. Daniel Lemire, for their mentoring, guidance and
encouragement. I am particularly grateful for their patience when I took a long time
to understand a particular concept, or stubbornly went off on a tangent that they felt was
fruitless. Thanks to them this dissertation “tells a story” in a precise, and I hope, interesting
way. I hope that I will be able to provide the same kind of support and guidance to future
graduate students.
I must thank the members of the examining board; in particular Dr Tim Alderson for his
suggestions of how to simplify some of the proofs and Dr Andrew Rau-Chaplin for his
kind comments and suggestions for improvements.
Special thanks must also go to Dr Ruth Shaw, my MCS supervisor. Her positive attitude
and confidence in my abilities encouraged me to progress. She is a great role model and
mentor to all female students.
Every journey begins with a single step and this journey would never have begun if I had
not taken a course in Discrete Mathematics from Dr Larry Garey, now professor emeritus
at UNB Saint John. He introduced me to interesting and exciting topics and made me
realise that studying Computer Science was for me. Thank you, Larry, for inspiring me
and many generations of other students.
Table of Contents

Abstract
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
Notation

1 Introduction
   1.1 Motivation
      1.1.1 Sub-sampling
   1.2 Contribution

   2.4.1 Skyline Operator
   2.4.2 Top-k
   2.4.3 Formal Concept Analysis
   2.5 Applications for Diamond Cubes
      2.5.1 Visualisation
      2.5.2 Dense Subcubes

4 Algorithms
   4.1 Main Memory Algorithms
   4.2 External-Memory Algorithms
      4.2.1 A Hash-Table-Based Algorithm
      4.2.2 Normalisation
      4.2.3 A Sorting-Based Algorithm
      4.2.4 An SQL-Based Algorithm
   4.3 Finding the Largest Number of Carats, κ(C)

5 Experiments
   5.1 Hardware and Software
   5.2 Implementations
   5.3 Data Used in Experiments
      5.3.1 Real Data
         TWEED
         Netflix
         Census Income
         Census 1881
         King James Bible
         Forest Cover
         Weather
      5.3.2 Synthetic Data
   5.4 Preprocessing Step
   5.5 Finding κ(C) for COUNT-based Diamonds
      5.5.1 King James Bible
      5.5.2 Other Cubes
   5.6 Finding κ(C) for SUM-based Diamonds
   5.7 Comparison of Algorithm Speeds
   5.8 Diamond Size and Dimensionality
   5.9 Iterations to Convergence
   5.10 Robustness Against Randomly Missing Data

6 Related Problems
   6.1 Largest Perfect Cube
   6.2 Heaviest Cube with Limited Dimensions
   6.3 Densest Cube with Limited Dimensions
      6.3.1 A Related Graph Problem
   6.4 Effectiveness of DCLD Heuristic

7 Conclusion
   7.1 Future Research Directions
      7.1.1 Faster Computation
      7.1.2 Generalisations of Diamonds
      7.1.3 Applications
      7.1.4 Other Ideas

Glossary
Appendices
Bibliography
Curriculum Vitæ
List of Tables

2.1  A 3-dimensional relation with closed 3-set {(α, γ)(1, 2)(A, B)}.
5.1  Cube K2: the number of duplicates was recorded and used as the measure for the SUM-diamond.
5.2  Statistics of TWEED, Netflix and census data.
5.3  Statistics of forest, weather and King James Bible data.
5.4  Statistics of the synthetic data cubes.
5.5  Preprocessing times.
5.6  The modified Pegasus search was no more efficient than binary search.
5.7  The number of iterations it took to determine κ for COUNT-based diamonds.
5.8  The number of iterations it took to determine κ for SUM-based diamonds.
5.9  Values remaining in the K3 κ-diamond.
5.10 Statistics for real-data cubes and their respective κ-diamonds.
5.11 Comparison of string and binary algorithm speeds.
5.12 Relative slowdown of algorithms compared to the binary version.
5.13 Comparison of preprocessing times for SQL implementations.
5.14 Comparison of SQL implementations.
5.15 Slowdown of the SQL algorithms.
5.16 High dimensionality does not affect diamond size.
5.17 Robustness of κ(TW1) under various amounts of randomly missing data.
List of Figures

3.8  Data cube with positive and negative measures.
3.9  SUM-cube (k = 3) generated from Figure 3.8. Rows processed first.
3.10 SUM-cube (k = 3) generated from Figure 3.8. Columns processed first.
3.11 There is no unique diamond for κ = 5. Rows 1 and 2 could be reinstated after deletion of column 1.
3.12 An instance of MTP.
3.13 An MTP instance s1 = 3, s2 = n = 5. The k1 = 7, k2 = 5-diamond is shaded.
6.1  Example showing that a diamond (bottom-right quadrant) may not have optimal density.
A.1  Sample input line from weather_large.csv and its corresponding output line.
C.1  Cells remaining in DCLD1 after the duplicates were removed.
C.2  Cells remaining in DCLD2 after the duplicates were removed.
Notation

C    a data cube
Chapter 1
Introduction
Queries that constrain multiple dimensions simultaneously are difficult to express and
compute efficiently in both Structured Query Language (SQL) and multidimensional lan-
guages. We introduce the diamond cube operator to facilitate the expression of one such
class of multidimensional query. For example, we compute the (400 000, 100 000)-diamond
to solve the following diamond query from the business analysis domain.
Example 1.1. A company wants to close shops and terminate product lines simultane-
ously. The CEO wants a maximal set of products and shops such that each shop would
have sales greater than $400 000, and each product would bring revenues of at least
$100 000—assuming we terminate all other shops and product lines.
Diamonds are intrinsically multidimensional and, because of the interaction between di-
mensions, the computation of diamond cubes is challenging. We determine the complexity
of computing diamonds and investigate how pre-computed materialised views can speed
execution. We also determine the complexity of discovering the most-constrained dia-
mond that has a non-empty solution. By executing binary search over theoretically-derived
bounds we can compute such a diamond efficiently.
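The binary search can be sketched as follows. This is an illustrative Python sketch, not the dissertation's implementation: a toy two-dimensional COUNT-based pruning routine stands in for the full diamond operator, and the cell count serves as a safe (if loose) upper bound on the search.

```python
# Illustrative sketch: binary search for kappa, the largest k whose
# k-diamond is non-empty. diamond() is a toy 2-D COUNT-based pruner:
# it repeatedly deletes row/column slices whose cell count is below k.

def diamond(cells, k):
    cells = dict(cells)
    while True:
        sums = {}
        for (r, c), v in cells.items():
            sums[("row", r)] = sums.get(("row", r), 0) + v
            sums[("col", c)] = sums.get(("col", c), 0) + v
        bad = {s for s, total in sums.items() if total < k}
        if not bad:
            return cells          # stable: every slice meets the threshold
        cells = {(r, c): v for (r, c), v in cells.items()
                 if ("row", r) not in bad and ("col", c) not in bad}

def kappa(cells):
    lo, hi = 0, len(cells)        # diamonds are nested, so the test is monotone
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if diamond(cells, mid):   # non-empty mid-diamond: kappa >= mid
            lo = mid
        else:
            hi = mid - 1
    return lo

# A full 3-by-3 grid of unit (COUNT) measures has kappa = 3.
```

Because the diamonds are nested, the emptiness test is monotone in k, which is what makes binary search applicable.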
Diamonds are defined over data cubes where the measures all have the same sign. We
prove that the more general case—computing a diamond where measures include both
positive and negative values—is NP-hard.
We present diamond cube experiments on large data sets of more than 100 million rows.
Our custom implementation generates diamonds more than twenty-five times faster, on a
large data set, than popular database engines.
In Chapter 2, background and related work provide a context for the proposed diamond
cube operator. The diamond cube is formally defined in Chapter 3 and we show that dia-
monds are nested and there is a unique diamond for any integer value k ≥ 0. In Chapter 4,
we discuss the complexity of various algorithms designed to compute diamonds. In Chapter 5,
we present the results from our experimentation with both real and systematically-generated
synthetic data sets. In Chapter 6 we investigate the role that diamonds might
play in the solution of three NP-hard problems: Largest Perfect Cube, Densest Cube with
Limited Dimensions and Heaviest Cube with Limited Dimensions.
1.1 Motivation
The desire to explore and discover new information from within the vast quantities of data
available in disparate fields has motivated research in many areas: On-Line Analytical Pro-
cessing (OLAP) [Morf 07], databases [Hale 01,Ilya 08] and data mining [Chak 06,Stav 07]
amongst others. The motivation for this dissertation is a desire to find a time-efficient im-
plementation for a particular type of multidimensional query: one that can be executed
simultaneously against each dimension of a large data set. For example, Hirsch [Hirs 05]
introduced a measure—the h-index—to characterise the scientific output of a researcher.
The h-index for a given author is the maximum h such that she has h papers each cited at
least h times. The h-index provides a balanced view of an author’s work in that it neither
rewards authors with large publication lists of papers that no-one cites, nor authors with
only one or two widely-cited papers. Calculating the h-index of a researcher is a two-
dimensional example of the kind of query in which we are interested. This type of query
is difficult to express in both Structured Query Language (SQL) and multidimensional
query languages, such as MultiDimensional eXpressions (MDX). Similarly, sub-sampling
queries as discussed in the next section are also difficult to express. We, therefore, intro-
duce the diamond cube operator to fill that void.
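As a concrete illustration of this kind of per-dimension threshold query, the h-index just described can be computed in a few lines (illustrative code, not part of the dissertation):

```python
def h_index(citations):
    """Maximum h such that at least h papers have at least h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i            # the i most-cited papers each have >= i citations
        else:
            break
    return h

# An author whose papers are cited [10, 8, 5, 4, 3] times has h-index 4:
# four papers each cited at least four times, but not five cited five times.
```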
1.1.1 Sub-sampling
Dealing with large data is challenging. One way to meet this challenge is to choose a
sub-sample—select a subset of the data—with desirable properties such as representativ-
ity, conciseness, or homogeneity. In signal and image processing, software sub-samples
data [Poli 99] for visualisation, compression, or analysis purposes: commonly, images
are cropped to focus the attention on a particular segment. In databases, researchers
have proposed similar sub-sampling techniques [Babc 03, Gant 00], including iceberg
queries [Fang 98, Pei 05, Xin 03] and top-k queries [Loh 02a, Loh 02b]. Such reduced
representations are sometimes of critical importance to get good on-line performance in
Business Intelligence (BI) applications [Aoui 09b, Fang 98]. Even when performance is
not an issue, browsing and visualising the data frequently benefit from reduced views [Ben
06a].
Often, business analysts are interested in distinguishing elements that are most crucial to
their business, such as the k products jointly responsible for 50% of all sales, from the
long tail [Ande 06]—the lesser elements. The computation of icebergs, top-k elements,
or heavy-hitters has received much attention [Care 97, Corm 05, Corm 04]. This type of
query can be generalised so that interactions between dimensions are allowed. For example,
a business analyst might want to compute the smallest set of stores and business
hours jointly responsible for over 80% of the sales. In this new setting, the heads and
tails of the distributions must be described using a multidimensional language; computationally,
the queries become significantly more difficult. Hence, analysts often process
dimensions one at a time: perhaps they would focus first on the most profitable business
hours, and then aggregate sales per store, or perhaps they would find the most profitable
stores and aggregate sales per hour. This dissertation proposes a general model, of which
the uni-dimensional analysis is a special case, that has acceptable computational costs and
a theoretical foundation. In the two-dimensional case, the proposal is a generalisation
of ITERATIVE PRUNING [Kuma 99], a graph-trawling approach used to analyse social
networks.

Table 1.1: Sales (in million dollars) with a 5,10 sum-diamond shaded: stores need to have
sales above $10 million whereas product lines need sales above $5 million.
To illustrate the proposal in the BI context, consider the following example. Table 1.1
represents the sales of different items in different locations. Typical iceberg queries might
be requests for stores having sales of at least 10 million dollars or product lines with sales
of at least 5 million dollars. However, what if the analyst wants to apply both thresholds
simultaneously? He or she might contemplate closing both some stores and some product
lines. In our example, applying the constraint on stores would close Chicago, whereas
applying the constraint on product lines would not terminate any product line. However,
once the shop in Chicago is closed, we see that the product line TV must be terminated
which causes the closure of the Berlin store and the termination of two more product lines
(Game console and DVD player). This multidimensional pruning query selects a subset
of attribute values from each dimension that are simultaneously important. For all the
attribute values remaining in the diamond, every column (slice) in the location dimension
sums to at least 10 and every row (slice) in the product dimension sums to at least 5. There
are many other similar applications:
• In Web traffic analysis, one could seek predominant traffic sources linking to pages
benefiting substantially from these sources.
• A list could be generated of all movies highly popular among individuals watching
many of these movies.
• One could find products with sales above 10 000 units during months where 100 000
units of these products were sold, to customers who bought at least 5 000 such units
during the selected months.
The proposed operation is a diamond dice [Webb 07], which produces a diamond, as for-
mally defined in Chapter 3. The removal of attributes from Table 1.1 in this example
illustrates an algorithm for computing the diamond, which is formally discussed in Chap-
ter 4.
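The pruning just illustrated can be sketched in a few lines. This is a naive illustrative sketch, not the optimised algorithms of Chapter 4; the data transcribe Table 1.1:

```python
# Naive sketch of a 2-D SUM-based diamond dice on the Table 1.1 data:
# repeatedly delete stores whose sales fall below 10 and product lines
# whose sales fall below 5, until the cube stabilises.

sales = {  # product -> {store -> sales in million dollars}
    "TV":           {"Chicago": 3.4, "Montreal": 0.9, "Miami": 0.1, "Paris": 0.9, "Berlin": 2.0},
    "Camcorder":    {"Chicago": 0.1, "Montreal": 1.4, "Miami": 3.1, "Paris": 2.3, "Berlin": 2.1},
    "Phone":        {"Chicago": 0.2, "Montreal": 6.4, "Miami": 2.1, "Paris": 3.5, "Berlin": 0.1},
    "Camera":       {"Chicago": 0.4, "Montreal": 2.7, "Miami": 5.3, "Paris": 4.6, "Berlin": 3.5},
    "Game console": {"Chicago": 3.2, "Montreal": 0.3, "Miami": 0.3, "Paris": 2.1, "Berlin": 1.5},
    "DVD player":   {"Chicago": 0.2, "Montreal": 0.5, "Miami": 0.5, "Paris": 2.2, "Berlin": 2.3},
}
cells = {(p, s): v for p, row in sales.items() for s, v in row.items()}

def diamond_2d(cells, product_min, store_min):
    cells = dict(cells)
    while True:
        prod_sum, store_sum = {}, {}
        for (p, s), v in cells.items():
            prod_sum[p] = prod_sum.get(p, 0.0) + v
            store_sum[s] = store_sum.get(s, 0.0) + v
        bad_p = {p for p, t in prod_sum.items() if t < product_min}
        bad_s = {s for s, t in store_sum.items() if t < store_min}
        if not bad_p and not bad_s:
            return cells
        cells = {(p, s): v for (p, s), v in cells.items()
                 if p not in bad_p and s not in bad_s}

result = diamond_2d(cells, 5, 10)
# The surviving diamond keeps Camcorder, Phone and Camera in
# Montreal, Miami and Paris, matching the walk-through above.
```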
Other approaches that seek important attribute values, e.g. the Skyline operator [Borz 01,
Mors 07], Dominant Relationship Analysis [Li 06] and Top-k dominating queries [Yiu 07],
require dimension attribute values to be ordered (e.g. by the distance between a hotel and
a conference venue) so that data points can be compared. The proposed operator requires
no such ordering.
1.2 Contribution
This dissertation establishes properties of the diamond cube together with a non-trivial
theory. In every d-dimensional data cube whose measure values are all non-negative,
given a set of positive integers {k1, k2, . . . , kd}, there exists a unique maximal subcube
where every slice of dimension i has sum at least ki . For example, we could choose
k1 = 5 and k2 = 10 for the two-dimensional cube of Table 1.1. The diamond cube
operator extracts this structure. We prove that diamonds are not only unique but they are
nested. By varying the ki ’s we get a collection of diamonds for a cube and these diamonds
form a lattice. A lattice is a partially ordered set in which all finite subsets have a least
upper bound and greatest lower bound. For example, Table 1.1 contains the 5 × 7, 5 × 8
and 6 × 10 diamonds as illustrated in Figure 1.1.
Algorithms to extract diamonds have been designed, implemented and tested on real data
sets as well as on artificially constructed data with predefined distributions. Our hand-
crafted implementation generates diamonds more than fifty times faster than using generic
database queries on data cubes containing in excess of forty million facts.
Finding dense subcubes in large data is a difficult problem [Lee 06, Barb 01, Ben 06a].
We investigate the role that diamond cubes might play in the solution of three NP-hard
problems that seek dense subcubes. We provide evidence that the diamond operator is a
sensible heuristic for two of them and can provide insight into a potential solution for the
third.
              Chicago  Montreal  Miami  Paris  Berlin  Totals
TV               3.4      0.9     0.1    0.9     2.0     7.3
Camcorder        0.1      1.4     3.1    2.3     2.1     9.0
Phone            0.2      6.4     2.1    3.5     0.1    12.3
Camera           0.4      2.7     5.3    4.6     3.5    16.5
Game console     3.2      0.3     0.3    2.1     1.5     7.4
DVD player       0.2      0.5     0.5    2.2     2.3     5.7
Totals           7.5     12.2    11.4   15.6    11.5    58.2

(a) 5 × 7 diamond (original cube)
Chapter 2
In this chapter we present the two major paradigms for storing and extracting information
from large data—the relational and multidimensional models. For both the relational and
multidimensional database environments, we provide examples of how a single query can
be applied to multiple dimensions simultaneously.
Diamonds are related to other work in the following ways:
Since its introduction by E. F. Codd in June 1970 [Codd 70], the relational model has
become the standard for storing transactional data. It is implemented in many commer-
cial Relational Database Management Systems (RDBMSs) such as Oracle Database, Mi-
crosoft SQLServer and MySQL. The base operators of relational algebra and their sym-
bols are given in Figure 2.1. Although the natural join operator appears in Codd’s original
paper [Codd 70], this operator can also be expressed as a Cartesian product followed by a
σ  selection
π  projection
×  Cartesian product
∪  set union
−  set difference
∩  set intersection
⋈  natural join

Figure 2.1: The base operators of relational algebra and their symbols.
selection. The operations of relational algebra, except selection, projection and Cartesian
product, require that the relations be union-compatible, i.e. having the same number and
type of attributes; the attributes must be presented in the same order in each relation.
Structured English Query Language [Cham 74], which later became Structured Query
Language (SQL), was developed at IBM as the language to query relational data. SQL
provides a structured set of English templates with which to obtain information from ta-
bles. Since its inception, SQL has undergone many changes and refinements. For instance,
the language based on the original algebra did not include aggregate functions, which were
first proposed by Klug [Klug 82]. Agrawal [Agra 87] introduced a recursion operator to
the relational model. He suggested that a database system that has special algorithms and
data structures to directly implement an operator may outperform a system that relies on
an iterative procedure to execute the same function. The current standard [ISO 08] makes
SQL Turing complete by specifying the syntax and semantics for, among others, loops,
decisions and recursive queries—making it much more powerful than the original rela-
tional model. However, not all commercial vendors support all the SQL language features
and they sometimes include their own proprietary extensions.
Some queries that can be expressed simply in plain English are difficult to express in
SQL, despite the original design goals to provide a simple English-like interface for non-
programmers to interact with the database. Suppose we have a chain of department stores
and record information about our sales, inventory and employees in a typical RDBMS.
The relations Sale, Employee and Product would be linked with appropriate use of foreign
(a) Product relation

prodID  type
1       camcorder
2       ipod
3       camera
4       cell phone

(b) Sale relation used for Queries 2.1, 2.2 and 2.3

prodID  salesPersonID
1       a
1       b
2       c
2       d
3       a
3       b
3       d
4       c
4       e

(c) Employee relation

empID  name
a      Paul
b      Carol
c      Phillip
d      Julie
e      Elaine

Figure 2.2: Example relational database. Sale relation has foreign keys: salesPersonID →
empID of the Employee table, and prodID → prodID of the Product table.
keys. (See Figure 2.2.) We could ask straightforward queries like: What were the total
sales for employee Julie last month? as in Query 2.1.
SELECT Sum(s.sales)
FROM Sale s, Employee e
WHERE s.date BETWEEN '2009-07-01' AND '2009-08-01'
AND e.name = 'Julie'
AND e.empID = s.salesPersonID;

Query 2.1: Total sales for employee Julie over a one-month period.
If, however, we wish to compare all sales persons over a particular time period and sales
of a particular product, the query becomes more complex, as shown in Query 2.2.
When we are interested in mining information simultaneously on two attributes, the query
is much more difficult to express in standard SQL. Suppose our sales data has the form
of Figure 2.2(b) and the required query is: Find a group of sales persons S and a group of
products P, such that each sales person in S sold at least n of the products in P; moreover,
each product in P had at least n sales people selling it. We want n to be maximum, and
we want both S and P to be as large as possible. We are interested in finding the sales
people who have been successful in selling the most varied products and at the same time
SELECT e.name, Sum(s.sales)
FROM Sale s, Employee e, Product p
WHERE s.date BETWEEN '2009-07-01' AND '2009-08-01'
AND p.type = 'ipod'
AND e.empID = s.salesPersonID
AND s.prodID = p.prodID
GROUP BY e.name;

Query 2.2: Sales per employee of products of type 'ipod' over a one-month period.
interested in the most popular products sold by a variety of sales personnel. The solution to
this query is that two sales persons, Paul and Carol, sold two products: camcorder and camera.
We might try to express this query in SQL using Query 2.3. We constrain the number of
sales personnel and products to be greater than two, although we might not have an idea
of the total number of sales personnel and products that constitute the “largest” set. In any
case, the result from Query 2.3 is an empty set—not what we are looking for.
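What is awkward in SQL takes only a short iterative-pruning loop in procedural code. A sketch (illustrative, not the dissertation's implementation), using the Sale relation of Figure 2.2 with k = 2:

```python
# Sketch: iteratively prune sales people selling fewer than k distinct
# products and products sold by fewer than k sales people (k = 2),
# using the Sale relation of Figure 2.2.

sale = {(1, "a"), (1, "b"), (2, "c"), (2, "d"), (3, "a"),
        (3, "b"), (3, "d"), (4, "c"), (4, "e")}

def prune(pairs, k):
    pairs = set(pairs)
    while True:
        by_product, by_person = {}, {}
        for prod, emp in pairs:
            by_product[prod] = by_product.get(prod, 0) + 1
            by_person[emp] = by_person.get(emp, 0) + 1
        kept = {(prod, emp) for prod, emp in pairs
                if by_product[prod] >= k and by_person[emp] >= k}
        if kept == pairs:
            return pairs
        pairs = kept

core = prune(sale, 2)
# Leaves Paul ('a') and Carol ('b') with the camcorder (1) and
# camera (3): the answer stated in the text.
```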
On-line analytical processing (OLAP) is a database acceleration technique used for deduc-
tive analysis, which the OLAP Council [OLAP] describes in five key words: fast analysis
of shared multidimensional information. It is used in decision support systems to provide
answers to “big picture” queries such as
What were our average quarterly sales in 2006 for each region?
Figure 2.3: A cube in three dimensions: time (March to August), location (e.g. Montreal,
Quebec, Ontario, Paris, Lyon, New York, Detroit) and product (Shoe, Dress, Table, Chair).
Among the stored measures, the cell (March, Montreal, Shoe) holds 10.
Researchers and developers have yet to agree on a single multidimensional model for
OLAP [Rizz 06]. Our model is formally described in Section 3.1.
The main objective of OLAP is to have constant-time, or near constant-time, answers for
many typical queries. To achieve such acceleration one can create a cube [Rumm 70]
of data, a map from attribute values in every dimension to a given measure (v1 × v2 ×
· · · × vd → M ). Figure 2.3 illustrates such a cube in three dimensions, time, location
and product, built from the RDBMS relations in Section 2.1. The measure stored at the
intersection of the axes represents a sales total: for example, there were 10 units of shoes
sold in Montreal in March. A business analyst might be interested in seeking trends of
sales across products and time—location is not relevant in this query. In this case, the cube
could be rolled-up, collapsing the dimension hierarchy from three to two. The measure
stored at the intersection of the axes represents the total sales of a product in a given month
and thus would reflect that 30 units of shoes were sold in March. Similarly, if time was
not important, but location and product were, one could roll-up the cube by aggregating
the sums over time, reflecting that 40 units of shoes were sold in Montreal. In this way
a lattice of cubes can be generated. Figure 2.4 illustrates a four-dimensional data cube
lattice under roll-ups. Data are stored at a coarser granularity in cube ABC than in cube
ABCD, i.e. the measures do not distinguish different attribute values for dimension D,
and a roll-up can derive ABC from ABCD.

Figure 2.4: Part of the lattice for a four-dimensional data cube: ABCD; the two-dimensional
cubes AB, AC, BC, AD, BD, CD; and the one-dimensional cubes A, B, C, D.
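A roll-up is simply a group-by aggregation over the retained dimensions. A sketch follows; the three cells are chosen to be consistent with the shoe-sales narrative, and the exact months and locations involved are illustrative assumptions:

```python
# Sketch: rolling up a (time, location, product) -> measure cube by
# aggregating away dimensions. Cell values follow the shoe example;
# the second March cell and second Montreal cell are assumed.
from collections import defaultdict

cube = {
    ("March", "Montreal", "Shoe"): 10,
    ("March", "Paris",    "Shoe"): 20,   # assumed location
    ("June",  "Montreal", "Shoe"): 30,   # assumed month
}

def roll_up(cube, keep):
    """Aggregate the cube over the dimension positions listed in `keep`."""
    out = defaultdict(int)
    for key, measure in cube.items():
        out[tuple(key[i] for i in keep)] += measure
    return dict(out)

by_time_product = roll_up(cube, (0, 2))      # 30 units of shoes in March
by_location_product = roll_up(cube, (1, 2))  # 40 units of shoes in Montreal
```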
It is possible to generate a diamond cube using a language such as Microsoft’s MultiDi-
mensional eXpression language (MDX) for OLAP. We can create subcubes by restricting
values on each dimension in turn until no change is observed in the cube. For example us-
ing the cube from Figure 2.3, Query 2.4 executes the first iteration to compute a diamond:
the query must be repeated until there is no change.
A specialisation of the diamond cube is found in Kumar et al.’s work searching for emerg-
ing social networks on the Web [Kuma 99]. Our approach is a generalisation of their
two-dimensional ITERATIVE PRUNING algorithm. Diamonds are inherently multidimen-
create subcube [Sales] as
select {filter([location].children, Measures.[Measure] > 100)}
on 0,
[time].children on 1,
[product].children on 2
from Sales
Query 2.4: Query to compute the 100 × 100 × 100 diamond. All three statements must
be repeated in order, until there is no more change.
sional.
Kumar et al. [Kuma 99] model the Web as a directed graph and seek large dense bipartite
subgraphs. Recall that a bipartite graph is a graph whose vertices can be divided into two
disjoint sets U and V such that every edge connects a vertex in U to one in V : U and V
are independent sets. A bipartite graph is dense if most of the vertices in U and V are
connected. Kumar et al. hypothesise that the signature of an emerging Web community
contains at least one “core”, which is a complete bipartite subgraph with at least i vertices
from U and j vertices from V . In their model, the vertices in U and V are Web pages and
the edges are links from U to V . Seeking an (i, j) core is equivalent to seeking a perfect
two-dimensional diamond cube (all cells are allocated). The iterative pruning algorithm
is a specialisation of the basic algorithm we use to seek diamonds: it is restricted to two
dimensions and is used as a preprocessing step to prune data that cannot be included in the
(i, j) cores.
Although their paper has been widely cited [Redd 01, Yang 05, Ragh 03, Holz 06], it
appears that this dissertation is the first to propose a multidimensional extension to their
approach and to provide a formal analysis.
RDBMSs have optimisation routines that are especially tuned to address both basic and
more complex SELECT . . . FROM . . . WHERE . . . queries. However, there are some
classes of queries that are difficult to express in SQL, or that execute slowly, because
suitable algorithms are not available to the underlying query engine. Skyline, top-k and
top-k-dominating queries are examples of such queries, ones that seek a representative
subsample of the data.
SELECT *
FROM Hotels h
WHERE h.city = 'Saint John' AND NOT EXISTS(
    SELECT *
    FROM Hotels h1
    WHERE h1.city = 'Saint John'
    AND h1.distance <= h.distance
    AND h1.price <= h.price
    AND (h1.distance < h.distance OR h1.price < h.price));

Query 2.5: Skyline query seeking the dominating hotels in Saint John.
The Skyline query [Borz 01] seeks a set of points where each point is not “dominated” by
some others: a point is included in the skyline if it is as good or better in all dimensions
and better in at least one dimension. Attributes, e.g. distance or cost, must be ordered.
Therefore, indexes can be beneficial. Query 2.5 is an example that seeks the set of dom-
inating hotels in Saint John, i.e. they are cheaper or closer to some landmark (possibly a
conference venue) than all others. Börzsönyi et al. [Borz 01] present algorithms to imple-
ment the skyline operator and the results of their execution on synthetic data. The overall
performance of the skyline operator was consistently good across all data sets presented
in their study.
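In plain code the skyline reduces to a dominance test inside a double loop. A sketch with hypothetical hotel data (names, distances and prices are invented for illustration):

```python
# Sketch: computing the skyline of Query 2.5 directly. Hotels are
# (name, distance, price) tuples; the data are hypothetical.

hotels = [("A", 1.0, 120), ("B", 0.5, 150), ("C", 2.0, 90), ("D", 1.5, 140)]

def dominates(h1, h2):
    """h1 dominates h2: no worse on both dimensions, better on at least one."""
    return (h1[1] <= h2[1] and h1[2] <= h2[2] and
            (h1[1] < h2[1] or h1[2] < h2[2]))

skyline = [h for h in hotels
           if not any(dominates(other, h) for other in hotels)]
# D is dominated by A (farther away and more expensive), so the
# skyline is A, B and C.
```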
Since the skyline was proposed as an extension to SQL, the authors discussed its inter-
operation with other operators, e.g. join. Typically, a skyline operator would be applied
after any joins or GROUP_BY operations. It can be pushed through a join—applied be-
fore the join is computed—if it is non-reductive as defined by Carey [Care 97]. Informally,
such a join is between an attribute x, which cannot be null, and at least one element y in
the joining relation for which x = y holds. The skyline operation can also be pushed
into a join. Consider the example query that asks for employees with high salary and
who work in departments with low budget [Borz 01]. The employees with maximum
salary from each department can be computed first (part of the skyline), then joined with
the departments and the remaining part of the skyline—low budget—computed. Simi-
larly, the computation of a diamond can be pushed into a join. Suppose we are seek-
ing the employees who have made more than $100 000 in sales, of high value products,
and who have sold products whose sales total more than $25 000 000. For the relation
Sale(empID,prodID,total), we can compute the diamond cube as follows:
1. Select all employees who have more than $100 000 in sales. Save the result in
T1(empID, sumOfSales).
2. Select all products whose sales total more than $25 000 000. Save the result in
T2(prodID, sumOfProductSales).
Executing Steps 1 and 2 before the join would result in fewer tuples over which to finalise
the diamond computation than if the join was executed first. The employees and products
discarded before the join do not meet the constraint of sales greater than $100 000 or
$25 000 000, respectively, and thus would never be part of a solution to this query.
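Assuming a simple Sale(empID, prodID, total) relation, Steps 1 and 2 can be sketched in Python (the tuples and names are ours; the thresholds are those of the example):

```python
# Steps 1 and 2 as pre-filters applied before the join. The Sale relation
# and its tuples are illustrative.
from collections import defaultdict

sales = [
    ("e1", "p1", 26_000_000),   # one very large sale
    ("e2", "p2", 50_000),       # e2 and p2 fall below both thresholds
    ("e3", "p1", 200_000),
]

emp_total, prod_total = defaultdict(int), defaultdict(int)
for emp, prod, total in sales:
    emp_total[emp] += total
    prod_total[prod] += total

T1 = {e for e, s in emp_total.items() if s > 100_000}        # Step 1
T2 = {p for p, s in prod_total.items() if s > 25_000_000}    # Step 2

# Only tuples that survive both filters take part in the join and in the
# final diamond computation.
pruned = [t for t in sales if t[0] in T1 and t[1] in T2]
print(pruned)  # [('e1', 'p1', 26000000), ('e3', 'p1', 200000)]
```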
As with the skyline query, a diamond dice can be partially executed in SQL. It is, however,
unwieldy to express and it runs slowly. (See Section 5.7.)
2.4.2 Top-k
Another query, closely related to the skyline query, is that of finding the “top-k” data
points. Donjerkovic and Ramakrishnan [Donj 99] frame the top-k problem as one of
finding the best answers quickly rather than all the answers. Their method is applicable
to unsorted and unindexed attributes. (Otherwise the index would be used to provide an
exact result.) The top-k query on attribute χ can be rephrased as σ_{χ>x}, where x is a
cutoff parameter that is expected to return k items. The authors' example explains this:
“List the top 10 paid employees in the sales department”, which can be translated to “List
the employees from the sales department whose salary is greater than x”.
System statistics are used to estimate x such that the query returns more than k items. If
fewer than k items are returned the query must be restarted; x is an unknown, but it is
sufficient that x is chosen so that more than k items are in the result set. The authors show
that using their probabilistic approach is much more efficient than the simplest execution
plan: execute the query, sort the result and discard all but k items.
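The cutoff-and-restart idea can be sketched as follows in Python; the policy for weakening x on a restart is our own placeholder, since the paper derives x from optimiser statistics:

```python
# Top-k as a cutoff selection sigma_{chi > x}. The estimate x would come
# from system statistics; here it is passed in, and the rule that weakens x
# after a failed attempt is an illustrative placeholder.

def top_k(rows, key, k, x):
    while True:
        candidates = [r for r in rows if key(r) > x]   # sigma_{chi > x}
        if len(candidates) >= k or len(candidates) == len(rows):
            return sorted(candidates, key=key, reverse=True)[:k]
        x = x - (abs(x) + 1)   # estimate admitted too few rows: restart

salaries = [55, 72, 48, 90, 66, 81]
print(top_k(salaries, lambda s: s, 3, 70))   # [90, 81, 72]
```

With x = 70 the selection already admits exactly three rows; with an over-aggressive estimate such as x = 100, the first selection is empty and the query restarts with a weaker cutoff.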
Top-k dominating queries [Yiu 07] combine skyline and top-k to present a set of points
that dominate others in all dimensions, but keep the result set to a user-defined size. The
data is indexed with a tree where each node stores the bounding box delimiting its area,
together with the number of points that fall within the bounding box. The number of other
points dominated are computed with the top-k retained in the solution. The top-k query
may return data points that are not in the skyline, but nevertheless, do dominate. The
result can be obtained without any special domain knowledge or the need for the user to
supply a ranking function, or weights for the dimensions. The objectives of skyline and
top-k queries and diamond dicing are orthogonal. The diamond cube comprises the largest
set of attributes that satisfy some multidimensional constraints, whereas top-k and skyline
queries seek to reduce the number of data points in the result.
In Formal Concept Analysis [Will 89, Godi 95] a Galois (concept) lattice is built from a
binary relation. It is used in machine learning to identify conceptual structures among data
sets. A formal concept is a set of objects—extent—together with the set of attributes—
intent—that describe each of the objects. For example, a set of documents and the set of
search terms those documents match, form a concept. Given the data in Table 2.5a, the
smallest concept including document 1 is the one with documents {1, 2} and search terms
{A,B,C,D}. Concepts can be partially ordered by inclusion and thus can be represented
by a lattice as in Figure 2.5b.
Galois lattices are related to diamond cubes: a perfect 3 × 3 diamond is described in
the central element of the lattice of Figure 2.5b. Formal Concept Analysis is typically
restricted to two dimensions although Cerf et al. [Cerf 09] generalise formal concepts by
presenting an algorithm that is applied to more than two dimensions. Their definition of a
closed n−set—a formal concept in more than two dimensions—states that each element is
related to all others in the set and no other element can be added to this set without breaking
the first condition. It is the equivalent of finding a perfect diamond in n dimensions. For
example, given the data of Table 2.1 {(α, γ)(1, 2)(A, B)} is an example of a closed 3-set
and also a perfect 4,4,4-diamond.
Diamonds have the potential to be useful in different areas and we present two specific ap-
plications as examples. First, we suggest they could be used when generating tag clouds—
Table 2.1: A 3-dimensional relation with closed 3-set {(α, γ)(1, 2)(A, B)}. [Table body omitted: one binary slice over dimension 2 (rows 1–4) and dimension 3 (columns A–C) for each attribute value α, β, γ of dimension 1.]
Figure 2.5: Documents and search terms in an information retrieval system and the corresponding Galois lattice. (a) A 3 × 3 diamond is embedded in this binary relation. (b) Galois lattice. Each element (concept) in the lattice is defined by its extent and intent.

(a) Documents × search terms:

        A B C D E
    1   1 1 1 1 0
    2   1 1 1 1 0
    3   1 1 1 0 1
    4   1 0 1 0 0
    5   0 1 1 0 1

(b) Concepts, from top to bottom of the lattice: {12345, ∅}; {1234, AC}; {1235, BC}; {123, ABC}; {12, ABCD}; {35, BCE}; {3, ABCE}; {∅, ABCDE}.
visual indexes that typically describe the content of a Web site. Tag clouds are defined and
discussed in the next section. Second, one of the original motivations for this work was
to identify a single large, dense subcube in any given data set. Once found, a dense re-
gion can be exploited to facilitate visualisation of the data or be used for further statistical
analysis. Related work in this area is discussed in Section 2.5.2.
2.5.1 Visualisation
It is difficult to visualise large multidimensional data sets [Mani 05, Ben 06a, Tech 05].
Aouiche et al. proposed tag clouds as an alternative [Aoui 09a]. Tag clouds are used
Figure 2.6: A tag cloud: more important words are expressed in larger fonts.
Figure 2.8: The allocated cells are in dark grey: they are moved together by the reorganisation. [Figure body omitted: an 8 × 10 binary matrix (lines L1–L8, columns p1–p10) shown before and after reorganisation.]
One motivation for the diamond dice is to find a single, non-trivial dense region within
a data cube. If the region has sufficient density, then hybrid storage becomes possible:
the dense region can be stored as a multidimensional array, whereas the tuples outside
of the dense region can be stored sparsely, e.g. as a list of locations. Another benefit of
partitioning dense and sparse regions is that it enables visual representation of data to
be enhanced. The dense regions can be represented close together on a single screen.
Ben Messaoud et al. [Ben 06a] employ Multiple Correspondence Analysis (MCA) to
reorganise the cube so that dense subcubes are moved together for easy visualisation: A
complete disjunctive table representing the entire cube is built and then MCA is applied
to obtain factorial axes. Eigenvalues for each axis are computed and the user chooses a
number of axes to visualise. We repeat the example from Ben Messaoud et al.’s paper in
Figure 2.8. As with diamond dicing, this approach measures the contribution of a set of
cells to the overall cube. No timings were given in their paper, so it is difficult to compare
the efficiency of MCA and diamond dicing, although the outputs are quite similar.
Some researchers seek multiple dense regions rather than a single one: to more efficiently
compute range sum queries [Lee 06] or as a preprocessing step before further statistical
analysis [Barb 01]. Lee uses a histogram-flattening technique by computing three relative
21
density measures for each attribute in each dimension. Dense regions in each dimension
are then used to build multi-dimensional dense regions. The candidate results are then
refined iteratively. Experiments conducted on synthetic data seeded with clusters, show
that this method achieves good space reduction. However, there is no evidence of experi-
mentation on real data or the time to compute their structure.
Barbará and Wu [Barb 01] seek to identify dense regions that can be further analysed
by statistical methods. They shuffle attributes within dimensions based on a similarity
measure. They do not mention whether rearranging attributes in one dimension has an
adverse effect on the positioning of attributes in another dimension. Intuitively, one would
expect there to be some issues with this process. The example provided in their paper
identified four dense regions and showed how their algorithm would gather them together.
The diamond dice applied to their example would find the largest dense region. If one then
applied a diamond dice to the pruned area iteratively, all the regions identified by Barbará’s
method would also be identified by the diamond dice for this particular example.
Chapter 3
In this chapter a formal model of the diamond cube is presented. We show that diamonds
are nested, much like Matryoshkas1 , with a smaller diamond existing within a larger di-
amond. We prove a uniqueness property for diamonds and we establish upper and lower
bounds on the parameter k for both COUNT and SUM-based diamond cubes. One statistic
of a cube C is its carat-number, κ(C); its robustness is discussed in Section 3.6. We con-
clude this chapter by presenting the related NP-complete problem for cubes that contain
both positive and negative measures.
As stated in Chapter 2, researchers and developers have yet to agree on a single multidi-
mensional model [Rizz 06] for OLAP. Our simplified formal model incorporates several
widely-accepted definitions for the terms illustrated in Figure 3.1, together with new terms
associated specifically with diamonds. For clarity, all terms are defined in the following
paragraphs.
A dimension D is a set of attribute values that defines one axis of a multidimensional
data structure. Each dimension Di has a cardinality ni , the number of distinct attribute
1. Russian nesting dolls [Roos 09]
Figure 3.1: (a) A 3-D cube with dimensions location, product and time. [Remaining panels omitted: only the axis labels, attribute values (e.g. Paris, Montreal, Chair, Shoe, March–Aug) and a few cell measures survive extraction.]
Otherwise, the measure is ⊥ and we say the cell is empty—an unallocated cell. For
the purposes of this dissertation, a measure is a single value. In more general OLAP
applications, a cube may map to several measures. Also, measures may take values other
than real-valued numbers—booleans, for example.
A slice is the cube C′ = (D′, f′) obtained when a single attribute value is fixed in one
dimension of cube C = (D, f). For example, if we slice on dimension three of C with
value v3 ∈ D3, then we have f′(x1, x2, x3, . . . , xd−1) = f(x1, x2, v3, x3, x4, . . . , xd−1) where
x1 ∈ D1, x2 ∈ D2 and, for 3 ≤ i ≤ d − 1, xi ∈ Di+1, resulting in a cube with d − 1
dimensions.
A dice defines a subcube S by specifying attribute values D̂1, D̂2, . . . , D̂d where each
D̂i ⊆ Di. The cells in the slices corresponding to the unspecified attribute values are
deallocated—deallocation of a cell sets its value to ⊥. The cube is considered to still
have the same set of dimensions, although some may have attribute values whose slices
contain only deallocated cells.
An aggregator is a function, σ, that assigns a real number to a slice of a cube. We
choose aggregators such that they are commutative; their return value is not affected by any
change in order of the values of the slice. For example, SUM is an aggregator: SUM(slicei)
= v1 + v2 + · · · + vm where m is the number of allocated cells in slicei and the vi's are the
measures. A slice S′ is a subslice of slice S (written S′ ⊑ S) if every allocated cell in S′
is also an allocated cell in S and they have the same measures. An aggregator σ is
monotonically non-decreasing if S′ ⊑ S implies σ(S′) ≤ σ(S). Similarly, σ is monotonically
non-increasing if S′ ⊑ S implies σ(S′) ≥ σ(S). Monotonically non-decreasing
operators include COUNT, MAX and SUM over non-negative measures. Monotonically
non-increasing operators include MIN and SUM over non-positive measures. MEAN and
MEDIAN are neither monotonically non-increasing, nor non-decreasing functions.
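These definitions can be checked on toy measures in a few lines of Python (a subslice simply drops some allocated cells while keeping the measures of the rest):

```python
# SUM over non-negative measures is monotonically non-decreasing under the
# subslice relation, while MEAN is neither non-decreasing nor
# non-increasing. Toy measures.

S  = [3, 1, 4, 1, 5]    # measures of the allocated cells of a slice
S1 = [3, 4, 5]          # a subslice of S: two cells dropped, measures kept

assert sum(S1) <= sum(S)          # SUM can only shrink on a subslice

mean = lambda xs: sum(xs) / len(xs)
print(mean(S), mean(S1))          # 2.8 4.0 -- MEAN grew on the subslice
```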
Many OLAP aggregators are distributive, algebraic and linear. The algorithms we used
to compute the diamond cube require a linear aggregator. An aggregator σ is distributive
[Gray 96] if there is a function F such that, for all 0 ≤ k < n − 1,

    σ(a0, . . . , an−1) = F(σ(a0, . . . , ak), σ(ak+1, . . . , an−1)).

For example, SUM(a0, . . . , ak, ak+1, . . . , an−1) = SUM(a0, . . . , ak) + SUM(ak+1, . . . , an−1).
If G is (COUNT, SUM) then F((c1, s1), (c2, s2)) = (c1 + c2, s1 + s2).
For example, (COUNT, SUM)(a0, a1) = F((1, a0), (1, a1)) = (1 + 1, a0 + a1) and, generally,
(COUNT, SUM)(a0, . . . , ak, ak+1, . . . , an−1) = F((k + 1, a0 + a1 + · · · + ak), (n − 1 − k, ak+1 +
ak+2 + · · · + an−1)) = (n, a0 + a1 + · · · + an−1).
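The (COUNT, SUM) example can be checked directly in Python (the names G and F follow the text):

```python
# Distributivity of the intermediate query G = (COUNT, SUM): partial results
# over any split combine with F((c1, s1), (c2, s2)) = (c1 + c2, s1 + s2).

def G(values):
    return (len(values), sum(values))

def F(left, right):
    (c1, s1), (c2, s2) = left, right
    return (c1 + c2, s1 + s2)

a = [5, 2, 7, 1, 9]
for k in range(len(a) - 1):            # every split point 0 <= k < n-1
    assert F(G(a[:k + 1]), G(a[k + 1:])) == G(a)
print(G(a))   # (5, 24)
```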
An algebraic aggregator σ is linear [Lemi 08] if the corresponding intermediate query G
satisfies
for all arrays a, d, and constants α. SUM and COUNT and AVERAGE are linear functions;
MAX , MIN , ROOT- MEAN - SQUARE and MEDIAN are not linear.
Our formal model maps to the relational model in the following ways: (See Figure 3.2.)
• cube corresponds to a fact table: a relation whose attributes comprise a primary key
and a single measure.
movie reviewer date rating
1 1488844 2005-09-06 3
1 822109 2005-05-13 5
1 885013 2005-10-19 4
1 30878 2005-12-26 4
1 823519 2004-05-03 3
1 893988 2005-11-17 3
1 124105 2004-08-05 4
1 1248029 2004-04-22 3
Figure 3.2: Part of the Netflix [Benn 07] fact table (cube). Attributes (dimensions) are
movie, reviewer and date. Each row is a fact (allocated cell). The measure is rating.
Diamonds are defined over data cubes where the measures all have the same sign. Cubes
with mixed measures (positive and negative) are discussed in Section 3.7. We define a
novel measure for diamonds, the carat, and show that given k1, k2, . . . , kd there is a unique
k1, k2, . . . , kd-diamond.
Given cubes A and B and their respective measure functions fA and fB, A and B are
compatible if fA(x) ≠ ⊥ and fB(x) ≠ ⊥ =⇒ fA(x) = fB(x).
Definition 3.2. Let A and B be two compatible cubes with the same dimensions. The union
of A and B is a cube denoted A ∪ B. Formally, given A = (D, fA) and B = (D, fB), then
A ∪ B = (D, fA∪B) where fA∪B(x) = fA(x) if fA(x) ≠ ⊥, and fB(x) otherwise.
2. In actual diamonds, one carat is equal to 200 milligrammes. The price of a diamond rises exponentially with the number of carats [Ency 07].
[Figure: cubes A and B and their union A ∪ B, over dimensions location, product and time. A allocates cells with measures 20 (Paris) and 10 (Montreal); B allocates 15 (Lyon) and 30 (Montreal); A ∪ B contains all four cells.]
Proof. The proof follows trivially from the monotonicity of the aggregator.
Definition 3.3. Let A and B be two compatible cubes with the same dimensions. The
intersection of two cubes A and B is a cube denoted A ∩ B containing the non-empty
cells that are common to A and B. Formally, given A = (D, fA ) and B = (D, fB ), then
A ∩ B = (D, fA∩B) where

    fA∩B(x) = fA(x) if fB(x) ≠ ⊥, and ⊥ otherwise.
of all subcubes (of some starting cube C) having k1 , k2 , . . . , kd carats. A diamond cube
has k1 , k2 , . . . , kd -carats if it has ki carats over dimension Di . Any reference to a k-carat
diamond cube implies k1 = k2 = . . . = kd = k.
We can think of the maximal cube, the diamond cube, as a dice. However, by a slight
abuse of our definition, and for ease of exposition, we also consider diamond dicing as the
removal of attribute values.
The following propositions show that diamonds are nested and form a lattice. This is
helpful because the x1 , x2 , . . . , xd -carat diamond can be derived from the y1 , y2 , . . . , yd -
carat diamond when yi ≤ xi for all i. In an implementation where materialising the
diamond lattice is considered, an analyst need only choose a subset of elements to store
physically, the others can be derived.
3. When σ is not monotonically non-decreasing (resp. non-increasing), there may not be a unique diamond.
Proposition 3.2. The diamond having k 0 carats (resp. negative carats) over dimensions
i1 , . . . , id is contained in the diamond having k carats (resp. negative carats) over dimen-
sions i1 , . . . , id whenever k 0 ≥ k.
Proof. Let A be the diamond having k carats and B be the diamond having k 0 carats. By
Proposition 3.1, A ∪ B has at least k carats, and because A is maximal, A ∪ B = A; thus,
B is contained in A.
vector associated with every diamond of cube C.
Lemma 3.1. Given a non-empty cube C and a lattice of k-value-vectors, for every non-
empty diamond of C there is a unique most restrictive (k1 , k2 , . . . , kd )-value vector.
We define the most restrictive vector of the empty cube as the (m1, m2, . . . , md−1, md)-
vector, where each mi is ∏_{j, j≠i} (nj + 1).
Diamonds often cover more than one lattice element and Figure 3.5 illustrates the lattice
and diamonds of all possible cubes of shape 2 × 3 × 1. The lattice elements coloured
black are the most restrictive vectors for their respective diamonds. Figure 3.6e is a two-
dimensional example of this phenomenon.
The lattice creates optimisation opportunities: given the (2, 1)-carat diamond X and the
(1, 2)-carat diamond Y , then the (2, 2)-carat diamond must lie in both X and Y . Although
intersecting X and Y does not necessarily yield the (2, 2)-carat diamond, the intersection
is a good starting point for a (2, 2)-carat diamond computation. (See Figure 3.6d.)
Not only do the unique diamonds of a non-empty cube C cover several elements of the
k-valued-vector lattice, but they also form a lattice of their own as the next proposition
shows.
Proposition 3.3. The collection of all distinct diamonds of a cube C forms a lattice.
Proof. First, we need to define a partial ordering over diamonds since a lattice is a par-
tial order. Second, we need to show that for every pair of diamonds of C there exists a
Figure 3.5: Lattice of all possible cubes that are diamonds of shape 2 × 3 × 1. The shaded diamonds cover several lattice elements. Each diamond is associated with its most restrictive vector (shown in reverse print). [Figure body omitted: seven diamonds (#1–#7) arranged between the vectors 1,1,1 and 3,2,6.]
Figure 3.6: (a) The initial data cube C. (b) The (1,2) diamond of C. (c) The (2,1) diamond of C. (d) The (2,2) diamond (dark grey) is contained in the intersection (grey) of the (2,1) diamond and the (1,2) diamond of C. (e) Several lattice elements may be the same cube: (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2), (3,3) are all this cube D. [Figure bodies omitted.]
From Proposition 3.2 both diamonds are contained in the diamond ∆″ associated with
vector V⃗″, where (v″1, v″2, . . . , v″d) = (min(v1, v′1), min(v2, v′2), . . . , min(vd, v′d)). So ∆″ is
a lower bound. To show that it is the greatest lower bound, we need to show that any
other diamond that contains both ∆ and ∆′ is less than ∆″. Suppose there is a diamond
χ that also contains ∆ and ∆′ and further suppose that χ ≰ ∆″. The most restrictive
vector X⃗ = (x1, x2, . . . , xd) associated with χ has xi > v″i for some i. Without loss of
generality, v″i = vi and, since V⃗ is most restrictive, ∆ contains a slice whose σ is exactly
vi. But since xi > vi, this slice is not in χ, a contradiction. Therefore, χ ≤ ∆″ and every
pair of diamonds in L has an infimum.
The final step is to show that every pair of diamonds in L has a supremum. Let diamond
∆ be associated with vector V⃗ = (v1, v2, . . . , vd) and diamond ∆′ be associated with vector
V⃗′ = (v′1, v′2, . . . , v′d). From Proposition 3.2 both diamonds contain the diamond ∆″ associated
with vector V⃗″, where (v″1, v″2, . . . , v″d) = (max(v1, v′1), max(v2, v′2), . . . , max(vd, v′d)).
We need to show that any other diamond that is contained by both ∆ and ∆′ is greater than
∆″. Suppose there is a diamond χ that is also contained by ∆ and ∆′. Further suppose that
χ ≱ ∆″. The most restrictive vector X⃗ = (x1, x2, . . . , xd) associated with χ has xi < v″i
for some i. Without loss of generality, v″i = vi and χ contains a slice whose σ is less than
vi. This slice is not in ∆ since X⃗ is most restrictive, a contradiction. Therefore, χ ≥ ∆″
and every pair of diamonds in L has a supremum. The collection of distinct diamonds of
C forms a lattice.
The least element of the lattice, the (1, 1, . . . , 1, 1)-diamond, is the diamond where every
slice has at least one allocated cell, typically the initial cube. The greatest element is
the (m1, m2, . . . , md−1, md)-diamond, where each mi is ∏_{j, j≠i} (nj + 1), and is the empty
diamond.
It is useful to know the ranges of values that k1 , k2 , . . . , kd may take for any given cube.
When the range of values for ki is known, we can guarantee an empty diamond will result
from a query whose ki fall outside the range. Constraints on k when σ is COUNT are given
in the following propositions. Without loss of generality we assume that n1 ≤ n2 ≤ . . . ≤
nd−1 ≤ nd , and ni represents the number of non-empty slices in dimension Di .
By considering a perfect cube, the following is immediate.
Proposition 3.4. Given the sizes of dimensions of a COUNT-based diamond cube, ni,
an upper bound for the number of carats ki for dimension i of a non-empty subcube is
∏_{j=1, j≠i}^{d} nj. An upper bound on the number of carats k of a non-empty subcube is ∏_{i=1}^{d−1} ni.
An alternate (and trivial) upper bound on the number of carats in any dimension is |C|, the
number of allocated cells in the cube. For sparse cubes, this bound may be more useful
than that from Proposition 3.4.
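Both bounds are trivial to evaluate; a Python sketch (the function name is ours) simply takes the tighter of the two:

```python
# Upper bound on the COUNT-based carats k_i for dimension i: the smaller of
# the perfect-cube bound prod_{j != i} n_j (Proposition 3.4) and the trivial
# bound |C|, the number of allocated cells.
from math import prod

def carat_upper_bound(dim_sizes, i, num_allocated):
    perfect = prod(n for j, n in enumerate(dim_sizes) if j != i)
    return min(perfect, num_allocated)

# Sparse 100 x 100 x 100 cube with 500 allocated cells: |C| is tighter.
print(carat_upper_bound([100, 100, 100], 0, 500))     # 500
# Fully allocated 4 x 5 x 6 cube: the perfect-cube bound 5 * 6 is tighter.
print(carat_upper_bound([4, 5, 6], 0, 120))           # 30
```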
Intuitively, a cube with large carats needs to have many allocated cells: accordingly, the
next proposition provides a lower bound on the size of the cube given the number of carats.
Proposition 3.5. For d > 1 and σ = COUNT, the size |C| of a d-dimensional non-empty
cube C of k1, k2, . . . , kd carats satisfies

    |C| ≥ max{ki ni : i ∈ {1, 2, . . . , d}} ≥ (∏_{i=1}^{d} ki)^{1/(d−1)}.
Proof. For the first inequality we observe that for fixed i, C has precisely ni slices along
dimension i, each of which has at least ki cells, so |C| ≥ ki ni, whence |C| ≥ max{ki ni}.
For the second inequality, observe that by Proposition 3.4, ki ≤ ∏_{j≠i} nj. Therefore, we
have

    ∏_{i=1}^{d} ki = k1 ∏_{i=2}^{d} ki ≤ ∏_{i=2}^{d} ni ki ≤ [max_i (ni ki)]^{d−1}.
Theorem 3.1. For COUNT-based carats, if a cube C does not contain a non-empty k1, . . . , kd-
carat subcube, then |C| is at most 1 + Σ_{i=1}^{d} (ki − 1)(ni − 1). Hence, it has density at most
(1 + Σ_{i=1}^{d} (ki − 1)(ni − 1)) / ∏_{i=1}^{d} ni. In particular, a cube that does not contain a non-
empty k-carat subcube has size at most 1 + (k − 1) Σ_{i=1}^{d} (ni − 1) allocated cells and
containing at most k − 1 allocated cells. Remove it. This iterative process can continue at
most Σ_i (ni − 1) times, at which point there is at most one allocated cell left: hence, there
are at most (k − 1) Σ_i (ni − 1) + 1 allocated cells in total. The more general result follows
similarly.
Corollary 3.1. A cube with |C| > 1 + Σ_{i=1}^{d} (ki − 1)(ni − 1), that is, having density greater
than

    (1 + Σ_{i=1}^{d} (ki − 1)(ni − 1)) / ∏_{i=1}^{d} ni,

contains a non-empty k1, k2, . . . , kd-carat subcube.
Given a cube C and σ, κ(C) is the largest number of carats for which the cube has a non-
empty diamond. Intuitively, a small cube with many allocated cells should have a large
κ(C), and the following proposition makes this precise.
Proposition 3.6. For COUNT-based carats, we have κ(C) ≥ |C| / Σ_i (ni − 1) − 3.
Proof. Solving for k in Corollary 3.1, we have a lower bound on the maximal number of
carats:

    κ(C) ≥ |C| / Σ_i (ni − 1) − 3.
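This bound costs almost nothing to compute; a one-line Python sketch (the function name is ours):

```python
# The lower bound of Proposition 3.6 on kappa(C) for COUNT-based carats.

def kappa_lower_bound(dim_sizes, num_allocated):
    return num_allocated / sum(n - 1 for n in dim_sizes) - 3

# Fully allocated 10 x 10 cube: |C| = 100 and sum_i (n_i - 1) = 18, so
# kappa is at least 100/18 - 3, about 2.6 (the true kappa here is 10).
print(kappa_lower_bound([10, 10], 100))
```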
3.5 Properties of SUM-based Diamonds
For SUM-based diamonds, the goal is to capture a large fraction of the sum. The statistic,
κ(C), of a SUM -based diamond is the largest sum for which there exists a non-empty
diamond: every slice in every dimension has sum at least κ(C). Propositions 3.7 and 3.8
give tight lower and upper bounds respectively for κ.
Proposition 3.7. Given a non-empty cube C and the aggregator SUM, a tight lower bound
on κ is the value of the maximum cell (m).
Proof. The κ-diamond, by definition, is non-empty, so it follows that when the κ-diamond
comprises a single cell, then κ takes the value of the maximum cell in C. When the κ-
diamond contains more than a single cell, m is still a lower bound: κ is greater than
or equal to m.
Given only the size of a SUM -based diamond cube (in cells), there is no upper bound on
its number of SUM-carats. However, given its sum, say S, then it cannot have more than S
SUM -carats. We can determine a tight upper bound on κ(C) as the following proposition
shows.
Proposition 3.8. Given a non-empty cube C and the aggregator SUM, a tight upper bound
on κ(C) is mini (maxj (SUM(slicej (Di)))).
Proof. Let X = {slicej (Di) | SUM(slicej (Di)) = maxk (SUM(slicek (Di)))}, i ∈ {1, . . . , d};
then there is one slice x whose SUM (x) is smaller than or equal to all other slices in X.
Suppose κ(C) is greater than SUM(x) then it follows that all slices in this κ-diamond must
have SUM greater than SUM (x). However, x is taken from X, where each member is
the slice for which its SUM is maximum in its respective dimension, thereby creating a
contradiction. Such a diamond cannot exist. Therefore, mini (maxj (SUM(slicej (Di )))) is
an upper bound for κ. To show that mini (maxj (SUM(slicej (Di )))) is also a tight upper
bound we only need to consider a perfect cube where all measures are identical.
We can also provide a lower bound on the sum of a k-carat diamond as the next lemma
indicates.
Lemma 3.2. If the diamond of size s1 × s2 × · · · × sd has k-carats, then its sum is at least
k max(si ) i ∈ {1, 2, . . . , d}.
Proof. For fixed i each of the si slices along dimension i have sum at least k, from which
the result follows.
One statistic of a cube C is its carat-number, κ(C). Is this statistic robust? i.e. with
high probability, can changing a small fraction of the data set change the statistic much?
Of course, typical analyses are based on thresholds (e.g. applied to support and accuracy
in rule mining), and thus small changes to the cube may not always behave as desired.
Diamond dicing is no exception. For the cube C in Figure 3.7b and the statistic κ(C) we
see that diamond dicing is not robust against an adversary who can deallocate a single cell:
deallocation of the second cell on the top row means that the cube no longer contains a
COUNT -based diamond with 2 carats. This example can be generalised.
Proposition 3.9. Let σ be COUNT. There is a cube C from which deallocation of any b
cells results in a cube C′ with κ(C′) ≤ κ(C) − b/2.
Proof. Let C be a d-dimensional cube with ni = 2. Then κ(C) = 2^{d−1} = k and |C| = 2^d.
Deallocate b cells of C, producing the cube C′, so that |C′| = 2^d − b. Let κ(C′) = k′.
From Proposition 3.5 we have |C′| ≥ nd k′ = 2k′. Therefore, we have 2^d − b ≥ 2k′, which
implies k′ ≤ 2^{d−1} − b/2 = k − b/2.
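The cube used in this construction is easy to verify mechanically; a Python sketch (the representation is ours: a cell is a d-tuple over {0, 1}):

```python
# The cube in the proof: every dimension has n_i = 2 and all 2^d cells are
# allocated, so each of the 2d slices holds exactly 2^(d-1) cells.
from itertools import product

d = 4
cells = set(product((0, 1), repeat=d))       # all 2^d cells allocated

def slice_count(dim, value):
    return sum(1 for c in cells if c[dim] == value)

assert len(cells) == 2 ** d
assert all(slice_count(i, v) == 2 ** (d - 1) for i in range(d) for v in (0, 1))
print("every slice has", 2 ** (d - 1), "cells")
```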
Figure 3.7: (a) A 9-carat SUM-diamond. It is not robust against the deletion of a single cell. (b) An n × n cube with 2n allocated cells and a 2-carat diamond in the upper left. [Figure bodies omitted.]
Conversely, in Figure 3.7b we might allocate the cell above the bottom-right corner,
thereby obtaining a 2-carat diamond with all 2n + 1 cells. Compared to the original case,
we see that a small change effects a very different result.
The statistic κ(C) is no more robust when σ is SUM as the following example shows:
Example 3.1. Consider Figure 3.7a. By deallocating the second cell in the fourth row (5)
the 9-carat SUM-diamond becomes the 7-carat SUM-diamond and κ(C 0 ) < κ(C).
Conversely, one can add a single cell, with a value greater than the current κ(C), to the
cube C resulting in cube C 0 with κ(C 0 ) > κ(C).
Diamond dicing is not, in general, robust. However, it is perhaps more reasonable to follow
Pensa and Boulicaut’s work on Formal Concept Analysis [Pens 05] and ask whether κ
appears, experimentally, to be robust against random noise on realistic data sets. This is
explored experimentally in Subsection 5.10.
When we consider the SUM operator, the definition of a diamond cube is restricted to
cubes with all positive (resp. all negative) measures. If one allows negative and positive
measures in the same cube, there may no longer be a unique solution to the problem ‘find
col1 col2 col3 col4
row1 1 1 1 1
row2 2 -4 1 1
row3 1 1 1 1
row4 1 1 1 1
Intuitively, we can see that there may be cases when an attribute value is deleted early and,
as a result of a subsequent deletion, it should be reinstated into the diamond. Consider
Figure 3.11. If the rows are processed first, then row 1 and row 2 are deleted on the first
pass. When the columns are processed, column 1 is deleted, which should result in the
reinstatement of rows 1 and 2.
Data cubes, in general, are not restricted to positive (resp. negative) measures. Consider
Figure 3.12, a matrix representing people with items they wish to trade and items they
require. Negative measures indicate the number of items required and positive measures
the number of items available for trade. The Maximum Trade Problem (MTP) [Luo 94]
takes just such a matrix and seeks the largest set of trades possible.
An MTP Instance: A matrix A = [ai,j ] of size m×n where each ai,j is an integer—possibly
row col 1 col 2 col 3 col 4 col 5 col 6
1 -2 1 1 1 0 2
2 -30 4 1 0 1 0
3 2 2 4 0 2 1
4 0 2 3 1 0 2
Figure 3.11: There is no unique diamond for κ = 5. Rows 1 and 2 could be reinstated
after deletion of column 1.
Figure 3.12: An instance of MTP. All trades can be carried out if a broker collects all the
goods offered and distributes all the goods wanted. She would then reap a profit of one
puppy.
Proof. The first requirement is to show that kSCP belongs to the class NP. This is straight-
forward. If a cube A = [ai,j ], i = 1, . . . , m; j = 1, . . . , n defines a possible solution to the
problem, then checking that this satisfies the constraints:
• Σai,1 , Σai,2 , . . . , Σai,n ≥ k1 for all i = 1 . . . m
• m ≥ s1 and n ≥ s2
• s1 = p
• s2 = n
• k1 = −∞
• k2 = 0
Then MTP is equivalent to asking whether there is a 0, −∞-carat cube having at least size
(p, n). The constructed kSDP instance has answer YES if and only if the original MTP
instance has answer YES.
As stated in Section 3.2, we restrict diamonds to the cubes where all measures are positive
(resp. negative). We have proved a uniqueness property for diamonds and established
[Figure: a 5 × 5 matrix with mixed-sign entries; the row sums of the original matrix are shown at the left and the condition (column sum − k2) ≥ 0 is annotated along the bottom.]

    row sum  7 |  1  2  4  1 −1
    row sum  9 |  3  4 −1  2  1
    row sum  5 | −1  3  1  1  1
    row sum  9 |  1 −1  2  2  5
    row sum −1 |  1 −3  1 −1  1
    (column sum − k2) ≥ 0:  0  0  0  0  0
upper and lower bounds on κ(C). The logical next step is to find efficient algorithms to
compute diamonds. This is the topic of the next chapter.
Chapter 4
Algorithms
then remove the attribute value and its slice. The identification of “bad” attribute values is
done conservatively, in that they are known already to have a sum less than required (σ is
SUM ), or insufficient allocated cells (σ is COUNT ). When the algorithm terminates, only
attribute values that meet the condition in every slice remain: a diamond.
Algorithms based on this approach always terminate, though they might sometimes return
input: a d−dimensional data cube C, a monotonic linear operation σ and
k1 ≥ 0, k2 ≥ 0, . . . , kd ≥ 0
output: the diamond cube A
stable ← false
while ¬stable do
// major iteration
stable ← true
for dim ∈ {1, . . . , d} do
for i in all attribute values of dimension dim do
Cdim,i ← σ(slice i on dimension dim)
if Cdim,i < kdim then
delete attribute value i
stable ← false
end
end
end
end
return cube without deleted attribute values;
Algorithm DD: Algorithm to compute the diamond of any given cube by deleting
slices eagerly.
an empty cube. By specifying how to compute and maintain sums for each attribute value
in every dimension we obtain different variations. The correctness of any such variation
is guaranteed by the following result.
Theorem 4.1. Algorithm DD is correct, that is, it always returns the k1 , k2 , . . . , kd -carat
diamond.
Proof. Because the diamond is unique, we need only show that the result of the algorithm,
the cube A, is a diamond. If the result is not the empty cube, then dimension Di has at
least value ki per slice, and hence it has ki carats. We only need to show that the result of
Algorithm DD is maximal: there does not exist a larger k1 , k2 , . . . , kd -carat cube.
Suppose A′ is such a larger k1 , k2 , . . . , kd -carat cube. Because Algorithm DD begins with
the whole cube C, there must be a first time when one of the attribute values of dimension
dim of C belonging to A′ but not A is deleted. At the time of deletion, this attribute's
slice cannot have obtained more cells so it still has value less than kdim . Let C′ be the cube
at the instant before the attribute is deleted, with all attribute values deleted so far. We see
that C′ is larger than or equal to A′ and therefore, slices in C′ corresponding to attribute
values of A′ along dimension dim must have value at least kdim . Therefore, we have a
contradiction and must conclude that A′ does not exist and that A is maximal.
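To make the deletion procedure concrete, the following is a minimal Python sketch of Algorithm DD (the experimental implementations are in Java; this sketch assumes σ = COUNT and represents a cube as a set of d-tuples of attribute values):

```python
from collections import defaultdict

def diamond_dice(cells, d, ks):
    """Sketch of Algorithm DD with sigma = COUNT: repeatedly delete every
    slice whose number of allocated cells falls below k_dim, until stable."""
    cells = set(cells)  # a cell is a d-tuple of attribute values
    stable = False
    while not stable:  # major iteration
        stable = True
        for dim in range(d):
            slices = defaultdict(set)  # attribute value -> cells of its slice
            for c in cells:
                slices[c[dim]].add(c)
            for value, slice_cells in slices.items():
                if len(slice_cells) < ks[dim]:
                    cells -= slice_cells  # delete attribute value and slice
                    stable = False
    return cells
```

On a cube shaped like Figure 4.4 with k1 = k2 = 2, only the dense upper-left block survives.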
Implementations of Algorithm DD, and variations that can process data sets of an unre-
stricted size, are discussed in the next two sections. The chapter concludes with three
alternate approaches to determine κ(C).
When the volume of the cube to be processed fits into main memory, there are some
implementation choices to be made:
cell; a ‘0’ an unallocated cell [Albe 06]. An issue that hinders performance of an
array-based implementation is the number of zeros that are stored in sparse cubes.
Typically data cubes are not dense; the large data cubes used in our experiments had
densities of between 1.6 × 10^−23 and 5.4 × 10^−6 .
The overall performance of the array-based implementation was poor, taking hours
to process a moderately-sized 4-dimensional cube of two thousand allocated cells
with volume 102 608. Since every iteration runs in Ω(∏i ni ) time—every potential
cell is counted in each iteration—it is very inefficient. It might work well on more
dense cubes, but this approach was not explored further, because of the inefficiency
and the fact that we wanted to process large sparse data sets.
• A multi-list [Knut 97] is a more efficient data structure for an in-memory implemen-
tation. It is a multiply linked list whose nodes belong to several lists, one per dimen-
sion. Each allocated cell is aware of all the lists it belongs to, as in Example 4.2.
Only allocated cells are stored, removing the main obstacle to good performance
of the array-based implementation. A multi-list was used in the implementation of
the queue-based Algorithm QDD. Effectively, in each outermost loop, slices to be
deleted are en-queued until all dimensions have been processed. Then, the queue is
completely processed, and the next iteration of the outermost loop can begin. Let
I be the number of iterations until the diamond stabilises. Algorithm QDD runs in
time O(I|C|d); for each iteration we examine the remaining attributes in every di-
mension d. The SUMS can be maintained in a multidimensional array assuring a
bound of O(1) for verifying σ(slice) and since order of insertion or deletion is not
important we can assume a bound of O(1) for queue operations.
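The queue-based variation can be sketched similarly (a simplified in-memory Python rendering with σ = COUNT; for brevity, slice membership is recomputed by scanning, whereas the multi-list implementation follows links in constant time per cell):

```python
from collections import deque, defaultdict

def qdd(cells, d, ks):
    """Sketch of queue-based Algorithm QDD with sigma = COUNT: slices whose
    count drops below k_dim are en-queued; processing a queued slice deletes
    its cells and decrements the counts of every intersecting slice."""
    cells = set(cells)
    counts = [defaultdict(int) for _ in range(d)]
    for c in cells:
        for dim in range(d):
            counts[dim][c[dim]] += 1
    queue = deque((dim, v) for dim in range(d)
                  for v, n in counts[dim].items() if n < ks[dim])
    queued = set(queue)  # each slice is en-queued (and deleted) at most once
    while queue:
        dim, v = queue.popleft()
        doomed = [c for c in cells if c[dim] == v]
        for c in doomed:
            cells.discard(c)
            for j in range(d):  # update the counts of intersecting slices
                counts[j][c[j]] -= 1
                if counts[j][c[j]] < ks[j] and (j, c[j]) not in queued:
                    queue.append((j, c[j]))
                    queued.add((j, c[j]))
    return cells
```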
4.2 External-Memory Algorithms
The size of available memory affects the capacity of in-memory data structures to repre-
sent data cubes. In our experiments we restricted the size of available memory to less than
1 GiB. It is, of course, possible to associate a region of memory with a file, which can be
bigger than RAM, and thus implement a data structure that uses more space than the avail-
able RAM. Since some of the data files we wish to process are very large—over 20 GiB—
the execution speed of such an algorithm may still be very slow and careful attention must
be paid to data access patterns [Coze 04]. The data transfer rate of external disks—in
particular, flash memory Solid State Drives (SSD)—is getting closer to that of main mem-
ory [Lee 09]. However, flash media blocks must be erased before they can be reused so
random write access is considerably more expensive than a read operation [Agra 08].
Because a memory-mapped data structure may require random writes, such an implemen-
tation on an SSD might not provide the most efficient solution. A simpler and more natural
extension, which would enable larger cubes to be processed, is to find an efficient exter-
nal memory implementation. The data cube can be stored in an external file whilst the
important COUNTS (resp. SUMS) are maintained in memory.
Algorithm DD visits each dimension in sequence until it stabilises. Ideally, the stabili-
sation should occur after as few iterations as possible. Let I be the number of iterations
through the input file till convergence; i.e. no more deletions are done. Value I is data
dependent and (by Figure 4.4) is Θ(∑i ni ) in the worst case. In practice, I is not ex-
pected to be nearly so large, and working with large real data sets I did not exceed 56.
Algorithms DD and QDD run in time O(Id|C|), providing that we can iterate through and
delete all cells in a slice, each in constant time per cell.
Experimentally, it appears that the relationship of I to k is bitonic, i.e. it is non-decreasing
to κ + 1 and non-increasing thereafter. Is this always the case? Unfortunately, there is at
least one cube for which this is not the case. Figure 4.2 illustrates such a cube. On the
first iteration, processing columns first for the 2-carat diamond, a single cell is deleted. On
input: file inFile containing d−dimensional cube C, integers k1 , k2 . . . kd > 0, a
monotonic linear operation σ
// preprocessing scan computes σ values for each slice
foreach dimension i do
Create hash table hti
foreach attribute value v in dimension i do
if σ( slice for value v of dimension i in C) ≥ ki then
hti (v) = σ( slice for value v of dimension i in C)
stable ← true
foreach row r of inFile do
(v1 , v2 , . . . , vd ) ← r
notDeleted ← true
for i ∈ {1, . . . , d} do
if vi ∉ dom hti then
for j ∈ {1, . . . , i − 1, i + 1, . . . , d} do
if vj ∈ dom htj then
htj (vj ) = htj (vj ) − σ({r})
if htj (vj ) < kj then
remove vj from dom htj
end
end
end
stable ← false
notDeleted ← false
break
// only delete this cell once
end
end
if notDeleted then
write r to outFile
end
end
end
if ¬stable then
inFile ← outFile // prepare for another iteration
end
return outFile
Algorithm SIDD: Diamond dicing for relationally stored cubes. With each iteration,
less data is processed.
× × ×
× × ×
× × ×
× ×
× ×
× ×
... ...
× ×
×
Figure 4.2: The 2-carat diamond requires more iterations to converge than the 3-carat
diamond.
subsequent iterations at most two cells are deleted until convergence. The 3-carat diamond
converges after a single iteration. The value of k relative to κ does, however, influence I.
Typically, when k is far from κ—either less or greater—fewer iterations are required to
converge. However, when k exceeds κ by a very small amount, say 1, then typically a
great many more iterations are required to converge to the empty cube.
Algorithm DD checks the σ-value for each attribute on every iteration. Calculating this
value directly, from a data cube too large to store in main memory, would entail many
expensive disk accesses.
The first external-memory solution, Algorithm Shrinking-Input Diamond Dice ( SIDD),
employs d hash tables that map attributes to their σ-values. A preprocessing step iterates
once over the input file creating the d hash tables. When σ = COUNT , the σ-values for
each dimension form a histogram, which might be precomputed in a DBMS.
If the cardinality of any of the dimensions is such that hash tables cannot be stored in
main memory, then a file-based set of maps could be constructed. However, given a
d-dimensional cube, there are only ∑i ni slices and so the memory usage is O(∑i ni ).
Algorithm SIDD reads and writes the files sequentially from and to disk and does not
require potentially expensive random access, making it a candidate for a data-parallel im-
plementation. In the data-parallel model, multiple processors execute the same commands
on a partition of the data. In this case, the cube could be partitioned with each processor
having access to a shared set of hash tables.
Algorithm SIDD runs in time O(Id|C|); each attribute value is deleted at most once. We
have I iterations of the outermost loop. At each iteration, the body of the j loop is executed
at most |C| times. Therefore, we examine O(|C|) records each of which has d fields. We
assume an amortized bound of O(1) for each hash table look-up or modification, and
expect an implementation using the Hashtable class from the Java API to have such a
bound on hash table operations. Often, the input file decreases substantially in the first
few iterations and those cubes are processed faster than the overall bound suggests. The
more carats we seek, the faster the file decreases initially.
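A simplified, in-memory Python rendering of one SIDD pass may clarify the bookkeeping (σ = COUNT is assumed; plain dictionaries stand in for the hash tables, and lists of tuples for the input and output files):

```python
def sidd_pass(rows, d, ks):
    """One shrinking-input pass (sketch of Algorithm SIDD, sigma = COUNT).
    A preprocessing scan builds per-dimension slice counts; rows whose
    attributes all survive are written to the next iteration's input."""
    ht = [dict() for _ in range(d)]
    for r in rows:  # preprocessing scan: count cells per slice
        for i in range(d):
            ht[i][r[i]] = ht[i].get(r[i], 0) + 1
    for i in range(d):  # drop attribute values already below k_i
        ht[i] = {v: c for v, c in ht[i].items() if c >= ks[i]}
    out, stable = [], True
    for r in rows:
        if all(r[i] in ht[i] for i in range(d)):
            out.append(r)  # cell survives this pass
        else:
            stable = False  # cell deleted: decrement the surviving slices
            for j in range(d):
                if r[j] in ht[j]:
                    ht[j][r[j]] -= 1
                    if ht[j][r[j]] < ks[j]:
                        del ht[j][r[j]]
    return out, stable
```

Repeating sidd_pass on its own output until stable is True yields the diamond.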
4.2.2 Normalisation
A major cost of executing any external-memory algorithm is the time spent accessing
data on disk. It is prudent, therefore, to reduce the number of I/O operations as much
as possible: one way this can be achieved is to store the data cube as normalised binary
integers using bit compaction [Ng 97]—mapping strings to small integers starting at zero.
Recall, of course, that there are a finite number of values that a 32-bit integer can take. One
could use 64-bit integers, thereby increasing the potential size of each dimension in the
data cube. However, the overall file size may not be decreased and unless the dimension
sizes warrant it, a 32-bit integer is preferred as it consumes 50% less memory.
A file storing binary data may be a fraction of the size of a file storing textual data. For
example, given a comma-separated file where each three-dimensional record can be repre-
sented in 20 bytes, the same data when normalised requires only 12 bytes. This represents
a space-saving of up to 40% and an associated reduction in the number of disk accesses.
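The normalisation step can be sketched as follows (an illustrative Python fragment; the per-dimension dictionaries mapping strings to codes are an assumption about the implementation, and each code is packed as a little-endian 32-bit integer):

```python
import struct

def normalise(rows):
    """Bit compaction: map each attribute string to a small integer starting
    at zero, per dimension, and pack every record as 32-bit integers."""
    maps = [dict() for _ in rows[0]]  # one string-to-code map per dimension
    packed = bytearray()
    for row in rows:
        codes = [maps[i].setdefault(v, len(maps[i])) for i, v in enumerate(row)]
        packed += struct.pack('<%di' % len(codes), *codes)
    return bytes(packed), maps
```

Each three-dimensional record then occupies exactly 12 bytes, as in the example above.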
With this in mind, a variation on Algorithm SIDD was implemented, one that takes a bi-
nary file of packed integers as its input. Consequently, the internal data structure used to
maintain the list of SUMS for each attribute value is a set of ragged arrays. By normalis-
ing the data the file sizes are reduced and expensive string manipulation is eliminated, thus
giving an improvement in performance.
The speed of convergence of Algorithm SIDD and indeed the size of an eventual diamond
may depend on the data-distribution skew. Cell allocation in data cubes is very skewed
and frequently follows Zipfian distributions [Nade 03]. Suppose the number of allocated
cells Cdim,i in a given slice i follows a Zipfian distribution: P (Cdim,i = j) ∝ j −s for
s > 1. The parameter s is indicative of the skew. We then have that P (Cdim,i < ki ) =
(∑j=1..ki −1 j −s )/(∑j=1..∞ j −s ) = Pki ,s . The expected number of slices marked for deletion after
one pass over all dimensions, prior to any slice deletion, is thus ∑i ni Pki ,s . This quantity
grows fast to ∑i ni (all slices marked for deletion) as s grows. (See Figure 4.5.) For
SUM -based diamonds, we not only have the skew of the cell allocation, but also the skew
of the measures to accelerate convergence. In other words, we expect Algorithm SIDD to
converge quickly over real data sets, but more slowly over synthetic cubes generated using
uniform distributions.
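The expected-deletion computation can be reproduced directly (a Python sketch; the infinite zeta sum is truncated, which is an approximation):

```python
def p_below(k, s, terms=10000):
    """P(C < k) when slice counts follow P(C = j) proportional to j^-s:
    the ratio of the partial sum to a (truncated) zeta sum."""
    return sum(j ** -s for j in range(1, k)) / sum(j ** -s for j in range(1, terms))

def expected_marked(ns, ks, s):
    """Expected number of slices marked for deletion after one pass, prior
    to any slice deletion: the sum over dimensions of n_i * P_{k_i, s}."""
    return sum(n * p_below(k, s) for n, k in zip(ns, ks))
```

As the text predicts, p_below grows towards 1 as the skew parameter s grows.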
The related work of Kumar et al. [Kuma 99] models the Web as a large directed graph.
They suggest that large bipartite graphs found within the Web characterise emerging com-
munities. In their model an emerging community has an (i, j) core which comprises sets
of i web pages each linking to a set of j pages (“fans”) and sets of j pages all of which
have i links to them (“centres”). Their first step is to find these cores—bicliques—and
then use them to discover the rest of the web community, the bipartite graph. They discuss
two different algorithms, both employing a sorting step, with which to prune away nodes
that cannot be part of a core. A variation of their first algorithm has been implemented
cell productID salesPersonID month
i 1 a Jan
ii 1 b Jan
iii 2 c Feb
iv 2 d Mar
v 3 a Feb
vi 3 b Feb
vii 3 d Apr
viii 4 c Apr
ix 4 e Jan
Figure 4.3: A fact table representing a three-dimensional data cube.
and extended to d dimensions. It seeks a bipartite graph directly and its pseudocode is
given in Algorithm Iterative-Pruning Diamond Dice ( IPDD). Algorithm IPDD runs in
time O(I|C|d2 ). We have I iterations of the outermost loop. At each iteration we process
d files of O(|C|) records and each record has d items. We assume an amortized bound of
O(1) for each hash table look-up or insertion.
The second algorithm discussed by Kumar et al., one they refer to as the inclusion-
exclusion algorithm, is unsuitable for use as a diamond builder since they require that
fans are related to exactly i centres and that centres are related to exactly j fans. This does
not conform to the definition of a diamond cube which allows for attributes to be related
to k or more attributes in all other dimensions.
A data cube can be represented in a relational database by a fact table. (See Figure 4.3.)
In this example the dimensions are: productID {1,2,3,4}, salesPersonID {a,b,c,d,e} and
month {Jan,Feb,Mar,Apr}. The cell numbers are noted in this example, but they do not
form part of the cube. Formulating a diamond cube query in SQL is challenging since
examples can be constructed where only one attribute ever has σ(slicedim,i ) < k, but its
removal exposes another attribute, and so on. See Figure 4.4 and consider computing the
input: d files each sorted on one of the dimensions, containing a d−dimensional
cube C, integers k1 , k2 . . . kd > 0, empty hash sets s1 , s2 , . . . , sd
output: the diamond data cube
stable ← false
while ¬stable do
stable ← true
for i ∈ {1, . . . , d} do
file ← sortedfilei
r ← current row
(attribute1 , attribute2 , . . . , attributed , val) ← r
while ¬EOF do
fi ← attributei
sum ← 0
repeat
// traverse slice fi in dimension i
deleted ← false
for j ∈ {1, . . . , d} do
if attributej ∈ dom sj then
// this attribute was deleted previously
deleted ← true
end
end
if ¬deleted then
// only consider the undeleted attributes
sum ← sum + val
end
r ← next row
(attribute1 , attribute2 , . . . , attributed , val) ← r
fieldi ← attributei
until fieldi ≠ fi or EOF
if sum < ki and fi ∉ dom si then
si add fi
stable ← false
end
end
end
end
// All deleted attributes are now recorded in the hash sets. The diamond
// is extracted with a simple file scan and check.
Algorithm IPDD: Iterative pruning algorithm due to Kumar [Kuma 99]. Requires d
copies of the data sorted in each of d dimensions.
× ×
× ×
× ×
× ×
... ...
× ×
× ×
Figure 4.4: An n × n cube with 2n allocated cells (each indicated by a ×) and a 2-carat
diamond in the upper left: it is a difficult case for several algorithms.
2-carat diamond from the cube in Figure 4.3. Every dimension must be examined three
times to determine that cells i, ii, v and vi comprise the 2-carat diamond.
Using nested queries and joins, we could essentially simulate a fixed number of iterations
of the outer loop in Algorithm DD. Unfortunately, we cannot bound the number of itera-
tions for that loop by a constant, leaving our pure-SQL approach as merely approximating
the diamond. In Figure 4.4, we see an example where the number of iterations I = Ω(n)
and stopping after o(n) iterations results in a poor approximation with Θ(n) allocated cells
and attribute values—whereas the true 2-diamond has 4 attribute values and 4 allocated
cells.
Algorithm SQLDD runs in time O(Id|C|), assuming hash indices permitting O(1)-time
checks on any temporary table tempi . In the event that a B-tree index is used, the running
time of Algorithm SQLDD would increase to O(Id|C| logb (|nd |)) where b is the branching
factor of the tree. Often, the relation R decreases substantially in the first few iterations,
and those cubes are processed faster than this bound suggests.
We express the essential calculation in core SQL, as Algorithm SQLDD, but this calcu-
lation is repeated by an “outermost loop” outside of SQL1 . Algorithm SQLDD can be
executed against a copy of the fact table, which becomes smaller as the algorithm pro-
gresses. An alternate implementation, one that avoids the expensive DELETE operation,
1 SQL-2003 supports looping and recursion [?]; thus the entire algorithm could be implemented in SQL,
for those DBMSs that actually implement enough of SQL-2003.
INPUT: a d−dimensional data cube C and k > 0
OUTPUT: the diamond A
initialise R to C, the fact table
repeat {major iteration}
execute the fragment of SQL pseudocode shown below
until no records were deleted from R
return R as A
...
CREATE TABLE temp d AS
(SELECT dim d FROM R
GROUP BY dim d HAVING σ(measure) < k);
DELETE FROM R
WHERE dim 1 IN (SELECT * FROM temp 1 ) OR . . .
dim d IN (SELECT * FROM temp d );
Algorithm SQLDD: Variation where the inner two loops in Algorithm DD are (al-
most) computed in SQL (Indices on the temporary relations are assumed.) This
process can be repeated until R stabilises.
is to supplement the fact table with a boolean-valued column χ that indicates whether this
fact is in the diamond cube. If an index is created for each column, excluding χ, the dia-
mond cube computation can be sped up. In this new context, deletion of a slice is simply
of the form “UPDATE facttable SET χ = 0 WHERE DIM = i”. Recreating a smaller fact
table after the first few passes may be used to further speed up the query since we expect
the fraction of slices marked for deletion to be high initially. To determine the exact com-
putational complexity of these SQL queries, we need to specify what type of index is used.
Ideally, the indexes should have the performance of a hash table, i.e. with constant time
look-up. When k1 = k2 = . . . = kd , we can replace χ with an integer-valued column that
indicates the largest k for which the cell belongs to the k-carat cube, since each cell belongs
to a single smallest diamond cube. See Table 4.1 as an example. It is impractical to maintain
this kind of information for every possible combination of k1 , k2 , . . . , kd .
Table 4.1: Fact table with a column indicating the maximum number of carats for the
corresponding cell in Table 2.2b.
Four different versions of Algorithm SQLDD were implemented and tested using MySQL
5.0. The only indexing option available for our large cubes was a B-tree, so for each
version B-tree indexes were built on all attribute values, except for column χ. (Code is
given in Appendix B.1.)
• The first (SQL1) deleted slices from a copy of the data cube whenever the COUNT
(resp. SUM ) of a slice fell below the required number of carats.
• The second (SQL2) did not delete any information from the data cube, but instead
updated a boolean valued-attribute.
• The third version (SQL3) updated a boolean-valued attribute and also rebuilt the
data cube when the number of cells marked for deletion was at least 50% of the
original.
• The fourth version (SQL4) updated a boolean-valued attribute and rebuilt the data
cube whenever 75% of the remaining cells were marked for deletion.
The determination of κ(C), the largest value of k for which C has a non-trivial diamond,
is a special case of the computation of the diamond-cube lattice. (See Proposition 3.2.)
Identifying κ(C) may help guide analysis. Once κ is determined we can remove the κ-
diamond, then seek the κ-1 diamond, then seek the κ−2 diamond and so on, until we have
a set of dense clusters, which can be used for further statistical analysis as suggested by
Barbará [Barb 01]. We assume an integer value for κ(C). In the case of SUM -diamonds,
where κ may not be an integer, we compute bκ(C)c. Three approaches to finding κ were
identified:
1. Set the parameter k to 1 + the lower bound, provided by Proposition 3.6 (or Theo-
rem 6.2) for COUNT -diamonds or Proposition 3.7 for SUM -diamonds, and check
whether there is a diamond with k carats. Repeat, incrementing k, until an empty
cube results. At each step, Proposition 3.2 says the computation can start from the
cube from the previous
iteration, rather than from C. If the lower bound is very far from κ this results in a
long, slow computation and, therefore, may not be the best approach.
2. Observe that κ(C) is in a finite interval. For COUNT -diamonds, we have a lower
bound from Proposition 3.6 and an upper bound ∏i=1..d−1 ni or |C|. (If this upper
bound is unreasonably large, the computation could start with the lower bound and
repeatedly double it.) For SUM -diamonds, we have a lower bound from Proposi-
tion 3.7 and an upper bound from Proposition 3.8. Execute the diamond-dicing
algorithm and set k to a value determined by a binary search over its valid range.
Every time a new midpoint k is tested, the computation can begin from the largest
nonempty k-diamond that has been seen so far. (See Proposition 3.2 and note that
Algorithm SIDD does not destroy its input.)
The second approach is better. Let us compare one iteration of the first approach
(which begins with a k-carat diamond and seeks a k + 1-carat diamond) and a com-
parable iteration of the second approach (which begins with a k-carat diamond and
seeks a (k + kupper )/2-carat diamond). Both will end up making at least one scan,
and probably several more, through the k-carat diamond. Binary search, when it
makes an unsuccessful attempt while narrowing in on κ(C), may result in fewer
scans since, experimentally, we observed that processing a k-carat diamond for val-
ues of k much larger than κ(C) typically requires fewer scans. Furthermore, a binary
search makes fewer attempts overall unless κ(C) lies within log(upper − lower) of
the lower bound. Binary search is recommended, given that it will find κ(C) in
O(log κ(C)) iterations.
3. Our third approach is one adapted from a numerical technique. Although function
dependent, it has been shown to exhibit a convergence rate of log(log(n)) over some
continuous and monotonically increasing functions [Arme 85]. It is not normally
used for searching memory-resident data since the calculations are more costly and
complex than the bit-shift required for binary search. However, since the major
time cost of searching for κ occurs during diamond-building, any method that could
reduce the number of diamonds generated is likely to be a good choice. The Pegasus
method, first introduced by Dowell and Jarratt [Dowe 72], is a modified Regula-Falsi
method for finding a simple root of a continuous function and has been adapted to
the discrete case [Stra 82]. The new guess for κ, x3 , is generated by equation 4.1,
where x2 is the most recent guess, x1 is the guess before that, and f1 and f2 are the
file sizes of the x1 and x2 diamonds.
When the most recent guess for κ falls on the same side of the root as the previous
guess, the value for f1 is modified using equation 4.2.
The Pegasus method assumes a single zero on the interval to be searched, but we
want to find the point at which the function first becomes zero (the k-value of the
first empty diamond). Applying Equation 4.1 and Equation 4.2 will not modify the
value of x3 : we have already found a zero—the empty cube at the upper end of the
search interval. Therefore, we must adjust f1 or f2 to force a change of value for x3 .
The choices experimented with were values close to zero, −f1 , and much smaller
than −f1 . The results are discussed in Chapter 5.
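The binary search of the second approach can be sketched as follows (a Python outline; compute_diamond is assumed to wrap the diamond-dicing algorithm, returning the k-carat diamond, possibly empty, and may start from the largest non-empty diamond seen so far, per Proposition 3.2):

```python
def find_kappa(compute_diamond, lower, upper):
    """Binary search for kappa(C) over [lower, upper]: the largest k for
    which the k-carat diamond is non-empty."""
    # invariant: the diamond at `lower` is non-empty; above `upper`, empty
    while lower < upper:
        mid = (lower + upper + 1) // 2
        if compute_diamond(mid):   # non-empty: kappa >= mid
            lower = mid
        else:                      # empty: kappa < mid
            upper = mid - 1
    return lower
```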
Chapter 5
Experiments
• diamonds can be computed efficiently, i.e. within a few minutes and using less than
1 GiB of memory even for very large data sets.
We implemented the three different strategies for finding κ(C) as described in Section 4.3.
The lower bound established in Proposition 3.6 was used as an initial guess for κ(C) and
in every case it returned a non-empty diamond.
A comparison of the algorithms presented in Chapter 4 illustrates that the customised
Java code computes diamonds more than fifteen times faster than either the sorting-based
approach of Algorithm IPDD, or the SQL-based approach of Algorithm SQLDD on cubes
of any significant size. Since it is difficult to express the diamond dice query in SQL, and
our Java implementation was faster, we recommend the addition of the diamond dice as a
primitive to a future version of SQL.
We discovered that the κ-diamond captured a large fraction of some of our cubes. There-
fore, we designed experiments to determine the relationship (if any) between the number
of dimensions and the size of the κ-diamond. The results indicate that the distribution of
the data has more effect on the size of a diamond than the number of dimensions.
This chapter concludes with some experiments designed to test the robustness of κ(C).
They were carried out on real-data sets and a synthetically generated cube whose attributes
follow a Zipfian distribution. From these experiments, we find that κ(C) is robust against
modest amounts of missing data.
All experiments were carried out on an Apple Mac Pro with two double-core Intel Xeon
processors (2.66 GHz) and 2 GiB of RAM. Experiments used a 500 GiB SATA Hitachi
disk (model HDP725050GLA360 [Hita 09, Dell 09]), with average seek time (to read)
of 14 ms, average rotational latency of 4.2 ms, and capability for sustained transfers at
300 MiB/s. This disk also has an on-board cache size of 16 MiB, and is formatted for the
Mac OS Extended filesystem (journaled). The software runs on Mac OS 10.4.
5.2 Implementations
An implementation of Algorithm DD was validated on several small cubes for which the
solution could be found by hand. It was then applied to a couple of cubes extracted from
TWEED [Enge 07], a small real data set. These cubes became our unit test data and each
algorithm implementation was validated against them. The code is archived at my personal
website [Webb 09].
Algorithms SIDD, IPDD and SQLDD were implemented in Java using SDK version 1.5.0.
At run-time the -server flag was set; it increased the memory usage to 192 MB (as reported
by the jstat utility) and turned
Table 5.1: Cube K2: the number of duplicates was recorded and used as the measure for
the SUM-diamond.
(a) Raw data
aaron aaron abihu aaron
aaron aaron abihu eleazar
aaron aaron abihu eleazar
aaron aaron abihu eleazar
aaron aaron abihu ithamar
aaron aaron abihu ithamar
aaron aaron abihu ithamar
aaron aaron accept befor
aaron aaron acept lord
(b) Cube K2 data
aaron aaron abihu aaron 1
aaron aaron abihu eleazar 3
aaron aaron abihu ithamar 3
aaron aaron accept befor 1
aaron aaron acept lord 1
A varied selection of freely-available real-data sets together with some systematically gen-
erated synthetic data sets were used in the experiments. Each data set had a particular char-
acteristic: a few dimensions or many, dimensions with high or low cardinality or a mix
of the two, small or large number of cells. They were chosen to illustrate that diamond
dicing is tractable under varied conditions and on many different types of data.
Brief descriptions of how each data cube was extracted from the raw data follow: the
complete details are given in Appendix A. In every case duplicates were eliminated from
the resulting cube by aggregating the measure. Table 5.1 shows the “before” and “after”
view of a subset of one of our cubes, K2.
5.3.1 Real Data
Six of the seven real-data sets were downloaded from the following sources.
4. Census 1881: http://www.prdh.umontreal.ca/census/en/main.aspx
The seventh data set was generated from the King James version of the Bible available at
Project Gutenberg [Proj 09].
TWEED
The TWEED [Enge 07] data set contains over 11 000 records of individual events related
to internal terrorism in 18 countries in Western Europe between 1950 and 2004. Of the
52 attributes in the TWEED data, 37 were regarded as measures since they recorded the
number of people killed and injured in each of the terrorist attacks. For instance, a record
might indicate that 5 people were killed and 10 people injured, of whom 2 were police, 11
were civilians and 2 were other militants. For our purposes these measures were aggre-
gated and thus the measure associated with this record in our data cube would be 15. Cube
TW1 retained dimensions Country, Year, Action and Target with cardinalities of 16 × 53
× 11 × 11. For cubes TW2 and TW3 all fifteen dimensions not deemed measures were
retained. Cardinalities of the dimensions ranged from 3 to 284. Table 5.2 gives statistics
of the TWEED data and full details can be found in Table A.2.
Table 5.2: Statistics of TWEED, Netflix and census data.
cube dimensions |C| ∑i ni measure
TW1 4 1 957 88 count
TW2 15 4 963 674 count
TW3 15 4 963 674 sum killed
NF1 3 100 478 158 500 137 count
NF2 3 100 478 158 500 137 sum rating
NF3 4 20 000 000 473 753 count
C1 27 135 753 5 607 count
C2 27 135 753 504 sum stocks
C3 27 135 753 504 sum wages
C4 7 4 277 807 343 442 count
Netflix
At approximately 2 GiB, the Netflix [Netf 07] data set is the largest openly-available
movie-rating database. It comprises 4 dimensions: MovieID × UserID × Date × Rating.
The data can be downloaded as a set of 17 770 text files in a tar archive. Each file contains
between 4 and 232 945 ratings for a single movie, attributed to distinct reviewers. The files
were amalgamated into a single comma-separated file by executing a purpose-built Java
application. Two three-dimensional cubes NF1 and NF2, both with about 10^8 allocated
cells, were extracted using dimensions MovieID, UserID and Date. The measure, Rating,
together with the aggregator, SUM, was used for NF2, whereas the COUNT aggregator was
used for NF1. A third cube—NF3—was extracted from the Netflix data when we discov-
ered that the SQL algorithm was taking a very long time (over 24 hours) to run. NF3 has
four dimensions and it comprises the first 20 million records of the data. The aggregator
COUNT was used for NF3.
Census Income
The third real-data set, Census-Income, comes from the UCI KDD Archive [Hett 00]. The
cardinalities of the dimensions ranged from 2 to 91 and there were 135 753 records. The
original 41 dimensions were rolled-up to 27 and three cubes were extracted. The COUNT
Table 5.3: Statistics of forest, weather and King James Bible data.
cube dimensions |C| ∑i ni measure
F1 11 580 950 15 923 count
W1 12 123 245 878 50 320 count
W2 12 123 245 878 50 320 sum total cloud cover
W3 12 1 002 751 18 597 count
K1 4 363 412 308 33 588 count
K2 4 363 412 308 33 588 sum 4-gram instances
K3 10 986 145 104 1 558 count
K4 10 986 145 104 1 558 sum 10-gram instances
K5 4 24 000 000 8 270 count
K6 4 32 000 000 9 141 count
K7 4 40 000 000 9 561 count
K8 4 48 000 000 9 599 count
aggregator was used for C1. Two fields, income from stocks (C2) and hourly wage (C3)
were the measures for the aggregator SUM . The query used to generate cube C2 and
details of the original data are given in Appendix A.1 and Appendix B. This data set has
been widely used [Ben 06b, Kase 08, Lemi 10, Aoui 07].
Census 1881
The Census 1881 data set came from a publicly available SPSS file 1881_sept2008_SPSS.rar
that was converted to a flat file [Lemi 09]. In the process the special values “ditto” and
“do” were replaced by the repeated value and all commas within values were deleted.
Cube C4 has seven dimensions ranging in cardinality from 138 to 152 882. It has over 4
million records.
King James Bible

Verses are indicated in the text with digit:digit at the beginning of a line, and paragraphs are delimited by empty lines.
Occurrence of row w1, w2, w3, w4 indicates that the first paragraph of a verse contains words w1 through w4, in this order. This data is a scaled-up version of word co-occurrence
cubes used to study analogies in natural language [Turn 05, Kase 06]. K1 was extracted
from KJV-4grams. Four subcubes of K1 were also processed: K5: first 24 000 000 rows;
K6: first 32 000 000 rows; K7: first 40 000 000 rows and K8: first 48 000 000 rows.
KJV-10grams has similar properties to KJV-4grams, except that there are 10 words in
each row and the process of creating KJV-10grams was terminated when 1 billion records
had been generated—at the end of Genesis 3:15. Cube K3 was extracted from KJV-
10grams. Duplicate records were removed using the post-processing step as described
in Appendix A.4. A count of each unique sequence was kept during post-processing and
became the measure for cubes K2 and K4. The statistics for all eight cubes are given in
Table 5.3.
Forest Cover
As with Census Income, the Forest Cover data came from the UCI KDD Archive [Hett 00].
The first 11 dimensions in the data set were extracted using a Java application and the re-
sulting comma separated file was sorted with its duplicate records removed. On inspection,
there were a few corrupted records containing parentheses, ampersands or periods, which
were handled as described in Appendix A.3. Cardinalities of the dimensions of cube F1
range from 2 to 5 786. The cube has 580 950 cells.
Weather
Surface synoptic weather reports for the entire globe for the period December 1st, 1981 to
November 30th, 1991 were downloaded from The Carbon Dioxide Information Analysis
Center (CDIAC) [Hahn 04]. Details of the extraction process and a full description of
the data are given in Appendix A.6 and Table A.3. The cardinalities of the dimensions of
Table 5.4: Statistics of the synthetic data cubes.
Cube  dimensions  |C|  ∑i ni
U1 3 999 987 10 773
U2 4 1 000 000 14 364
U3 10 1 000 000 35 910
S1 3 939 153 10 505
S2 4 999 647 14 296
S3 10 1 000 000 35 616
SS1 3 997 737 74 276
SS2 4 999 995 99 525
SS3 10 1 000 000 248 703
Synthetic Data

To investigate the effect that data distribution might have on the size and shape of diamonds, nine cubes of 1 000 000 cells with varying dimensionality and distribution were
constructed. The first set comprised three cubes whose values in each dimension were
independently generated from a uniform distribution on the range 0 to 3591. This range was chosen to be comparable with the ∑i ni of cubes S1, S2 and S3. For the three-dimensional cube (U1), 13 duplicate cells were discarded, resulting in a cube of 999 987
cells. The 4-D (U2) and 10-D (U3) cubes had no duplicate cells.
As stated in Section 4.2, real data has often been observed to follow a Zipfian distribution
so the remaining data cubes were generated with a power distribution. Data were generated
using the following transformation method: let r be a random number uniformly distributed in the range 0 ≤ r < 1; then x = xmin(1 − r)^{−1/(α−1)} is a random power-law-distributed real number in the range xmin ≤ x < ∞ with exponent α [Newm 05].

[Figure 5.1: The general preprocessing method. The CSV fact table is preprocessed into a histogram file and a binary file for binary processing, and a text file of d hash tables for string processing.]

Cubes SS1, SS2 and SS3 were generated with an exponent of 2.0 in
order to simulate the 80–20 relationship. The remaining cubes (S1, S2, S3) were gener-
ated with an exponent of 3.5 to vary the skew. Indeed, Newman [Newm 05] observed that
the number of copies of best selling books sold between 1895 and 1965 followed power
distribution with exponent 3.5. Statistics of the synthetic data cubes are given in Table 5.4.
All of the data cubes used in the experiments were stored in comma-separated (CSV)
flat files. Since we do not expect our algorithms to be sensitive to the row order, we did
not reorder the rows prior to processing. As discussed in Chapter 4, the basic algorithm
for building a diamond requires a COUNT (resp. SUM ) of each attribute value in each
iteration. Although the COUNT aggregator is referenced in the following discussion, it
applies equally to the SUM aggregator.
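As an illustration of this per-iteration pruning, here is a minimal in-memory COUNT-based dicer in Java. It is a sketch of the iteration just described, not the thesis's SIDD, IPDD or SQLDD implementations, and the cell representation (one int[] of normalised attribute values per allocated cell) is our assumption:

```java
import java.util.*;

// Minimal COUNT-based diamond dicer: repeatedly drop every cell that has
// an attribute value occurring in fewer than k allocated cells, until no
// cell is removed.
public class DiamondDicer {
    public static List<int[]> dice(List<int[]> cells, int d, int k) {
        List<int[]> current = new ArrayList<>(cells);
        boolean changed = true;
        while (changed) {
            // One pass: count cells per (dimension, attribute value).
            List<Map<Integer, Integer>> counts = new ArrayList<>();
            for (int i = 0; i < d; i++) counts.add(new HashMap<>());
            for (int[] cell : current)
                for (int i = 0; i < d; i++)
                    counts.get(i).merge(cell[i], 1, Integer::sum);
            // Keep only cells all of whose attribute values have count >= k.
            List<int[]> next = new ArrayList<>();
            for (int[] cell : current) {
                boolean keep = true;
                for (int i = 0; i < d; i++)
                    if (counts.get(i).get(cell[i]) < k) { keep = false; break; }
                if (keep) next.add(cell);
            }
            changed = next.size() < current.size();
            current = next;
        }
        return current;
    }
}
```

A SUM-based variant would accumulate the measure instead of incrementing by one.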
The algorithms used in our experiments required different preprocessing of the cubes and
the general preprocessing method is illustrated in Figure 5.1. For Algorithm SIDD an
in-memory data structure was used to maintain counts of the attribute values. Algo-
rithm IPDD used d sorted files and Algorithm SQLDD referenced the cube stored in a
relational database. Consequently, the preprocessor wrote different kinds of data to sup-
plementary files depending on which algorithm was to be used.
• Algorithm SIDD
The preprocessor creates d hash tables mapping attribute values to their counts and
all attribute values are treated as character strings. The mappings are converted to
strings and stored in a flat file. Each (key, value) pair is delimited by a tab and stored
on a separate line. A comma-separated file of normalised integers is also generated,
which is used as input to the SQL-based Algorithm SQLDD.
For the binary variation of Algorithm SIDD, the preprocessor creates two binary
files: a normalised file of packed integers, and one storing d ragged arrays mapping
the normalised attribute values to their counts (a histogram).
• Algorithm IPDD
Kumar’s approach was to use two sorted files, one per dimension, in his Web-
trawling algorithm. This work generalises his algorithm to d dimensions, so the
preprocessor for Algorithm IPDD creates d files each sorted on a different dimen-
sion. Two sets of files are output: one of normalised binary data and the other,
character data that are not normalised.
• Algorithm SQLDD
The preprocessor for SQL1 creates a table in a local database and loads the comma-
separated file into the table. It also makes a copy of the table as this algorithm deletes
cells from the original cube. The preprocessor for SQL2/SQL3/SQL4 creates a
Table 5.5: Wall-clock times (in seconds) for preprocessing each real-world data set. DB-
Count and DBSum implement Algorithm SIDD for both string and binary data. A ‘—’
indicates that this algorithm was not applied to the corresponding data cube.
The preprocessing of the cubes was timed separately from diamond building. Prepro-
cessed data could be used many times, varying the value for k, without incurring addi-
tional preparation costs. Table 5.5 summarises the times needed to preprocess each cube in
preparation for the algorithms that were run against it. For comparison, sorting the Netflix
comma-separated data file—using the GNU sort utility—took 29 minutes. All preproces-
sors were written in less than 250 lines of Java code using SDK 1.5.0. The timings were
taken under the same settings and on the same machine described in Section 5.2.
Two algorithms did not fare as well in computing diamonds as the others. Kumar (Algorithm IPDD) did not work well with high-dimensional cubes: it required a sorted copy of the cube for each dimension, which incurred additional space and I/O costs. SQL1 was
much slower than the other SQL variants: we only ran it against a few cubes. We did not
implement a SUM-diamond-builder with Kumar or the SQL variants and chose a subset of
the remaining cubes against which to run SQL2 and SQL3.
To test the theory of Chapter 3, and in particular Proposition 3.6, the κ-diamond was built
for each of the data sets. Three strategies had been identified to find κ. All approaches used the value calculated from Proposition 3.6 as the initial guess k for κ. When a non-empty diamond was built (and for every data set this was indeed the case), k was repeatedly doubled until an empty cube was returned, establishing a tighter range for κ.
The first approach was a basic binary search, which used the newly discovered lower and
upper bounds as the end points of the search space. Each time a non-empty diamond
was returned, it was used as the input to the next iteration of the search. When the guess
overshot κ and an empty diamond was returned, the most recent non-empty cube was used
as the input.
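The doubling-plus-binary-search strategy can be sketched as follows. The oracle interface (nonEmpty) is our abstraction over diamond building, the sketch assumes the initial guess yields a non-empty diamond (as it did for every data set), and it omits the optimisation of feeding each non-empty diamond into the next iteration:

```java
import java.util.function.IntPredicate;

// Find kappa, the largest k whose diamond is non-empty: double k from a
// known-good lower bound until the diamond is empty, then binary-search
// the bracketed range. nonEmpty.test(k) must be monotone in k.
public class KappaSearch {
    public static int findKappa(IntPredicate nonEmpty, int lowerBound) {
        int lo = lowerBound;          // assumed non-empty
        int hi = lo * 2;
        while (nonEmpty.test(hi)) {   // repeated doubling finds an upper bound
            lo = hi;
            hi *= 2;
        }
        while (hi - lo > 1) {         // invariant: lo non-empty, hi empty
            int mid = lo + (hi - lo) / 2;
            if (nonEmpty.test(mid)) lo = mid; else hi = mid;
        }
        return lo;                    // largest k with a non-empty diamond
    }
}
```

On the K1 numbers reported below, the doubling phase from 10 829 reaches 86 632 (non-empty) and then 173 264 (empty), exactly the bracket described in Section 5.5.1.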
The second approach also made use of the new lower bounds that had been established
for κ. This time k was incremented by one as long as a non-empty diamond was returned.
In this way, the previously generated diamond could be reused at every iteration. Despite
not having to start with the original cube, except on the first iteration, it quickly became
apparent that this approach was not practical. If left to continue, it would have taken approximately 60 days to establish κ for K1: each iteration took between 100 and 200 seconds, and 34 918 iterations would have been executed before returning an empty diamond.
The third approach was to use an adaptation of the Pegasus method that was discussed in
Chapter 4. Unfortunately, this approach failed to produce a better result than the simple
binary search. A zero for the ‘function’ is already known (an empty diamond was pro-
duced at the upper boundary) and applying Equation 4.1 directly gives the known upper
boundary as the next value for x3. If f2 is given an arbitrary value, say −f1, then the search range is effectively halved, producing a result similar to binary search. By choosing a value for f2 close to zero (resp. much smaller than −f1), several values close to the upper boundary (resp. the lower boundary) were searched, and more iterations were required to find κ
than when using binary search. Table 5.6 illustrates the results from applying a modified
Pegasus algorithm to cube C1.
Table 5.6: The modified Pegasus search was no more efficient than binary search.
search iters
modified Pegasus (−f1 ) 9
modified Pegasus (−1) 20
modified Pegasus (2 ∗ −f1 ) 13
binary 8
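For reference, Equation 4.1 is assumed here to be the standard secant-style update used by regula-falsi and Pegasus methods, with (x1, f1) and (x2, f2) the current bracketing points:

```latex
x_3 = x_2 - f_2\,\frac{x_2 - x_1}{f_2 - f_1}
```

With f2 = 0 (an empty diamond at the upper boundary treated as a zero of the function) this yields x3 = x2, the known upper boundary; with f2 = −f1 it yields the midpoint (x1 + x2)/2, which is exactly a binary-search step, consistent with the behaviour reported above.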
The modified binary search approach was used to establish that κ is 125 433 for cube
K1. The initial k-value chosen was the lower bound calculated from the equation given
in Proposition 3.6.
This did indeed return a non-empty diamond and the k-value was doubled repeatedly until
an empty diamond resulted (173 264). An upper bound for κ had now been established.
Since a non-empty diamond had been returned when using k = 86 632, a binary search
was executed over the range 86 632 to 173 264.
Table 5.7: The number of iterations it took to determine κ for COUNT-based diamonds.
cube   iters to convergence   estimated κ   κ
TW1 6 19 38
TW2 10 5 37
NF1 19 197 1 004
NF3 17 39 272
C1 8 282 672
F1 29 34 92
W1 13 2 447 4 492
W3 17 50 178
K1 8 10 829 125 433
K3 6 637 042 9 464 226
K5 6 2 900 6 603
K6 13 3 500 21 170
K7 11 4 182 25 091
K8 5 5 000 54 769
Table 5.8: The number of iterations it took to determine κ on SUM-based diamonds. The
estimate for κ is the tight lower bound from Proposition 3.7.
cube   iters to convergence   estimated κ   κ
TW3 3 85 85
NF2 40 5 3 483
C2 5 1 853 3 600 675
C3 7 9 999 71 464
W2 19 32 20 103
K2 3 1 216 437 1 216 437
K4 5 2 9 610 236
Table 5.9: Values remaining in the K3 κ-diamond.
A modified binary search as described for K1 was executed for K3, which established that
the actual value is 9 464 226. As shown in Table 5.10, the cardinalities of the κ-diamond
are quite small (7–10) and the values for dimensions 1, 6 and 10 are given in Table 5.9. The
values for κ(K1) can be found in Appendix E. The definition of a verse used to generate the King James Bible cubes produced an interesting result: a single large paragraph from the book of Joshua, chapter 15, gave rise to the κ(K1) diamond.
Following the procedure outlined in Section 5.5.1, κ was established for the other data
cubes: statistics are provided in Table 5.7. The estimate of κ comes from Proposition 3.6
and the iterations recorded are the number needed to find the κ-diamond. The estimates for
κ varied between 4% and 50% of the actual value and there is no clear pattern to indicate
why this might be. Two very different cubes both had estimates that were 50% of the
actual value: TW1, a small cube of less than 2 000 cells and low dimensionality, and W1,
a large cube of 123 × 10⁶ cells with moderate dimensionality.
Table 5.10: Statistics for real-data cubes and their respective κ-diamonds.

           dimension sizes                                                             cells        % captured by κ-diamond
NF1        17 766 × 480 189 × 2 182                                                    100 479 586
κ-diamond  3 082 × 6 833 × 1 351                                                       8 654 370    9%
K1         8 246 × 8 387 × 8 416 × 8 504                                               363 412 308
κ-diamond  66 × 72 × 72 × 65                                                           14 441 597   4%
K3         145 × 165 × 154 × 152 × 156 × 158 × 158 × 157 × 153 × 160                   986 145 104
κ-diamond  8 × 10 × 10 × 10 × 10 × 9 × 10 × 11 × 9 × 7                                 293 468 482  30%
W1         28 767 × 2 × 4 337 × 6 344 × 8 696 × 101 × 13 × 14 × 11 × 10 × 1 799 × 226  123 245 878
κ-diamond  12 155 × 2 × 3 433 × 4 746 × 5 990 × 13 × 14 × 11 × 10 × 1 591 × 162        58 389 402   47%
C1         91 × 9 × 52 × 47 × 17 × 3 × 7 × 24 × 15 × 5 × 10 × 2 × 3 × 6 × 8 × 6 × 6 × 51 × 38 × 8 × 10 × 9 × 10 × 3 × 4 × 7 × 53  135 753
κ-diamond  48 × 7 × 25 × 27 × 12 × 3 × 5 × 20 × 13 × 4 × 6 × 2 × 3 × 1 × 4 × 5 × 1 × 1 × 7 × 6 × 2 × 2 × 2 × 2 × 2 × 7 × 9        57 536       42%
F1         1 978 × 361 × 67 × 553 × 701 × 5 786 × 207 × 185 × 256 × 5 827 × 2          580 950
κ-diamond  908 × 360 × 37 × 175 × 226 × 998 × 120 × 92 × 185 × 2                       175 275      30%
5.6 Finding κ(C) for SUM-based Diamonds
From Proposition 3.8, we have that min_i(max_j(σ(slice_j(D_i)))) is an upper bound for κ(C) of any sum-diamond, and from Proposition 3.7 a lower bound is the maximum value
stored in any cell. Indeed, for cubes TW3 and K2 the lower bound was the κ value. For
this reason, the approach to finding κ for the sum-diamonds varies slightly in that the first guess for k should be the lower bound + 1. If this returns a non-empty diamond, then a binary search over the range [lower bound + 1, upper bound] is used to find κ. Statistics
are given in Table 5.8. Although for two cubes the lower bound coincided with their κ(C),
in cube K4 there was no single cell (or slice) that dominated the data, since the maximum
cell value was 2. In a cube with 10 dimensions and ≈ 9 × 10⁸ cells, this resulted in the lower bound being six orders of magnitude from κ(K4).
Neither the initial file size nor the number of cells pruned from each k-diamond alone
explains the time necessary to generate each diamond. As can be seen in Figure 5.2 more
time is expended when k is close to κ. When k is much larger than κ we converge faster
because more attribute values are pruned in early iterations, thus quickly reducing the
file size. We also see the converse, a few iterations over large files may take more time
than several iterations over small files, as evidenced by the time taken to process the five
iterations when k = 282 on cube C1. In fact, the observed run-time was proportional to
the total number of cells processed, as illustrated in Figure 5.3.
Table 5.11 compares the speed of the string and binary versions of Algorithm SIDD run
against the larger data sets. Times were averaged over ten runs and then normalised against
the binary version, which was up to three times faster. File sizes of the binary data were
smaller and working only with primitive data types meant that there was no expensive
object manipulation necessary. In Table 5.12 we see that Kumar—Algorithm IPDD—
[Plots: time in seconds vs. iterations for cubes C1 and W1, curves labelled with their k-values; κ(C1) = 672 and κ(W1) = 4 492 are marked.]
Figure 5.2: Times and iterations needed to generate diamonds with different k-values
(labelled). In each case more time and iterations are required for k-values that slightly
exceed κ.
[Figure 5.3: total number of cells processed vs. time in seconds for cubes C1 and W1, points labelled with their k-values.]
Table 5.11: Comparison of string and binary algorithms to compute the κ-diamond. Times
were averaged over ten runs and then normalised against the binary execution times. All
times reported are in seconds.
was between 1.4 and 27 times slower than the binary implementation. As discussed in
Section 4.2.4, we tested four alternative implementations of the SQL-based approach—
Algorithm SQLDD. The first version (SQL1) deleted cells from a copy of the fact table.
It had very poor performance—up to 175 times slower than our binary implementation
for cube NF3. A second version (SQL2), one that updated a boolean-valued attribute
when cells were pruned—instead of deleting the row—had times comparable to the string
version for the smaller cubes, but its performance still lagged behind our binary imple-
mentation on the larger cubes. A third version (SQL3) rebuilt the table when 50% of the
cells were marked for deletion. It did have improved running times, compared to SQL2,
over the large cubes, so we implemented a modified version that rebuilt the table whenever
75% of the remaining cells were marked for deletion. This variation showed an improved
performance of 10% over SQL3 for cube K6 and 40% for cube K8. However, SQL4 was
still more than 25 times slower than our binary implementation on the large data cubes.
Preprocessing times for each SQL version are given in Table 5.13. Generally, building indexes
significantly increases preprocessing time. Processing times are given in Table 5.14. We
generated cubes K5, K6, K7 and K8 to see how the conclusions would change as more and more data was handled. We see in Table 5.15 that the trend in slowdowns is consistent. The fluctuation in binary times is explained by the total number of cells processed for each cube: more than 106 × 10⁶ for cube K5, 79 × 10⁶ for cube K6, 103 × 10⁶ for cube
Table 5.12: Relative slowdown of algorithms compared to the binary version. Times were averaged over ten runs and then normalised against the binary execution times. SQL processing for NF1 was forcibly terminated after 26 hours.
K7 and 68 × 10⁶ for cube K8. It is not clear why the difference in slowdown between K5
and K6 jumps an order of magnitude for SQL2: the query plans executed by MySQL were
identical. Also, experiments that reduced the amount of RAM available when processing K5 did not result in a measurable difference in execution time. However, more than twice
as many iterations were required for K6 to converge as K5; at the same time 30% more
data was processed.
We did not run SQL4 against cubes whose diamond retained more than 75% of their cells.
To test the effect of different database engines on our code, we also implemented SQL4
for Microsoft SQLServer and compared it with a MySQL version. These experiments
were carried out on a different machine than the others, and the results are presented in
Appendix B.1.1. We found that MySQL was nearly twice as fast as SQL Server for this
particular test.
In our real-data experiments the size (in cells) of the κ-diamond of the high-dimensional
cubes is large, e.g. the κ-diamond for K3 captures almost 30% of the data. How can
we explain this? Is this property a function of the number of dimensions? To answer this
question the κ-diamond was generated for each of the synthetic cubes. Estimated κ, its real
value and the size in cells for each cube are given in Table 5.16. The κ-diamond captures
99% of the data in cubes U1, U2 and U3—dimensionality has no effect on diamond size
Table 5.13: Comparison of preprocessing times for SQL implementations. Times are
averaged over 10 runs. All times reported are in seconds.
SQL1 SQL2/SQL3/SQL4
TW1  4.0 × 10⁻¹   4.0 × 10⁻¹
TW2  8.0 × 10⁻¹   4.0 × 10⁻¹
NF3  2.5 × 10¹    8.8 × 10²
C1   7.0 × 10⁻¹   5.0 × 10⁰
C4   2.0 × 10¹    1.9 × 10²
F1   1.5 × 10⁰    4.1 × 10¹
W3   2.3 × 10⁰    7.5 × 10¹
K5   2.2 × 10¹    1.1 × 10³
K6   —            1.4 × 10³
K7   —            1.8 × 10³
K8   —            2.2 × 10³
Table 5.14: Comparison of SQL implementations. Times are averaged over 10 runs. For
cube K5 (†), fewer than 50% of cells were marked for deletion, so there was no table rebuild.
All times reported are in seconds.
Table 5.15: Relative slowdown of the SQL algorithms compared to the binary implemen-
tation.
K5 K6 K7 K8
binary time (secs) 150 91 120 65
SQL2 slowdown 3.2 38 39 95
SQL3 slowdown 3.2 30 32 88
SQL4 slowdown — 26 27 53
Table 5.16: High dimensionality does not affect diamond size.
for these uniformly distributed data sets. Likewise, dimensionality did not affect the size
of the κ-diamond for the skewed data cubes as it captured between 23% and 26% of the
data in cubes S1, S2 and S3, and between 12% and 16.5% in the other cubes. These results indicate that the distribution of the data, more than its dimensionality, determines how much of the data is captured by a diamond dice.
[Figure 5.4: cells left vs. iteration. (a) Cube NF1, computing 1 004-, 1 005- and 1 006-carat diamonds. (b) Cube W2, computing 20 103-, 20 104-, and 20 105-carat diamonds.]
Figure 5.4 shows the number of cells remaining after each iteration for 1 004–1 006 carats. The initial number of cells in each cube is not shown in the figure (84% of the cells were removed during the first iteration of NF1 and 46% from W2); including them would distort the plots. The curve for 1 006 reaches
zero first, followed by that for 1 005. Since κ(NF1) = 1 004, that curve stabilises at a
nonzero value. We see a similar result for W2 in Figure 5.4b where κ(W2) = 20 103. It
takes longer to reach a critical point when k only slightly exceeds κ.
Intuitively, one should not be surprised that more iterations are required when k ≈ κ(C):
attribute values that are almost in the diamond are especially sensitive to other attribute
values that are also almost in the diamond.
The number of iterations required until convergence for all our synthetic cubes was also far
smaller than the upper bound, e.g. cube S3: 35 616 (upper bound) and 14 (actual). We had
expected to see the uniformly distributed data taking longer to converge than the skewed
data (Section 4.2.2). This was not the case: in fact the opposite behaviour was observed.
(See Table 5.16.) For cubes U1, U2 and U3 the diamond captured 98% of the cube: less than 23 000 cells were removed, suggesting that they started with a structure very like a diamond. For the skewed data cubes (S1, S2, S3, SS1, SS2 and SS3), the diamond was more “hidden”. Figure 5.5 shows how many cells were removed each iteration from
U1 and SS1. The figures are plotted on a log scale, since more than 73% of the cells were
removed on the first iteration for both U1 and SS1.
How robust is diamond dicing against random noise that models the data-warehouse prob-
lem [Thom 02] of missing data? In Section 3.6 we saw that, in the worst case, dealloca-
tion of a single cell could dramatically affect the diamond. How sensitive is κ to varying
amounts of missing data in real data cubes?
We experimented with cubes TW1, F1, NF3 and SS2 and built new cubes, which are
[Figure 5.5: cells removed per iteration, log scale. (a) Cube SS1. (b) Cube U1.]
collectively labelled as TW1′, F1′, NF3′ and SS2′ in Table 5.17. Existing data points had an independent probability p_missing of being omitted from the data set. We determined κ for the modified cubes, for 30 tests each, with p_missing values between 1% and 5%. The
results are presented in Table 5.17. In each case, κ(C) of the modified cube was less than
the original κ-value. For example, of the 30 runs for cube NF3 with 5% missing data, 23
times κ(NF3′) was 259, 4 times it was 260 and for the remaining 3 it was 258: so κ(NF3′) of each modified cube was within 6% of the true value (272). Although noise degraded the result for all cubes, the worst observed was for TW1, where κ(TW1′) was only once more than 8% different from κ(TW1). These results suggest that diamond dicing in general, and
κ in particular, is robust against modest amounts of missing data.
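The deallocation step of this experiment can be sketched as follows; the class name is ours, and recomputing κ on the damaged cube is not shown:

```java
import java.util.*;

// Each allocated cell is dropped independently with probability pMissing,
// modelling randomly missing data; kappa would then be recomputed on the
// damaged cube.
public class MissingData {
    public static List<int[]> damage(List<int[]> cells, double pMissing, long seed) {
        Random rng = new Random(seed);
        List<int[]> kept = new ArrayList<>();
        for (int[] cell : cells)
            if (rng.nextDouble() >= pMissing) kept.add(cell);
        return kept;
    }
}
```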
Table 5.17: Robustness of κ under various amounts of randomly missing data: for each probability, 30 trials were made. Each column is a histogram of the observed values.
(a) κ(SS2) is 175 with no missing data

κ(SS2′)   Prob. of cell’s deallocation
          1%   2%   3%   4%   5%
175
174        1
173       29
172             6
171            24
170                  12
169                  18
168                       22
167                        8    7
166                            21
165                             2
164

(b) κ(NF3) is 272 with no missing data

κ(NF3′)   Prob. of cell’s deallocation
          1%   2%   3%   4%   5%
270       28
269        2
268             3
267            27
265                   8
264                  22
263                        3
262                       12
261                       15
260                             4
259                            23
258                             3

(c) κ(TW1) is 38 with no missing data

κ(TW1′)   Prob. of cell’s deallocation
          1%   2%   3%   4%   5%
38        19   12    3    2
37        10   17   17   10    4
36         1    1   10   16   18
35                         2    7
34                              1

(d) κ(F1) is 92 with no missing data

κ(F1′)    Prob. of cell’s deallocation
          1%   2%   3%   4%   5%
91        30
90             30
89                  30    1
88                       29    1
87                             29
Chapter 6
Related Problems
Finding dense subcubes in large data is a difficult problem [Lee 06, Barb 01, Ben 06a].
We investigate the role that diamond cubes might play in the solution of three NP-hard
problems, each of which seeks a dense subcube, and show that the diamond—while per-
haps not providing an exact solution—is a good starting point. The first problem, finding
the Largest Perfect Subcube, assumes use of the aggregator COUNT whilst for the remain-
ing problems—Densest Cube with Limited Dimensions and Heaviest Cube with Limited
Dimensions—SUM is assumed.
A perfect cube contains no empty cells, and the perfect cube of size (n1 × n2 × · · · × nd) is a k1, k2, . . . , kd-diamond where each ki = ∏_{j=1, j≠i}^{d} nj. In Proposition 6.1, we show that finding the largest perfect diamond is NP-hard. A motivation for this problem is
found in Formal Concept Analysis. (See Section 2.4.3.) For example, consider the two-
dimensional cube of Figure 4.1. The set of countries and actions remaining in a perfect
diamond can be thought of as a concept—the largest set of countries (objects) where a
variety of terrorist activities (attributes) occurred.
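Since each attribute value of dimension i of a perfect cube appears in ∏_{j≠i} nj cells, the carat vector follows directly from the cube's volume; a small Java sketch (class name ours):

```java
// Carats of a perfect cube of size n1 x n2 x ... x nd: every attribute
// value of dimension i appears in volume / n_i = prod_{j != i} n_j cells,
// so the perfect cube is a k1,...,kd-diamond with k_i = volume / n_i.
public class PerfectCube {
    public static long[] carats(int[] n) {
        long volume = 1;
        for (int ni : n) volume *= ni;
        long[] k = new long[n.length];
        for (int i = 0; i < n.length; i++) k[i] = volume / n[i];
        return k;
    }
}
```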
A subcube S of a cube C is obtained by applying the dice operator to C: specifying attribute values in one or more dimensions to retain in the subcube.
Proposition 6.1. Finding a perfect subcube with largest volume is NP-hard, even in two
dimensions.
Finding a diamond might be part of a sensible heuristic to solve this problem. We show in
the next lemma that if a large perfect subcube exists within a data cube, then there is also
a nontrivial large diamond.
This helps in two ways: if there is a nontrivial diamond of the specified size s1 × s2 × · · · × sd, we can search for the perfect subcube within it; however, if there is only an empty diamond of the specified size, there is no perfect subcube.
In the OLAP context, given a cube, a user may ask to “find a subcube with 10 attribute
values per dimension.” Meanwhile, he may want the resulting subcube to have maximal
average: he is, perhaps, looking for the 10 attributes from each dimension that, in combination, give the greatest profit. We call this problem the HEAVIEST CUBE WITH LIMITED DIMENSIONS (HCLD), which we formalise as: pick min(ni, pi) attribute values for dimension Di, for all i's, so that the resulting subcube has maximal average, where ni is
the size of dimension i and pi is the required number of attribute values for dimension
i and unallocated cells are counted as zeros. This problem does not restrict the number
of attribute values (pi ) to be the same for each dimension. We have that the HCLD must
intersect with diamonds.
Proposition 6.2. A cube without any non-empty k1, k2, . . . , kd-carat subcube has sum less than ∑_{i=1}^{d}(ni − 1)ki + max(k1, k2, . . . , kd), where the cube has size n1 × n2 × · · · × nd.
at which point there is only one cell left. The maximum value this remaining cell can take is max(k1, k2, . . . , kd). Hence, the sum of the cube is less than ∑_{i=1}^{d}(ni − 1)ki + max(k1, k2, . . . , kd).
Corollary 6.1. Any solution to the HCLD problem having size s1 × s2 × · · · × sd and average greater than

(∑_{i=1}^{d}(si − 1)ki + max(k1, k2, . . . , kd)) / ∏_{i=1}^{d} si

must share a non-empty k1, k2, . . . , kd-sum-carat cube in common with the k1, k2, . . . , kd-sum-diamond.
Proof. The denominator is the volume of the HCLD solution. From Proposition 6.2 we have that a cube without any k1, k2, . . . , kd-sum-carat cube has sum less than ∑_{i=1}^{d}(si − 1)ki + max(k1, k2, . . . , kd), and hence average no greater than

(∑_{i=1}^{d}(si − 1)ki + max(k1, k2, . . . , kd)) / ∏_{i=1}^{d} si .
A special case of the HCLD problem, where all measures have the value 1, is the DENSEST CUBE WITH LIMITED DIMENSIONS (DCLD). We formalise it as: pick min(ni, pi) attribute values for dimension Di, for all i's, so that the resulting subcube is maximally dense, where ni is the size of dimension i and pi is the required number of attribute values for dimension i.
Intuitively, a densest cube should at least contain a diamond. We proceed to show that a
sufficiently dense cube always contains a diamond with a large number of carats. From
Proposition 3.1 we have that a cube that does not contain a k1, k2, . . . , kd-carat subcube has size at most 1 + ∑_{i=1}^{d}(ki − 1)(ni − 1) and density at most (1 + ∑_{i=1}^{d}(ki − 1)(ni − 1)) / ∏_{i=1}^{d} ni.
Corollary 6.2. Any solution of the DCLD problem having density above

(1 + (k − 1) ∑_{i=1}^{d}(min(ni, pi) − 1)) / ∏_{i=1}^{d} min(ni, pi)

must share a non-empty k-carat cube in common with the k-carat diamond.
Proof. The denominator is the volume of the DCLD solution and from Corollary 3.1, we
have that a cube having more than 1 + (k − 1) ∑_{i=1}^{d}(ni − 1) allocated cells must contain a k-carat cube. We also have, from the union property of diamonds (Proposition 3.1), that
any cube with k carats is a subcube of the k-diamond. Thus, the intersection between the
DCLD solution and the k-carat diamond cube contains a k-carat cube.
1 1 1 1 1
1 1 1 1 1
1 1
1 1
1 1
          1 1 1
          1 1 1
          1 1 1
          1 1 1
          1 1 1
Figure 6.1: Example showing that a diamond (bottom-right quadrant) may not have opti-
mal density.
When ni ≥ p for all i, the density threshold of Corollary 6.2 is (1 + (k − 1)d(p − 1))/p^d:
this value goes to zero exponentially as the number of dimensions increases. Thus, as the
number of dimensions increases, a DCLD solution is guaranteed to intersect with the k-
carat diamond.
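As a numeric illustration of this decay (class name ours): for k = 3 and p = 10, the threshold is 0.37 at d = 2 but only about 9.1 × 10⁻⁴ at d = 5.

```java
public class DensityThreshold {
    // Density threshold of Corollary 6.2 in the special case n_i >= p
    // for all i: (1 + (k - 1) * d * (p - 1)) / p^d.
    public static double threshold(int k, int d, int p) {
        return (1.0 + (k - 1.0) * d * (p - 1.0)) / Math.pow(p, d);
    }
}
```

The numerator grows only linearly in d while the denominator grows exponentially, which is why any moderately dense high-dimensional DCLD solution must intersect the k-carat diamond.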
We might hope that when the dimensions of the diamond coincide with the required di-
mensions of the densest cube, we would have a solution to the DCLD problem. Alas, this
is not true. Consider the 2-D cube in Figure 6.1. The bottom-right quadrant forms the
largest 3-carat subcube. In the bottom-right quadrant, there are 15 allocated cells whereas
in the upper-left quadrant there are 16 allocated cells. This proves the next result.
Lemma 6.2. Given a cube with size n1 × n2 × · · · × nd and required dimension sizes
p1 , p2 , . . . , pd , even if a COUNT -based diamond has exactly min(ni , pi ) attribute values
for dimension Di , for all i’s, it may still not be a solution to the DCLD problem.
We are interested in large data sets; the next proposition shows that solving DCLD and
HCLD is difficult.
Proof. The EXACT BALANCED PRIME NODE CARDINALITY DECISION PROBLEM (EBPNCD) is NP-complete [Dawa 01]: for a given bipartite graph G = (V1, V2, E) and a prime number p, does there exist a biclique (U1, U2) in G such that |U1| = p and |U2| = p?
Given an EBPNCD instance, construct a 2-D cube where each value of the first dimension
corresponds to a vertex of V1 , and each value of the second dimension corresponds to a
vertex of V2 . Fill the cell corresponding to v1 , v2 ∈ V1 × V2 with a measure value if and
only if v1 is connected to v2 . The solution of the DCLD problem applied to this cube with
a limit of p will be a biclique if such a biclique exists.
A graph problem that is closely related to both DCLD and HCLD is the DENSE SUBGRAPH
problem [Suzu 05,Geor 07]. The NP-hard bipartite DENSE SUBGRAPH problem is defined
as follows: for a bipartite graph (G = (U, V, E)) where each edge e has a non-negative
weight w(e), given m1 and m2 , find a subset X of U of size m1 and a subset Y of V of
size m2 such that the sum of the weights of the subgraph (X, Y, E) is maximised. The
unweighted dense subgraph problem has w(e) = 1 for each edge.
The DENSE SUBGRAPH problem is two-dimensional. The DCLD and HCLD problems are
not restricted to two dimensions. The densest subgraph can be found in polynomial
time [Gall 89]; the problem becomes NP-hard only when the size of the subgraph is
specified. The HCLD and DCLD problems require us to specify the number of
attribute values that should be present in any solution and the DENSE SUBGRAPH problem
also requires us to specify the size of the solution. The two-dimensional versions of DCLD
and HCLD are equivalent to the unweighted and weighted DENSE SUBGRAPH problems,
respectively.
The DCLD problem is NP-complete and determining the quality of a heuristic over even
moderately-sized data cubes poses difficulties. Therefore, two different approaches were
taken. In the first, an exact solution was computed for two small, synthetically-generated
three-dimensional cubes and compared with the solutions from
• a hybrid, Monte Carlo From Diamond (MCFD), which only differs from Algo-
rithm 6.1 in that it takes a diamond cube as input instead of the whole data cube.
Second, the heuristics were applied to cubes TW1 and F1. To compute an exact solution
for either of these cubes would be prohibitively time consuming.
input: d-dimensional cube C, integers p1 , p2 , . . . , pd
output: Cube with size p1 × p2 × . . . × pd
// Use binary search to find k
Find max k where the k-carat diamond ∆ has shape p′1 × p′2 × · · · × p′d , where
∀i, p′i ≥ pi
for i ← 1 to d do
    Sort slices of dimension i of ∆ by their σ values
    Retain only the top pi slices and discard the remainder from ∆
end
return ∆
Algorithm 6.2: DCLD heuristic that starts from a diamond (DCLDFD).
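The per-dimension trimming step of Algorithm 6.2 can be sketched as follows. This is an illustrative fragment, not the thesis implementation; it assumes the σ value of each slice has already been computed into a map, and omits the binary search for k:

```java
import java.util.*;

public class DcldfdTrim {
    // One trimming pass of the DCLDFD heuristic: given the sigma value of
    // each slice of a dimension, keep the attribute values of the top-p slices.
    public static List<String> topSlices(Map<String, Integer> sigma, int p) {
        List<String> attrs = new ArrayList<>(sigma.keySet());
        // sort slices by descending sigma value
        attrs.sort((a, b) -> sigma.get(b) - sigma.get(a));
        // retain only the top p slices; the remainder are discarded
        return attrs.subList(0, Math.min(p, attrs.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> sigma = new HashMap<>();
        sigma.put("shopA", 7); sigma.put("shopB", 3);
        sigma.put("shopC", 9); sigma.put("shopD", 5);
        System.out.println(topSlices(sigma, 2)); // the two densest slices
    }
}
```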
The first synthetic three-dimensional cube DCLD1 has 92 allocated cells, with attribute
values in the range 0 to 11 for each dimension. It was generated by simulating a roll
of a loaded 12-sided die d (=3) times, to get the coordinates of an allocated cell. The
probability of getting a 0 is 10/3 %, as is the probability of getting a 1 or of getting a 2. For 3,
or for any larger number, the probability of getting that number is 10%. The values were
generated using Algorithm C.1. In all, 100 cells were generated and duplicate cells were
removed. (See Appendix C.1.)
The exact solution to the DCLD(p = 8) problem was found for cube DCLD1 by enumer-
ating all possible cubes of the chosen size and keeping cubes with the maximum density.
The solution, which took over 24 hours to process, is a unique 60-cell cube with den-
sity 0.1172. The κ-diamond was established as the diamond that most closely satisfied
the DCLD constraint, with 9, 8 and 9 values for dimensions 1, 2 and 3 respectively. For
DCLD1 the κ-diamond contained the exact solution. DCLDFD captured 59 cells. A
refinement to the algorithm that allows for replacing attribute values when there are
equally likely candidates did find the exact solution. The three bottom-ranked attribute values for
dimension 1 had cardinality 5, so were equally likely candidates; likewise, for dimension
3, the bottom 2 ranked attribute values. By choosing each of these attribute values in turn
(6 combinations) and checking the resulting cube’s density, the exact solution was found.
The exact solution to DCLD(p = 8) was found at a cost of generating 6 cubes in contrast
to the 121 287 375 cubes generated by the brute force method.
Table 6.1: Comparison of DCLD algorithms.
A second three-dimensional cube, DCLD2, was generated using the same method as for
DCLD1. The die weights: the probability of getting a 0 was 50/3 %, as was the probability
of getting a 1 or of getting a 2. For 3, or for any larger number, the probability of get-
ting that number was 5.5%. Again 100 cells were generated, and this time six duplicates
were removed before processing. (See Appendix C.2.) Fourteen exact solutions to the
DCLD(p = 8) were generated using the brute force method, emphasising that the DCLD
problem need not have a unique solution. Eleven of the fourteen solutions were contained
within the κ-diamond, illustrating also that the diamond need not coincide with a DCLD
solution.
MCH gave poor solutions for cubes DCLD1 and DCLD2, capturing less than 80% of the
exact solution after 1 000 iterations. It was not considered again as a potential heuristic
for this problem.
It is infeasible to compute an exact solution to the DCLD problem, for all but the smallest
values of p, for cubes with non-trivial dimension cardinalities. However, applying the
heuristics to two real data cubes, TW1 and F1, we find that using the diamond as a starting
point gives a better solution than one generated using this Monte Carlo method. Results
for DCLD(p=90) on F1 and DCLD(p=5) on TW1 are given in Table 6.1.
MCFD, which uses the κ-diamond as its starting point, does not improve the result of
DCLDFD. Furthermore, it is more expensive to run. An execution of DCLDFD requires
a single pass over the data cube, unless there are equally likely candidates in the diamond,
whereas MCFD executes a pass over the data for each iteration. DCLDFD finds an accept-
able solution with little work. On F1 the time difference was significant: 17 seconds for
DCLDFD with 3 swaps and 43 seconds for MCFD. Running MCFD for 1 000 iterations
on this cube would take a very long time—over an hour—and there is no guarantee that
the effort would result in an improved solution.
In conclusion, we have shown that the diamond cube is a good approximation for a solution
to the DCLD problem. The results of our experiments also show that diamonds provide a
speedier solution compared with these Monte Carlo algorithms, whilst giving as good or
better accuracy.
Chapter 7
Conclusion
This dissertation establishes properties of the diamond and defines union and intersection
operations over diamonds. Algorithms to extract diamonds have been designed, imple-
mented and tested on real as well as artificially constructed data with predefined distri-
butions. We have also investigated the role diamonds might play in the solution of three
NP-hard problems that seek dense subcubes.
Our algorithms and implementations provide a timely solution to the multidimensional queries presented in this work. The
binary version of the Java code was more than 25 times faster than our most efficient SQL
implementation. Further, the algorithm and the Java implementation are simple, requiring
less than 275 lines of code.
Our experimental results suggest that diamond dicing and the statistic κ(C) are robust
against missing data.
Related Problems We have proved a relationship between diamonds and the solutions
to three NP-complete problems:
• Heaviest Cube with Limited Dimensions Any solution to the HCLD problem hav-
ing size s1 × s2 × · · · × sd and average greater than
(∑_{i=1}^{d} (si − 1)ki + max(k1 , k2 , . . . , kd )) / (∏_{i=1}^{d} si )
must intersect the k1 , k2 , . . . , kd -carat diamond.
• Densest Cube with Limited Dimensions Any solution of the DCLD problem hav-
ing density above
(1 + (k − 1) ∑_{i=1}^{d} (min(ni , pi ) − 1)) / (∏_{i=1}^{d} min(ni , pi ))
must share a non-empty k-carat cube in common with the k-carat diamond.
Our experiments with the diamond-based heuristic to solve the DCLD problem show that
the diamond cube is a good approximation for the solution.
7.1 Future Research Directions
In the course of diamond cube development and evaluation, several potential avenues for
future work have presented themselves. We have identified three broad areas: faster com-
putation, generalisations and applications of diamonds, together with a couple of variations:
fractional diamonds and the k-plex. A brief overview of these ideas follows.
Although it is faster to compute a diamond cube using our implementation than using the
standard relational DBMS operations or OLAP operators, the speed does not conform to
the OLAP goal of near constant time query execution. Different approaches could be taken
to improve execution speed: compress the data so that more of the cube can be retained in
memory; use multiple processors in parallel; or, if an approximate solution is sufficient,
we might process only a sample of the data.
We could also exploit the lattice by precomputing some key elements using parallel processing
and then refine the solution to the required diamond.
Approximate Diamonds Second, can we predict the size of a diamond of a given carat
number while examining only a sample of the data cube? Sampling is one technique that
has been used to provide approximate answers to long-running OLAP queries [Babc 03,
Vitt 98]. If we make assumptions about the distribution of the data, can we get a better
guess? Missaoui et al. [Miss 07] advocate using probabilistic models to query data cubes.
We have identified three new possible generalisations of diamonds: diamonds of higher
order; diamonds with pinned attributes; and diamonds allowing a limited number of negative
measures.
Diamonds of higher order We consider only diamonds where each slice holds an at-
tribute value constant from a single dimension. However, this concept could be gener-
alised to slices of order δ where a slice is the cube obtained when a single attribute value
is fixed in each of δ different dimensions. For example, a slice of order 0 is the entire
cube, a slice of order 1 is the more traditional definition of a slice [OLAP] and so on. For
a d-dimensional cube, a slice of order d is a single cell.
Diamonds are related to iceberg cubes and we could recover them by obtaining the slice
of order d where σ(x) returns the measure corresponding to cell x. The iceberg query,
introduced by Fang et al. [Fang 98], performs an aggregate function over a specified di-
mension list and then eliminates aggregate values below some specified threshold. Given
a relation R(attr1 , . . . , attrm , rest), the typical iceberg query has the form of Query 7.1.
Consider the Inventory relation of Figure 7.1. Suppose we are interested in the vehicles of
the same make and model in our inventory, but only those cases where there is more than
one available. An iceberg query and its result are given in Query 7.2.
SELECT attr1 ,...,attrm , count(rest)
FROM R
GROUP BY attr1 ,...,attrm
HAVING count(rest) > threshold
Query 7.1: The typical iceberg query.
Inventory
colour  make    model
red     Subaru  Forester
red     Toyota  Camry
red     Subaru  Outback
blue    Toyota  Corolla
blue    Mazda   Miata
blue    Subaru  Forester
silver  Mazda   626
silver  Mazda   Mazda 3
silver  Toyota  Corolla
Figure 7.1: The Inventory relation.
This result could be achieved with a diamond dice. Requiring that all slices of order d have
a sum of at least k would result in the iceberg that contains cells whose measures are at least k.
SELECT make, model, count(colour)
FROM Inventory
GROUP BY make, model
HAVING count(colour) ≥ 2

Query result:
make    model     count(colour)
Subaru  Forester  2
Toyota  Corolla   2

Query 7.2: Iceberg query applied to the Inventory relation of Figure 7.1 and its result.
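The effect of Query 7.2 can be reproduced in plain Java; the following is a minimal sketch over the Inventory rows of Figure 7.1 (the class name and the string-array row encoding are our assumptions):

```java
import java.util.*;

public class IcebergQuery {
    // GROUP BY make, model HAVING count(colour) >= threshold
    public static Map<String, Integer> iceberg(List<String[]> rows, int threshold) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] r : rows) { // r = {colour, make, model}
            counts.merge(r[1] + " " + r[2], 1, Integer::sum);
        }
        counts.values().removeIf(c -> c < threshold); // the HAVING clause
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> inventory = Arrays.asList(
            new String[]{"red", "Subaru", "Forester"},
            new String[]{"red", "Toyota", "Camry"},
            new String[]{"red", "Subaru", "Outback"},
            new String[]{"blue", "Toyota", "Corolla"},
            new String[]{"blue", "Mazda", "Miata"},
            new String[]{"blue", "Subaru", "Forester"},
            new String[]{"silver", "Mazda", "626"},
            new String[]{"silver", "Mazda", "Mazda 3"},
            new String[]{"silver", "Toyota", "Corolla"});
        System.out.println(iceberg(inventory, 2));
        // {Subaru Forester=2, Toyota Corolla=2}
    }
}
```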
cube. To illustrate, we return to the example provided in Example 1.1: A company wants
to close shops and terminate product lines simultaneously. The CEO wants the maximal
set of products and shops such that each shop would have sales greater than $400 000,
and each product would bring revenues of at least $100 000—assuming we terminate all
other shops and product lines. Further, suppose that the CEO’s nephews run certain stores
that cannot be closed, and the CEO’s nieces have financial interests in some of the prod-
uct lines. These stores and product lines must be kept; given that, how best to terminate
unconstrained product lines and stores?
Solution to a Restricted kSCP Can kSCP be solved with the restriction that only a con-
stant number of measures are negative? If the number of negative measures is limited, then
each potential removal/reinstatement of attribute values might be checked in a reasonable
time. There may still not be a unique solution to a restricted kSCP, but can we bound the
number of negative measures such that the problem is tractable?
7.1.3 Applications
The diamond identifies an important subset of a data cube and may be useful to supplement
analysis in different areas: as a preprocessor to prune less frequently used items before
applying more traditional data mining techniques, as a way to decompose a data cube, or
as a means of discovering important or closely associated words in a text.
Diamonds and Association Rules Mining One difficulty experienced by data mining
applications is that the search space is very large. Diamond dicing has potential as a
preprocessing step for association rules mining. It would provide a means to prune away
less important attributes. Association rules mining is well-studied [Agra 94, Lian 05] and
it is incorporated into commercial data mining applications [SAS, IBM]. It is often used
in conjunction with market basket data—a set of transactions where each transaction is a
set of items bought by an individual customer. Association rules mining generates a set
of rules with user-supplied parameters support and confidence. Given I the set of items
occurring in the set of transactions, D, an association rule is an implication of the form
X → Y , where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
The rule X → Y holds in the transaction set D with confidence c if c% of transactions in
D that contain X also contain Y . The rule X → Y has support s in the transaction set D
if s% of transactions in D contain X ∪ Y . One of the main disadvantages of association
rules mining is the often vast number of rules that are generated. It is insufficient to
increase the confidence and support thresholds to reduce the number of rules. For example,
by increasing the support threshold, we might reduce the number of itemsets—sets of
X ∪ Y —from which the rules are generated but that, of itself, does not guarantee that the
number of rules generated will be small enough for easy analysis [Hipp 00].
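The two definitions above translate directly into code; here is a minimal sketch, assuming transactions are represented as sets of item names (the class and method names are ours, and the metrics are returned as fractions rather than percentages):

```java
import java.util.*;

public class RuleMetrics {
    // support(X -> Y): fraction of transactions in D containing X ∪ Y
    public static double support(List<Set<String>> D, Set<String> X, Set<String> Y) {
        Set<String> xy = new HashSet<>(X);
        xy.addAll(Y);
        return (double) D.stream().filter(t -> t.containsAll(xy)).count() / D.size();
    }

    // confidence(X -> Y): among transactions containing X, the fraction that also contain Y
    public static double confidence(List<Set<String>> D, Set<String> X, Set<String> Y) {
        long withX = D.stream().filter(t -> t.containsAll(X)).count();
        Set<String> xy = new HashSet<>(X);
        xy.addAll(Y);
        long withXY = D.stream().filter(t -> t.containsAll(xy)).count();
        return (double) withXY / withX;
    }

    public static void main(String[] args) {
        List<Set<String>> D = List.of(
            Set.of("bread", "milk"),
            Set.of("bread", "milk", "eggs"),
            Set.of("bread", "eggs"),
            Set.of("milk"));
        Set<String> X = Set.of("bread"), Y = Set.of("milk");
        System.out.println(support(D, X, Y));     // 2 of 4 transactions
        System.out.println(confidence(D, X, Y));  // 2 of 3 bread transactions
    }
}
```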
One option for presenting the most relevant rules to the analyst involves pruning the data,
so that the rules are generated from the part of the cube of interest. The user is asked to
define the subcube on which the rules mining should execute. With this smaller area to
mine, the run time of the mining process is reduced [Ben 06b] and although the number
of rules generated may not be fewer, they will be taken from the subcube of interest to
the analyst. The diamond dice operator prunes less frequently used attributes from a data
cube. It can also be used to define a subcube over which association rules are generated.
One conjecture to investigate is that the sets of attributes remaining in the k1 , k2 , . . . , kd -
diamond suggest a level of rule-support: each attribute is related to ki combinations of
attributes in other dimensions.
Diamond cores In network analysis, one topological property of interest [Zhou 07] is
the k-core, obtained by repeatedly removing nodes of degree less than k—in the same way that the diamond
dice removes attribute values whose σ(slice) falls below k. The network is modelled
as an undirected graph, unlike in Kumar et al.'s work [Kuma 99], where the network is
modelled as a directed graph. In Zhou et al.’s approach, nodes are assigned to their c-
shells, i.e. a node belongs to a c-shell if it has degree c and it is connected to other nodes
of degree c. (See Figure 7.2.)

Figure 7.2: k-cores of a network [Zhou 07]. The panels show the 1-core, 2-core, 3-core and 4-core.

The k-core decomposition of a network disentangles its
hierarchical structure by progressively focusing on its central cores. Similarly, a data cube
could be decomposed by removing the largest core (κ-diamond); find the largest core in
the remainder . . . ; continue until all attribute values have been assigned into one core or
another. This concept has applications in social networking and text mining also [Feld 07].
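The peeling process that produces a k-core mirrors the fixed-point loop of diamond dicing; the following is a minimal sketch, assuming an adjacency-map representation of an undirected graph (all names are ours):

```java
import java.util.*;

public class KCore {
    // Iteratively remove vertices of degree < k; what remains is the k-core.
    // adj maps each vertex to its set of neighbours (undirected graph).
    public static Set<Integer> kCore(Map<Integer, Set<Integer>> adj, int k) {
        // work on a copy so the caller's graph is untouched
        Map<Integer, Set<Integer>> g = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> e : adj.entrySet())
            g.put(e.getKey(), new HashSet<>(e.getValue()));
        boolean removed = true;
        while (removed) {                 // same fixed-point loop as diamond dicing
            removed = false;
            for (Integer v : new ArrayList<>(g.keySet())) {
                if (g.get(v).size() < k) {
                    for (Integer u : g.get(v)) g.get(u).remove(v);
                    g.remove(v);
                    removed = true;
                }
            }
        }
        return g.keySet();
    }

    public static void main(String[] args) {
        // triangle 1-2-3 plus a pendant vertex 4 attached to 1
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        adj.put(1, new HashSet<>(Arrays.asList(2, 3, 4)));
        adj.put(2, new HashSet<>(Arrays.asList(1, 3)));
        adj.put(3, new HashSet<>(Arrays.asList(1, 2)));
        adj.put(4, new HashSet<>(Arrays.asList(1)));
        System.out.println(kCore(adj, 2)); // pendant vertex 4 is peeled away
    }
}
```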
Language Translation We saw in Section 2.5.1 that the diamond operator could be used
as an alternative to a top-k query for selecting items to present in a tag cloud. Similarly, can
the diamond help with language translation or other text analysis? Is there a data cube for
which the κ-diamond represents an important subset of words that can be used to deduce
a translation? For example, there is still much debate about how to decipher the symbols
found on tablets from the Indus River civilisation, circa 3000 BC [Raoa 09]. Is there a
way to build a cube such that a κ-diamond links words that are the most important/fre-
quent? We have seen that the diamond dice identified a specific verse in the Bible when
the κ-diamond was generated from a co-occurrence cube in Section 5.5.1. For another
example, evidence-based medicine is an approach to medical research that can involve
the systematic review of published articles using machine learning techniques [Kouz 09].
Is there a role for diamonds in processing the co-occurrence matrices used as one data
representation scheme for this systematic review?
k-plex In social network analysis, one way to relax the assumptions of a clique is to
allow an actor to be a member of a k-plex if they have ties to all but k other mem-
bers [Hann 05]. A similar concept could be applied to diamond dicing—a slice is per-
mitted if it has no more than k empty cells. Is this some kind of dual to our diamond?
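The relaxed slice test is straightforward to state in code; a hypothetical sketch over a binary slice (the class and method names are ours, not from the dissertation):

```java
public class KPlexSlices {
    // A slice is permitted if it has no more than k empty cells.
    public static boolean permitted(int[] slice, int k) {
        int empty = 0;
        for (int cell : slice) if (cell == 0) empty++;
        return empty <= k;
    }

    public static void main(String[] args) {
        int[][] cube = {{1, 1, 0, 1}, {1, 0, 0, 0}};
        System.out.println(permitted(cube[0], 1)); // one empty cell: kept
        System.out.println(permitted(cube[1], 1)); // three empty cells: removed
    }
}
```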
Glossary

diamond    the diamond is the union ∪A∈A A, where A is the set of all cubes (of some starting cube C) having k1 , k2 , . . . , kd carats.    30

dice    defines a subcube S by specifying attribute values D̂1 , D̂2 , . . . , D̂d where each D̂i ⊆ Di . The …    26

subcube    A subcube S, of a cube C, is obtained by applying the dice operator to C: specifying attribute values in one or more dimensions to retain in the subcube.    91, 111
Appendices
Appendix A
Details of the data cubes, and how they were extracted from the raw data, are given in the
following sections.
A.1 Census
Table A.1: Census Income data: dimensions and cardinality of dimensions. Shaded di-
mensions and measures are those retained for cubes C1, C2 and C3. Dimension number-
ing maps to those described in the file census-income.names [Hett 00].
dim  cardinality  dimension
d0 91 age
d1 9 class of worker
d2 52 detailed industry recode
d3 47 detailed occupation recode
d4 17 education
d6 3 enrol in edu inst last wk
d7 7 marital stat
d8 24 major industry code
d9 15 major occupation code
d10 5 race
d11 10 Hispanic origin
d12 2 sex
d13 3 member of a labour union
d14 6 reason for unemployment
d15 8 full or part time employment stat
d19 6 tax filer stat
d20 6 region of previous residence
d21 51 state of previous residence
d22 39 detailed household and family stat
d23 8 detailed household summary in household
d25 10 migration code-change in msa
d26 9 migration code-change in reg
d27 10 migration code-move within reg
d28 3 live in this house 1 year ago
d29 4 migration prev res in sunbelt
d30 7 num persons worked for employer
d39 53 weeks worked in year
measures used for C2 and C3
d5 1240 wage per hour
d18 1478 dividends from stocks
unused dimensions
d39 2 year
d16 132 capital gains
d17 113 capital losses
d24 99800 unknown
d30 5 family members under 18
d31 43 country of birth father
d32 43 country of birth mother
d33 43 country of birth self
d34 5 citizenship
d35 3 own business or self employed
d36 3 fill inc questionnaire for veteran’s admin
d37 3 veterans benefits
Command A.2 Remove duplicate rows from census data.
sort census_count.csv | uniq > census_count_u.csv
A.2 TWEED
The TWEED data set was downloaded from http://folk.uib.no/sspje/tweed.htm [Enge 07]
in SPSS format. It was transformed using SPSS to a comma-separated flat file, which was
the required input format for all our implementations. Details of the dimensions used are
given in Table A.2. The CSV file was loaded into a MySQL database.
Table A.2: Measures and dimensions of TWEED data. Shaded dimensions are those
retained for TW1. All dimensions were retained for cubes TW2 and TW3 (with total
people killed as the measure for TW3).
Dimension Dimension cardinality
d1 Day 32
d2 Month 13
d3 Year 53
d4 Country 16
d5 Type of agent 3
d6 Acting group 287
d7 Regional context of the agent 34
d8 Type of action 11
d31 State institution 6
d32 Kind of action 4
d33 Type of action by state 7
d34 Group against which the state action is directed 182
d50 Group’s attitude towards state 6
d51 Group’s ideological character 9
d52 Target of action 11
Measure
d49 total people killed
people from the acting group military police
civil servants politicians business executives
trade union leaders clergy other militants
civilians
total people injured
acting group military police
civil servants politicians business
trade union leaders clergy other militants
civilians
total people killed by state institution
group members other people
total people injured by state institution
group members other people
arrests convictions executions
total killed by non-state group at which the state directed an action
people from state institution others
total injured by non-state group
people from state institution others
A.3 Forest
Cube F1 was extracted from forestcover.dat, a file created by Liu [Liu 02] from Forest
CoverType stored in the UCI KDD archive [Hett 00]. The first eleven columns were copied
into ForestCover.csv, which was then sorted and duplicates removed:
We used the sed command to remove parentheses, ampersands and periods from the lines.
The resulting file has 580 950 lines.
A.4 King James
Unix utilities were used to extract a fact table from the 4-gram and 10-gram files (kj4d.csv
and kj10d.csv) [Kase 08] generated by the kingjames script—kingjames.py—described in
Section 5.3.1. First kingjames4d.csv was sorted and the unique records piped to kj4d.csv
(Command A.4).
Command A.4 Unique records from kingjames4d.csv sorted and piped to kj4d.csv
sort kingjames4d.csv | uniq -c > kj4d.csv
The ‘-c’ flag recorded a count for each instance of a line. This count was used as the
measure in the sum-diamond experiments on cube K1. Further, it was necessary to move
the count value to the final field of the fact table in preparation for loading into the MySQL
database—all fact tables were assumed to have the format attr0,attr1,. . . ,attrd,measure.
All commas were replaced by spaces (Command A.5).
Then the fields containing data and the measure were written to a new file (Command A.6).
A.5 Netflix
The Netflix training set was downloaded from http://www.netflixprize.com [Netf 07]. It
comprises 1 770 files in a zipped archive. Each file contains the dates and ratings given by
distinct reviewers to one movie. The first line of the file indicates the movie number and
the remaining data is in CSV format with three columns (reviewer,rating,date).
The files were unzipped and then concatenated, adding a column containing the file num-
ber and moving the rating to the final column. The format of the resulting CSV file was
four columns (movie number, reviewer, date, rating).
A.6 Weather
The weather data set contains 124 million surface synoptic weather reports from land sta-
tions for the 10-year period from December 1981 through November 1991, except January
1982. Each report is 56 characters in length. Files with the format <MONTH><YEAR>L.DAT.Z
were downloaded from http://cdiac.ornl.gov/ftp/ndp026b/ [Hahn 04]. In total 119 data
files were used. The files were unzipped and amalgamated to a single file (Command A.7).
82040100,1,8250,29767,71082,71,8,8,5,5,10,-1,800,900,0,0,8,9,4
82040100,1,8250,29767,71082,71,5,10,-1,8,9,4
Figure A.1: Sample input line from weather_large.csv and its corresponding output line.
A comma-separated file was created by executing a Java application, which retained the
attributes listed in Table A.3 and excluded attributes 7, and 12–16. (The code is available
at http://www.hazel-webb.com/archive.htm.) The resulting file was loaded into a MySQL
database. Figure A.1 gives a sample input line and its corresponding output line.
Table A.3: Weather data: dimensions and cardinality of dimensions for cubes W1 and
W2.
Appendix B
SQL Queries
Tables census_stocks, tweed and weather_large were created by loading the relevant CSV
files into a MySQL database as described in Appendix A. Cubes C2, TW3 and W2 were
extracted from the MySQL database using Listing B.1, Listing B.2 and Listing B.3, re-
spectively. They were also exported in CSV so that they could be used by the other imple-
mentations.
Listing B.2: Extract TWEED cube
INSERT INTO tweed15 (d1, d2, d3, d4, d5, d6, d7, d8, d31, d32, d33, d34, d50, d51, d52, d49)
SELECT d1, d2, d3, d4, d5, d6, d7, d8, d31, d32, d33, d34, d50, d51, d52, sum(d9)
FROM tweed
GROUP BY d1, d2, d3, d4, d5, d6, d7, d8, d31, d32, d33, d34, d50, d51, d52
Part of the Java code implementing Algorithm SQLDD is given in Listing B.4. It assumes
that a MySQL database table with the column names “attr1, attr2, . . . attrd, measure, flag”
is populated with the cube of interest.
 * Sets a delete flag to true when cell is deleted from cube and rebuilds whenever
 * 75% of remaining cells are marked for deletion
 *
 ******************************************************************************/
import java.sql.*;
import java.util.*;

/**
 * process for a k-carat diamond.
 * Sets delete flag to true as it processes.
 **/
public void process() {
    try {
        moreToDelete = true;
        // iterate until we have found the diamond
        while (moreToDelete) {
            moreToDelete = false;
            // loop through dims in single iteration
            for (int dim = 0; dim < d; dim++) {
                if (countAndDelete("attr" + dim, dim, k))
                    moreToDelete = true;
            }
        }
    } catch (Exception e) {}
    // catch exceptions - code omitted
}
/**
 * countAndDelete
 * @param col dimension name
 * @param dim dimension number
 * @param k number of carats
 * table: table name is static
 * queries table for the list of unique attributes and stores in table temp
 * sets delete flag of attributes with count < k
 **/
public static boolean countAndDelete(String col, int dim, int k) {
    String statement = "";
    ResultSet countsRS;
    boolean deletedSome = false;
    try {
        stmt = con.createStatement();
        countStmt = con.createStatement();
        deleteStmt = con.createStatement();
        // ensure we have no leftover data in temp
        stmt.executeUpdate("DELETE FROM temp");
        // get attribute counts and store in temp
        int rows = countStmt.executeUpdate("INSERT INTO temp SELECT " + col
            + " as col, count(" + col + ") as cnt FROM " + table
            + " WHERE flag = false GROUP BY " + col);
        if (rows > 0) {
            // delete attributes with count < k from table
            int del = deleteStmt.executeUpdate("UPDATE " + table + ", temp SET flag = true"
                + " WHERE temp.attr0 = " + table + "." + col + " AND temp.attr1 < " + k);
            if (del > 0) {
                deletedSome = true;
                cellsDeleted = cellsDeleted + del;
                System.out.println("cellsDeleted and del " + cellsDeleted + "\t" + del);
            }
        }
    } catch (Exception e) {
    }
    // catch Exceptions - code omitted
    return deletedSome;
}
We were interested in the effect of other database engines on our algorithms, so we im-
plemented SQL4 on a different test machine. The SQL queries were run against MySQL
version 5.0.41-community and SQL Server 2008 Enterprise edition version 10.0.1600.22
on a Dell 1750 with a 2.8 GHz Intel processor 80532K and 2 GiB RAM. It was running
Windows Server 2003.
We used W3 and NF4, which comprised the first 500 000 cells of NF1. Results are given
in Table B.1. Times in seconds are averaged over 10 runs. We see, for this particular test,
that MySQL is almost twice as fast as SQL Server. Although we cannot make a general
statement that all databases are too slow at computing diamonds, we do see that these two
popular RDBMSs are both slower than the binary Java implementation.
Table B.1: Comparison of SQL Server and MySQL. Times (in seconds) are averaged over
10 runs.
Appendix C
As discussed in Section 6.4 synthetic cubes were generated to test our DCLD heuristic.
Algorithm C.1 was used to generate cube DCLD1. A similar approach with different
weights was used to generate cube DCLD2. Figures C.1 and C.2 list the cells in cubes
DCLD1 and DCLD2, respectively.
prob ← 0.1
for i ∈ 1 . . . 100 do
    for j ∈ 1 . . . d do
        // generate a uniform random number in the range 0-119
        x ← ran(0 − 119)
        if x < 120 ∗ prob then
            // generate a uniform random number in the range 0-2
            chosenj ← ran(0 − 2)
        else
            // 90% of the time we will generate a uniform random number in the range 3-11
            chosenj ← ran(3 − 11)
    // write chosen to cube DCLD1
Algorithm C.1: Generates DCLD1: values 0–2 have a 10% chance of being chosen
in each dimension.
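A Java rendering of Algorithm C.1's loaded die might look as follows; the class name, the seed and the use of java.util.Random are our assumptions, so the cells produced will differ from those in Figure C.1:

```java
import java.util.*;

public class Dcld1Generator {
    // Loaded 12-sided die of Algorithm C.1: values 0-2 share a 10% chance
    // in total (10/3 % each); values 3-11 get 10% each.
    public static int roll(Random ran) {
        int x = ran.nextInt(120);        // uniform in 0-119
        if (x < 12) {                    // 120 * 0.1 = 12, so 10% of the time
            return ran.nextInt(3);       // uniform in 0-2
        }
        return 3 + ran.nextInt(9);       // uniform in 3-11
    }

    public static void main(String[] args) {
        Random ran = new Random(42);     // hypothetical fixed seed
        Set<List<Integer>> cells = new LinkedHashSet<>(); // duplicates removed
        for (int i = 0; i < 100; i++) {
            // one roll per dimension gives the coordinates of a cell (d = 3)
            cells.add(Arrays.asList(roll(ran), roll(ran), roll(ran)));
        }
        System.out.println("distinct cells: " + cells.size());
    }
}
```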
Figure C.1: Cells remaining in DCLD1 after the duplicates were removed.
Figure C.2: Cells remaining in DCLD2 after the duplicates were removed.
Appendix D
Part of the Java preprocessing and diamond building applications are given in Listing D.1
and Listing D.2. The preprocessor is common to both string and binary implementations.
We list DBCount_binary.java and note that the string implementation only differs in the
data type and the data structure used to keep track of counts.
private int d;                    // number of data dims
private String file;              // store input file name
private Vector<Integer>[] data;   // data structure: ragged array of Vectors to store counts for normalized attributes
private Hashtable[] mappedCoords; // ds maps attribute values to their counts
private Hashtable[] mappedNorms;  // ds helps to transform data to normalized integers
private boolean norm = true;      // default -- preprocess for binary data
private boolean hash = true;      // default -- preprocess for string data
public static void main ( String [ ] args ) {
/ / c a l l createHashTables
/∗∗
createHashTables
@param fname S t r i n g o f t h e c s v f a c t t a b l e t o p r o c e s s .
CSV f i l e s h o u l d h a v e no h e a d e r s o r m e t a d a t a .
c r e a t e s d h a s h t a b l e s o f <k e y ( a t t r i b u t e ) , v a l u e ( c o u n t )> p a i r s
c r e a t e s d h a s h t a b l e s o f <k e y ( a t t r i b u t e > , v a l u e ( n o r m v a l u e )> p a i r s
creates array of vectors t h a t store counts for the normalized a t t r i b u t e s
∗∗/
public void createHashTables ( String fname , int d ) {
BufferedReader raw ;
PrintWriter permanent_out ;
/ / i n i t i a l i z e ’ em h a p p e n s r e g a r d l e s s o f o u t p u t c h o i c e . . .
for ( int i = 0 ; i < d ; i++) {
mappedCoords [ i ] = new Hashtable<String , Integer > ( ) ;
mappedNorms [ i ] = new Hashtable<String , Integer > ( ) ;
}
/ / map a l l o f t h e a t t r i b u t e s t o t h e i r c o u n t s
try {
permanent_out = new PrintWriter ( new BufferedWriter ( new
FileWriter ( fname+ "−s t r i n g −t o−norm " ) ) ) ;
raw = new BufferedReader ( new InputStreamReader ( new
FileInputStream ( fname ) ) ) ;
String line = raw . readLine ( ) ;
Integer val = new Integer ( 1 ) ;
Integer [ ] normVal = new Integer [ d ] ;
/ / ds t o s t o r e t h e n o r m a l i z e d v a l u e f o r t h e n e x t key i n each dimension
        // ... (loop over lines and dimensions omitted)
        Integer intVal = (Integer) mappedCoords[k].get(key);
        mappedCoords[k].put(key, intVal + val);
        // increase count of this key in mappedCoords hash
    }
}
if (norm) {
    if (!mappedNorms[k].containsKey(key)) {
        // if it wasn't mapped yet
        // ... (code omitted)
} catch (IOException ie) {
    ie.printStackTrace();
}
}
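The elided body of createHashTables walks each CSV row and, for every dimension, bumps the count of the attribute value seen and assigns a fresh normalized integer the first time a value appears. A minimal in-memory sketch of that mapping step, assuming two string dimensions (class and variable names here are illustrative, not taken from the thesis code):

```java
import java.util.Hashtable;

// Sketch of the normalization step: per dimension, map each distinct
// attribute value to the next unused integer and track its occurrence count.
public class NormalizeSketch {
    public static void main(String[] args) {
        String[][] rows = { {"red", "s1"}, {"blue", "s2"}, {"red", "s2"} };
        int d = 2;
        Hashtable<String, Integer>[] counts = new Hashtable[d];
        Hashtable<String, Integer>[] norms = new Hashtable[d];
        int[] next = new int[d]; // next normalized id per dimension
        for (int i = 0; i < d; i++) {
            counts[i] = new Hashtable<>();
            norms[i] = new Hashtable<>();
        }
        for (String[] row : rows) {
            for (int k = 0; k < d; k++) {
                String key = row[k];
                Integer c = counts[k].get(key);
                counts[k].put(key, c == null ? 1 : c + 1); // bump count
                if (!norms[k].containsKey(key))            // first sighting:
                    norms[k].put(key, next[k]++);          // assign next id
            }
        }
        // "red" occurs twice in dimension 0; "blue" was its second distinct value
        System.out.println(counts[0].get("red") + " " + norms[0].get("blue"));
    }
}
```

With the three sample rows this prints `2 1`: "red" appears in two cells of dimension 0, and "blue" received normalized id 1 after "red" took id 0.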
Listing D.2: Diamond builder for binary data.
/*******************************************************************************
 * DBCount_binary.java
 * input: 1) preprocessed binary file (output of DBCountPreprocessor.java)
 *           format of input file:
 *           number of dims, cardinalities for each dim, data -- all integers
 *        2) k: number of carats
 ******************************************************************************/
import java.util.*;
import java.io.*;
// call process
// call writeDiamond: writes cells remaining in diamond to csv file -- code omitted
// call writeDiamondCounts: writes all attribute values and their current count to file -- code omitted
}
/* process
 * counts for dim i are stored in data(i)
 * a count of zero removes this attr from the diamond
 */
public void process(String inFile, String outFile) {
    try {
        FileInputStream file_input = new FileInputStream(new File(inFile));
        DataInputStream data_in = new DataInputStream(new
            BufferedInputStream(file_input));
        // ... (header-reading code omitted)
        attrs[i] = data_in.readInt();
        // read in a line of the fact table
    }
        del = false;
        // we haven't yet deleted anything
        for (int i = 0; i < d; ++i) {
            // get counts array for dim i
            if (data[i][attrs[i]] == 0) {
                del = true;
                reprocess = true;
                // we deleted something so we will need to reprocess the file
                // this attr is deleted so update others in the row
                for (int j = 0; j < d; ++j) {
                    if (j != i) {
                        if (data[j][attrs[j]] > 0) {
                            data[j][attrs[j]]--;
                            // decrement the count
                            if (data[j][attrs[j]] < K)
                                data[j][attrs[j]] = 0;
                            // this should be deleted on next iter
                        }
                    }
                }
                if (del) break;
                // only delete this row once
            }
        }
        if (!del) {
            for (int v : attrs) {
                data_out.writeInt(v);
                // write this row to new diamond file
            }
        }
    }
} catch (EOFException eof) {
    // do nothing; this terminates the loop
}
data_in.close();
data_out.close();
if (reprocess) {
    if (inFile.charAt(FILE_NAME_LENGTH) == '_')
        (new File(inFile)).delete();
    // delete the old file, except the original data
    inFile = inFile.substring(0, FILE_NAME_LENGTH) + "_" + iters + "B";
    // force a new file to be output each time
    process(outFile, inFile);
    // call process again with newly minted data file
}
} catch (IOException e) {
    System.out.println("IOException = : " + e);
}
}
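Listing D.2 streams the fact table through repeated disk passes, which obscures the underlying iteration. The sketch below shows the same fixed-point computation in memory: a cell is deleted whenever any of its attribute values has a surviving count below k, and deletions decrement the counts of the cell's other attribute values, so the loop repeats until nothing changes. Names and the in-memory representation are illustrative, not the thesis code:

```java
import java.util.ArrayList;
import java.util.List;

// In-memory sketch of k-carat diamond dicing over a binary cube:
// rows.get(r)[i] is the attribute value of cell r in dimension i, and
// count[i][v] is how many surviving cells use value v in dimension i.
public class DiamondSketch {
    static List<int[]> dice(List<int[]> rows, int[] card, int k) {
        int d = card.length;
        int[][] count = new int[d][];
        for (int i = 0; i < d; i++) count[i] = new int[card[i]];
        for (int[] r : rows)
            for (int i = 0; i < d; i++) count[i][r[i]]++;
        boolean changed = true;
        while (changed) {                       // iterate to a fixed point
            changed = false;
            List<int[]> kept = new ArrayList<>();
            for (int[] r : rows) {
                boolean del = false;
                for (int i = 0; i < d; i++)
                    if (count[i][r[i]] < k) { del = true; break; }
                if (del) {                      // drop the cell, update counts
                    changed = true;
                    for (int i = 0; i < d; i++) count[i][r[i]]--;
                } else {
                    kept.add(r);
                }
            }
            rows = kept;
        }
        return rows;
    }

    public static void main(String[] args) {
        List<int[]> cells = new ArrayList<>();
        // 3x3 two-dimensional cube: a dense 2x2 block plus one stray cell
        cells.add(new int[]{0, 0}); cells.add(new int[]{0, 1});
        cells.add(new int[]{1, 0}); cells.add(new int[]{1, 1});
        cells.add(new int[]{2, 2}); // stray: its counts never reach 2
        System.out.println(dice(cells, new int[]{3, 3}, 2).size());
    }
}
```

On the sample cube the 2-carat diamond keeps the four cells of the dense block and discards the stray cell, so this prints `4`. The disk-based listing trades this simplicity for the ability to dice cubes larger than memory, deleting and regenerating intermediate files between passes.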
Appendix E
The κ-diamond for K1 was established as described in Section 5.5.1, and the words remaining in the κ-diamond are given in Tables E.1, E.2, E.3 and E.4. Words that occur in one dimension, but in none of the others, are shown in boldface.
Table E.2: Words remaining in dimension two of κ(K1) diamond.
Table E.4: Words remaining in dimension four of κ(K1) diamond.
Bibliography
[Agra 94] R. Agrawal and R. Srikant. “Fast Algorithms for Mining Association Rules”.
In: J. B. Bocca, M. Jarke, and C. Zaniolo, Eds., VLDB’94, pp. 487–499,
Morgan Kaufmann, 12–15 September 1994.
[Aoui 09a] K. Aouiche, D. Lemire, and R. Godin. “Web 2.0 OLAP: From Data Cubes
to Tag Clouds”. Lecture Notes in Business Information Processing, Vol. 18,
pp. 51–64, 2009.
[Aoui 09b] K. Aouiche, D. Lemire, and R. Godin. Web Information Systems and Tech-
nologies, Chap. Web 2.0 OLAP: From Data Cubes to Tag Clouds, pp. 51–64.
Vol. 18 of Lecture Notes in Business Information Processing, Springer, 2009.
[Babc 03] B. Babcock, S. Chaudhuri, and G. Das. “Dynamic Sample Selection for
Approximate Query Processing”. In: SIGMOD’03, pp. 539–550, 2003.
[Barb 01] D. Barbará and X. Wu. “Finding Dense Clusters in Hyperspace: An Ap-
proach Based on Row Shuffling”. In: Advances in Web-Age Information
Management, pp. 305–316, 2001.
[Ben 06a] R. Ben Messaoud, O. Boussaid, and S. Loudcher Rabaséda. “Efficient Multi-
dimensional Data Representations Based on Multiple Correspondence Anal-
ysis”. In: KDD’06, pp. 662–667, 2006.
[Benn 07] J. Bennett and S. Lanning. “The Netflix Prize”. In: KDD Cup and Workshop
2007, 2007.
[Borz 01] S. Börzsönyi, D. Kossmann, and K. Stocker. “The Skyline Operator”. In:
ICDE ’01, pp. 421–430, IEEE Computer Society, 2001.
[Care 97] M. J. Carey and D. Kossmann. “On Saying “Enough already!” in SQL”. In:
SIGMOD’97, pp. 219–230, 1997.
[Cerf 09] L. Cerf, J. Besson, C. Robardet, and J.-F. Boulicaut. “Closed Patterns Meet
N-ary Relations”. ACM Trans. Knowl. Discov. Data, Vol. 3, No. 1, pp. 1–36,
2009.
[Chak 06] D. Chakrabarti and C. Faloutsos. “Graph Mining: Laws, Generators, and
Algorithms”. ACM Computing Surveys, Vol. 38, pp. 1–69, 2006. Article 2.
[Chen 05] Y. Chen, T. Eavis, F. Dehne, and A. Rau-Chaplin. “PnP: Parallel and External
Memory Iceberg Cube Computation”. In: ICDE’05, pp. 576–577, 2005.
[Codd 70] E. F. Codd. “A Relational Model of Data for Large Shared Data Banks”.
Commun. ACM, Vol. 13, No. 6, pp. 377–387, 1970.
[Corm 05] G. Cormode and S. Muthukrishnan. “What’s Hot and What’s Not: Tracking
Most Frequent Items Dynamically”. ACM Trans. Database Syst., Vol. 30,
No. 1, pp. 249–278, 2005.
[Coze 04] O. Cozette, A. Guermouche, and G. Utard. “Adaptive Paging for a Multi-
frontal Solver”. In: ICS ’04: Proceedings of the 18th annual international
conference on Supercomputing, pp. 267–276, ACM, New York, NY, USA,
2004.
[Dawa 01] M. Dawande, P. Keskinocak, J. M. Swaminathan, and S. Tayur. “On Bipartite
and Multipartite Clique Problems”. Journal of Algorithms, Vol. 41, No. 2,
pp. 388–403, November 2001.
[Dell 09] Dell. Specifications: Hitachi Deskstar P7K500 User’s Guide. 2009.
https://support.dell.com/support/edocs/storage/P160227/specs.htm (Last
checked 06-09-2010).
[Dowe 72] M. Dowell and P. Jarratt. “The ‘Pegasus’ Method for Computing the Root
of an Equation”. BIT Numerical Mathematics, Vol. 12, No. 4, pp. 503–508,
Dec. 1972.
[Enge 07] J. O. Engene. “Five Decades of Terrorism in Europe: The TWEED Dataset”.
Journal of Peace Research, Vol. 44, No. 1, pp. 109–121, 2007.
[Feld 07] R. Feldman and J. Sanger. The Text Mining Handbook. Cambridge University
Press, 2007.
[Feng 04] J. Feng, Q. Fang, and H. Ding. “PrefixCube: Prefix-sharing Condensed Data
Cube”. In: DOLAP ’04, pp. 38–47, ACM Press, New York, NY, USA, 2004.
[Godi 95] R. Godin, R. Missaoui, and H. Alaoui. “Incremental Concept Formation Al-
gorithms Based on Galois (Concept) Lattices”. Computational Intelligence,
Vol. 11, pp. 246–267, 1995.
[Gray 96] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. “Data Cube: A Relational
Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total”.
In: ICDE ’96, pp. 152–159, 1996.
[Hahn 04] C. Hahn, S. Warren, and J. London. “Edited Synoptic Cloud Reports from
Ships and Land Stations over the Globe, 1982–1991”. http://cdiac.ornl.gov/
ftp/ndp026b/ (Last checked 06-09-2010), Jan. 2004.
[Hale 01] A. Y. Halevy. “Answering queries using views: A survey”. The VLDB Jour-
nal, Vol. 10, No. 4, pp. 270–294, 2001.
[Hett 00] S. Hettich and S. D. Bay. “The UCI KDD archive”. http://kdd.ics.uci.
edu (Last checked 06-09-2010), 2000.
[Hipp 00] J. Hipp, U. Güntzer, and G. Nakhaeizadeh. “Algorithms for Association Rule
Mining — a General Survey and Comparison”. SIGKDD Explor. Newsl.,
Vol. 2, No. 1, pp. 58–64, 2000.
[Hirs 05] J. E. Hirsch. “An Index to Quantify an Individual’s Scientific Research Out-
put”. 2005. doi:10.1073/pnas.0507655102 (Last checked 07-29-2009).
[Holl 07] A. L. Holloway, V. Raman, G. Swart, and D. J. DeWitt. “How to Barter Bits
for Chronons: Compression and Bandwidth Trade Offs for Database Scans”.
In: SIGMOD’07, pp. 389–400, 2007.
[Ilya 08] I. F. Ilyas, G. Beskales, and M. A. Soliman. “A Survey of Top-k Query Pro-
cessing Techniques in Relational Database Systems”. ACM Comput. Surv.,
Vol. 40, No. 4, pp. 1–58, 2008.
[ISO 08] ISO 9075-1:2008. Information technology: database languages – SQL – Part
1: Framework. ISO, Geneva, Switzerland, 3rd Ed., 2008.
[Kase 06] O. Kaser, S. Keith, and D. Lemire. “The LitOLAP Project: Data Warehousing
with Literature”. In: CaSTA’06, 2006.
[Kase 07] O. Kaser and D. Lemire. “Tag-Cloud Drawing: Algorithms for Cloud Visu-
alization”. In: WWW 2007 – Tagging and Metadata for Social Information
Organization, 2007.
[Kase 08] O. Kaser, D. Lemire, and K. Aouiche. “Histogram-Aware Sorting for En-
hanced Word-Aligned Compression in Bitmap Indexes”. In: DOLAP ’08,
2008.
[Klug 82] A. Klug. “Equivalence of Relational Algebra and Relational Calculus Query
Languages Having Aggregate Functions”. Journal ACM, Vol. 29, No. 3,
pp. 699–717, 1982.
[Knut 97] D. E. Knuth. Fundamental Algorithms. Vol. 1 of The Art of Computer Pro-
gramming, Addison-Wesley, 1997.
[Kouz 09] A. Kouznetsov, S. Matwin, D. Inkpen, A. H. Razavi, O. Frunza, M. Se-
hatkar, and L. Seaward. Advances in Artificial Intelligence, Chap. Classifying
Biomedical Abstracts Using Committees of Classifiers and Collective Rank-
ing Techniques, pp. 224–228. Vol. 5549/2009 of Lecture Notes in Computer
Science, Springer Berlin / Heidelberg, 2009.
[Kuma 99] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. “Trawling the Web
for Emerging Cyber-communities”. In: WWW ’99, pp. 1481–1493, Elsevier
North-Holland, Inc., New York, NY, USA, 1999.
[Lee 06] S.-L. Lee. “An Effective Algorithm to Extract Dense Sub-cubes from a Large
Sparse Cube”. In: A. M. Tjoa and J. Trujillo, Eds., DaWaK, pp. 155–164,
Springer, 2006.
[Lee 09] S.-W. Lee, B. Moon, and C. Park. “Advances in Flash Memory SSD Technol-
ogy for Enterprise Database Applications”. In: SIGMOD ’09, pp. 863–870,
ACM, New York, NY, USA, 2009.
[Lemi 08] D. Lemire and O. Kaser. “Hierarchical Bin Buffering: Online Local Moments
for Dynamic External Memory Arrays”. ACM Trans. Algorithms, Vol. 4,
No. 1, pp. 1–31, 2008.
[Lemi 09] D. Lemire and O. Kaser. “Reordering Columns for Smaller Indexes”. 2009.
in preparation, available from http://arxiv.org/abs/0909.1346.
[Lemi 10] D. Lemire, O. Kaser, and K. Aouiche. “Sorting Improves Word-aligned
Bitmap Indexes”. Data & Knowledge Engineering, Vol. 69, No. 1, pp. 3–
28, 2010.
[Li 06] C. Li, B. C. Ooi, A. K. H. Tung, and S. Wang. “DADA: a Data Cube for
Dominant Relationship Analysis”. In: SIGMOD’06, pp. 659–670, 2006.
[Lian 05] W. Lian, D. W. Cheung, and S. M. Yiu. “An Efficient Algorithm for Finding
Dense Regions for Mining Quantitative Association Rules”. Computers &
Mathematics with Applications, Vol. 50, No. 3-4, pp. 471–490, August 2005.
[Liu 02] R. X. Liu. Simplifying Parallel Datacube Computation. Master’s thesis,
University of New Brunswick, 2002.
[Loh 02a] Z. X. Loh, T. W. Ling, C. H. Ang, and S. Y. Lee. “Adaptive Method for Range
Top-k Queries in OLAP Data Cubes”. In: DEXA’02, pp. 648–657, 2002.
[Luo 94] Z.-Q. Luo and D. Parnas. “On the Computational Complexity of the Maxi-
mum Trade Problem”. Acta Mathematical Applicatae Sinica (English Series),
Vol. 10, No. 4, pp. 434–440, 1994.
[Nade 03] T. P. E. Nadeau and T. J. E. Teorey. “A Pareto Model for OLAP View Size Es-
timation”. Information Systems Frontiers, Vol. 5, No. 2, pp. 137–147, 2003.
[Newm 05] M. Newman. “Power laws, Pareto Distributions and Zipf’s Law”. Contem-
porary Physics, Vol. 46, No. 5, pp. 323–351, 2005.
[OLAP] OLAP Council, The. “OLAP and OLAP Server Definitions”. http://www.
olapcouncil.org/research/resrchly.htm (Last checked 07-29-2009).
[Pei 05] J. Pei, M. Cho, and D. Cheung. “Cross Table Cubing: Mining Iceberg Cubes
from Data Warehouses”. In: SDM’05, pp. 461–465, 2005.
[Pens 05] R. G. Pensa and J. Boulicaut. “Fault Tolerant Formal Concept Analysis”. In:
AI*IA 2005, pp. 212–233, Springer-Verlag, 2005.
[Port 97] M. F. Porter. “An Algorithm for Suffix Stripping”. In: Readings in informa-
tion retrieval, pp. 313–316, Morgan Kaufmann, 1997.
[Proj 09] Project Gutenberg Literary Archive Foundation. “Project Gutenberg”.
http://www.gutenberg.org/ (Last checked 06-09-2010), 2009.
[Redd 01] P. K. Reddy and M. Kitsuregawa. “An Approach to Relate the Web Commu-
nities through Bipartite Graphs”. In: WISE’01, pp. 302–310, 2001.
[Rumm 70] R. Rummel. Applied Factor Analysis. Northwestern University Press, 1970.
[Suzu 05] A. Suzuki and T. Tokuyama. “Dense Subgraph Problem Revisited”. In: Joint
Workshop “New Horizons in Computing” and “Statistical Mechanical Ap-
proach to Probabilistic Information Processing”, 2005.
[Thom 02] E. Thomsen. OLAP Solutions: Building Multidimensional Information Sys-
tems. Wiley, 2nd Ed., 2002.
[Vitt 98] J. S. Vitter, M. Wang, and B. Iyer. “Data Cube Approximation and His-
tograms via Wavelets”. In: CIKM ’98, pp. 96–104, New York, U.S.A., 1998.
[Wang 02] W. Wang, J. Feng, H. Lu, and J. X. Yu. “Condensed Cube: An Efficient Ap-
proach to Reducing Data Cube Size”. In: ICDE ’02, p. 155, IEEE Computer
Society, Washington, DC, USA, 2002.
[Webb 07] H. Webb. “Properties and Applications of Diamond Cubes”. In: ICSOFT
2007 – Doctoral Consortium, 2007.
[Xin 03] D. Xin, J. Han, X. Li, and B. W. Wah. “Star-Cubing: Computing Iceberg
Cubes by Top-Down and Bottom-Up Integration”. In: VLDB’03, pp. 476–
487, 2003.
[Yang 05] K. Yang. “Information Retrieval on the Web”. Annual Review of Information
Science and Technology, Vol. 39, pp. 33–81, 2005.
[Zhou 07] S. Zhou and G.-Q. Zhang. “Chinese Internet AS-level Topology”. Commu-
nications, IET, Vol. 1, No. 2, pp. 209 – 214, 2007.
CURRICULUM VITÆ
Degrees:
Bachelor of Science in Data Analysis (Computer Science) First Division
University of New Brunswick Saint John 1998–2001
Publications:
Hazel Webb. “Properties and Applications of Diamond Cubes”. Proceedings of ICSOFT,
International Conference on Software and Data Technologies, Barcelona, Spain, July 2007.
Hazel Webb, Owen Kaser and Daniel Lemire. “Pruning Attribute Values from Data Cubes
with Diamond Dicing”. Proceedings of IDEAS’08, Twelfth International Database Engi-
neering and Applications Symposium, Coimbra, Portugal, September 2008.
Conference Presentations:
Hazel Webb and James Stewart. “Parallelizing Existing Code for Shared and Distributed
Memory Architectures Using OpenMP and MPI”. APICS Mathematics/Statistics and Com-
puter Science Joint Conference, St. Francis Xavier University, October 2001.
Hazel Webb. “Interpolation B+ tree: an Efficient File Structure with which to Store and
Search Large Almost Static Data Sets”. APICS Mathematics/Statistics and Computer Sci-
ence Joint Conference, UNB Saint John, October 2004.