Persistence: using compressed bitvectors to implement relations.
Massimo Dentico
m.dentico@virgilio.it
Mon, 07 Apr 2003 20:39:58 +0200
USING COMPRESSED BITVECTORS TO IMPLEMENT RELATIONS (D R A F T)
-- Small steps toward a True Relational DBMSs?
This is a note concerning an efficient implementation for relations (as
in relational data model).
Its scope is to raise interest, to receive comments, corrections and
possibly some help in writing a little more detailed paper about it:
notoriously my English is bad (my apology in advance if this e-mail is
difficult to read) and my mathematical background is not strong.
I propose also to consider this technique when we will experiment on
orthogonal persistence in Tunes.
There is nothing new in this proposal but it is really a change in
focus, an elaboration on indexes implementation techniques used, for
example, to give SQL-based DBMSs decent performances for data
-mining/DSS. I propose to implement relations directly with it (and get
rid of SQL and indexes of course).
The idea is to totally reject the concept of record or equivalently, as
pointed out by Fabian Pascal in, for example, [1]:
.. the real solution in the relational spirit is to get rid of
physical rows altogether.
A FIRST STEP is to store relation values by attribute (by column): this
is convenient and efficient, as shown by MetaKit [2] and Kdb [3], but is
not enough.
A SECOND STEP to this aim is to completely separate the encoding
(representation) of domain values (values of a given type) from relation
values. For example, consider the domain of cities: we will store these
cities as characters strings of variable length and separately, into
each relation which has an attribute (column) of domain city, a
"reference" (not a pointer, see below) to these values.
This is implicit in the concept of physical and logical independence and
to see an effective use of it is possible to trace back at least to [4]
(p. 7):
[..]
Reference Numbers
The overall structure of the MADAM system is presented in Figure 1.
It will be noted that the primary division in the system is between
those procedure and data segments concerned with data elements and
those concerned with relations. [..] Whenever a new data element
enters the system, it is immediately assigned a reference number
which is used for all subsequent operations on that element.
[..]
This seems to me similar to some research in "programming with shapes"
[5], but applied to relations:
[..] The fundamental basis for a Shapely computation is the
seperation of shape and data. For example, separating a tree into an
empty tree and a list of the data at the leaves, or separating a
matrix into its dimensions and a list of the entries [..]
A THIRD STEP is to store these attributes as bitmap indexes (bit
-vectors), one for each domain value (for example each city) present
into the relation (one column is mapped to one or more bit-vectors,
depending on the number of values stored in such column). As explained
in [6]:
[..]
1 Introduction
A bitmap index is a bit string in which each bit is mapped to a
record ID (RID) of a relation. A bit in the bitmap index is set (to
1) if the corresponding RID has property P (i.e., the RID represents
a customer that lives in New York), and is reset (to 0) otherwise.
In typical usage, the predicate P is true for a record if it has
the value a for attribute A. One such predicate is associated to one
bitmap index for each unique value of the attribute A. The
predicates can be more complex, for example bitslice indices [..]
and precomputed complex selection predicates [..].
[..]
Storing these bitmap indexes (or bit-vectors) in "raw" format is
inefficient so we will store this bitmaps compressed.
Compressed bit-vectors (CBVs) on average are short sequences of
integers, possibly only one. For example, a compression technique used
in [7] is a modified form of run-length encoding. Selecting one of these
CBVs corresponds to a SQL SELECT on a relation for a particular
attribute value. If this CBV is strongly compressed then this SELECT
corresponds to the extraction of few integers even if the relation
contains MILLIONS OF TUPLES (rows).
Also we will perform other relational operations, which are equivalent
to bitwise logic operations on these CBVs, WITHOUT DECOMPRESSING IT, as
shown in [6] and [7].
Probably there are variations of this implementation technique that
deals with domains with big cardinality or relations with particularly
"strange" distributions of values attribute. This framework seems to me
sufficiently general to encompass these edge cases. This last remark is,
finally, the explanation of my statement about minor necessity, with
this technique, of big implementation variations for relations in the
style of AP5 [8] (I was enable to explain my rationale concisely).
Anyway a subject of further research.
Note that in [7] CBVs are used to analyze the control flow graph of a
given program, activity that is usually not perceived as data-mining. At
a careful eye this clearly shows that relations are not only perfectly
suitable to deal with hierarchies and networks (trees and graphs - common misconceptions notwithstanding) but that with a good
implementation is possible to do this efficiently.
Be warn, this is only a part of a big picture. I am aware that we need
further research if we want to conciliate Tunes with the relational data
model, of which this note cover only an aspect of its implementation (I
hope satisfactorily).
Best regards,
Massimo Dentico
REFERENCES
[1] "More On MultiValue Databases" with Fabian Pascal
- http://www.dbdebunk.com/mvdb_0306.htm
[2] "Object, schmobject...", Jean-Claude Wippler, 1999, equi4 web site:
the rational behind MetaKit, the interesting part is that about
getting rid of records.
- http://www.equi4.com/old/schmoo.html
[3] "Kdb FAQ", Kx Systems
- http://www.kx.com/product_info/high-performance-database.htm#q2
An excerpt:
"[..] What makes kdb a high performance database?
The main reason is that kdb tables are "inverted", i.e. the data in
each column is stored together, instead of in the row orientation
used by most rdbms's. The column data is also organized for optimal
processing speed. The result is an ultra high performance database.
[..]"
[4] "The MacAIMS Data Management System", Robert C. Goldstein, Alois J.
Strnad, 1971, MIT LCS Technical Memo 24 (MIT-LCS-TM-024)
- http://www.lcs.mit.edu/publications/specpub.php?id=23
[5] "The Shape Project"
- http://linus.socs.uts.edu.au/~shape/shape.html
[6] "Optimizing Queries On Compressed Bitmaps", Sihem Amer-Yahia,
Theodore Johnson, 2000, The VLDB Journal
- http://citeseer.nj.nec.com/amer-yahia00optimizing.html
[7] "Boolean logic on compressed bitvectors", Koen Van Damme
- http://users.pandora.be/koen.vandamme1/c_tools/bitplatforms/article.txt
and "Compressed Bitvectors"
- http://users.pandora.be/koen.vandamme1/c_tools/bitplatforms/_index.htm
[8] see AP5 node on CTO (Cliki Tunes Org, our Wiki) for a short
explaination of it
- http://cliki.tunes.org/ap5