Persistence: using compressed bitvectors to implement relations.

Massimo Dentico m.dentico@virgilio.it
Mon, 07 Apr 2003 20:39:58 +0200


USING COMPRESSED BITVECTORS TO IMPLEMENT RELATIONS           (D R A F T)
-- Small steps toward a True Relational DBMS?

This is a note concerning an efficient implementation of relations (as
in the relational data model).

Its purpose is to raise interest and to receive comments, corrections
and possibly some help in writing a somewhat more detailed paper about
it: my English is notoriously bad (my apologies in advance if this
e-mail is difficult to read) and my mathematical background is not
strong.

I also propose that we consider this technique when we experiment with
orthogonal persistence in Tunes.

There is nothing new in this proposal; it is really a change of focus,
an elaboration on index implementation techniques used, for example, to
give SQL-based DBMSs decent performance for data mining/DSS. I propose
to implement relations directly with this technique (and to get rid of
SQL and indexes, of course).

The idea is to reject the concept of record entirely or, equivalently,
as pointed out by Fabian Pascal in, for example, [1]:

  ..  the  real solution  in  the relational  spirit  is to  get  rid of
  physical rows altogether.


A FIRST STEP is to store relation values by attribute (by column): this
is convenient and efficient, as shown by MetaKit [2] and Kdb [3], but it
is not enough.
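
A minimal sketch in Python (hypothetical data, not any particular
system) of what storage by column means: each attribute lives in its own
array, so a scan over one attribute never touches the others.

    # Row-oriented: one record per tuple; a scan of "city" drags along
    # every other field of every record.
    rows = [
        (1, "Alice", "New York"),
        (2, "Bob",   "Boston"),
        (3, "Carol", "New York"),
    ]

    # Column-oriented: one array per attribute; a scan of "city" reads
    # only that array.
    columns = {
        "id":   [1, 2, 3],
        "name": ["Alice", "Bob", "Carol"],
        "city": ["New York", "Boston", "New York"],
    }

    # The positions where city = 'New York', touching only one column.
    positions = [i for i, c in enumerate(columns["city"]) if c == "New York"]
    print(positions)    # [0, 2]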


A SECOND STEP toward this aim is to completely separate the encoding
(representation) of domain values (values of a given type) from relation
values. For example, consider the domain of cities: we store the cities
themselves as character strings of variable length and, separately, in
each relation that has an attribute (column) of domain city, a
"reference" (not a pointer, see below) to these values.

This separation is implicit in the concept of physical and logical data
independence, and an effective use of it can be traced back at least to
[4] (p. 7):

    [..]

    Reference Numbers

    The overall structure of the MADAM system is presented in Figure  1.
    It will be noted that the primary division in the system is  between
    those procedure and data  segments concerned with data  elements and
    those concerned  with relations.  [..] Whenever  a new  data element
    enters the  system, it  is immediately  assigned a  reference number
    which is used for all subsequent operations on that element.

    [..]

This seems to me similar  to some research in "programming  with shapes"
[5], but applied to relations:

    [..]  The  fundamental  basis  for  a  Shapely  computation  is  the
    seperation of shape and data. For example, separating a tree into an
    empty tree and  a list of  the data at  the leaves, or  separating a
    matrix into its dimensions and a list of the entries [..]


A THIRD STEP is to store these attributes as bitmap indexes
(bit-vectors), one for each domain value (for example, each city)
present in the relation (one column is mapped to one or more
bit-vectors, depending on the number of distinct values stored in that
column). As explained in [6]:

    [..]
    1 Introduction

    A bitmap  index is  a bit  string in  which each  bit is mapped to a
    record ID (RID) of a relation. A bit in the  bitmap index is set (to
    1) if the corresponding RID has property P (i.e., the RID represents
    a customer that  lives in New York), and is reset (to 0)  otherwise.
    In typical usage,  the predicate P  is true for  a record if  it has
    the value a for attribute A. One such predicate is associated to one
    bitmap  index  for  each  unique  value  of   the  attribute  A. The
    predicates can be  more complex, for  example bitslice indices  [..]
    and precomputed  complex selection predicates [..].

    [..]
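
A minimal Python illustration (hypothetical code, with uncompressed
bitvectors kept as plain lists of 0/1 for clarity) of how one column
becomes one bitvector per distinct value:

    def bitmap_index(column):
        """Map each distinct value of the column to its bitvector:
        bit i is 1 iff tuple i has that value."""
        index = {}
        for i, value in enumerate(column):
            bits = index.setdefault(value, [0] * len(column))
            bits[i] = 1
        return index

    city_column = ["New York", "Boston", "New York", "Boston", "Boston"]
    index = bitmap_index(city_column)
    print(index["New York"])    # [1, 0, 1, 0, 0]
    print(index["Boston"])      # [0, 1, 0, 1, 1]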

Storing these bitmap indexes (or bit-vectors) in "raw" format is
inefficient, so we store these bitmaps compressed.

Compressed bit-vectors (CBVs) are, on average, short sequences of
integers, possibly only one. For example, the compression technique used
in [7] is a modified form of run-length encoding. Selecting one of these
CBVs corresponds to an SQL SELECT on a relation for a particular
attribute value. If the CBV is strongly compressed, this SELECT amounts
to the extraction of a few integers even if the relation contains
MILLIONS OF TUPLES (rows).
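
The encoding in [7] is more refined; purely as an illustration, a plain
run-length encoding in Python shows why a SELECT on a skewed attribute
can come down to a couple of integers:

    def rle_compress(bits):
        """Compress a 0/1 bitvector into a list of (bit, run_length) pairs."""
        runs = []
        for b in bits:
            if runs and runs[-1][0] == b:
                runs[-1] = (b, runs[-1][1] + 1)
            else:
                runs.append((b, 1))
        return runs

    # The bitvector for city = 'New York' in a relation of a million
    # tuples where almost every tuple is 'New York':
    bits = [1] * 999_999 + [0]
    print(rle_compress(bits))   # [(1, 999999), (0, 1)]

    # "SELECT ... WHERE city = 'New York'" is now just a lookup of this
    # two-run list; the million raw bits are never materialized.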

We can also perform other relational operations, which reduce to bitwise
logic operations on these CBVs, WITHOUT DECOMPRESSING THEM, as shown in
[6] and [7].
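
A sketch of the idea only (not the actual algorithms of [6] or [7]): two
run-length-encoded bitvectors can be ANDed by merging their runs,
without ever expanding them back to raw bits.

    def rle_and(a, b):
        """Bitwise AND of two run-length-encoded bitvectors.

        Each argument is a non-empty list of (bit, run_length) pairs,
        both covering the same total number of tuples.  The raw bits
        are never materialized.
        """
        result = []
        i = j = 0
        rem_a, rem_b = a[0][1], b[0][1]
        while i < len(a) and j < len(b):
            bit = a[i][0] & b[j][0]
            step = min(rem_a, rem_b)
            if result and result[-1][0] == bit:
                result[-1] = (bit, result[-1][1] + step)
            else:
                result.append((bit, step))
            rem_a -= step
            rem_b -= step
            if rem_a == 0:
                i += 1
                if i < len(a):
                    rem_a = a[i][1]
            if rem_b == 0:
                j += 1
                if j < len(b):
                    rem_b = b[j][1]
        return result

    # city = 'Boston' AND status = 'active', over 1,000,000 tuples:
    boston = [(0, 500_000), (1, 500_000)]
    active = [(1, 750_000), (0, 250_000)]
    print(rle_and(boston, active))  # [(0, 500000), (1, 250000), (0, 250000)]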

There are probably variations of this implementation technique that deal
with domains of high cardinality or with relations that have
particularly "strange" distributions of attribute values. The framework
seems to me sufficiently general to encompass these edge cases. This
last remark is, finally, the explanation of my statement that, with this
technique, there is less need for large implementation variations for
relations in the style of AP5 [8] (I was unable to explain my rationale
concisely). In any case, this is a subject for further research.

Note that in [7] CBVs are used to analyze the control flow graph of a
given program, an activity that is usually not perceived as data mining.
To a careful eye this clearly shows not only that relations are
perfectly suitable for dealing with hierarchies and networks (trees and
graphs, common misconceptions notwithstanding), but also that with a
good implementation it is possible to do this efficiently.
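
As a small, hypothetical illustration of that remark: a tree is just a
binary relation (parent, child), and "children of a node" is an ordinary
bitvector selection on the parent attribute.

    # A tree stored as the binary relation (parent, child), by column.
    parent = ["root", "root", "a", "a"]
    child  = ["a",    "b",    "c", "d"]

    # The bitvector for parent = 'a', one bit per tuple of the relation:
    mask = [1 if p == "a" else 0 for p in parent]       # [0, 0, 1, 1]

    # "Children of 'a'" is the child column filtered through that bitvector.
    print([c for c, bit in zip(child, mask) if bit])    # ['c', 'd']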

Be warned, this is only a part of a bigger picture. I am aware that we
need further research if we want to reconcile Tunes with the relational
data model, of which this note covers only one aspect of the
implementation (I hope satisfactorily).

Best regards,

Massimo Dentico



REFERENCES

[1] "More On MultiValue Databases" with Fabian Pascal
    - http://www.dbdebunk.com/mvdb_0306.htm

[2] "Object, schmobject...", Jean-Claude Wippler, 1999, equi4 web  site:
    the  rational behind  MetaKit, the  interesting part  is that  about
    getting rid of records.
    - http://www.equi4.com/old/schmoo.html

[3] "Kdb FAQ", Kx Systems
    - http://www.kx.com/product_info/high-performance-database.htm#q2
    An excerpt:
    "[..] What makes kdb a high performance database?
    The main reason is that kdb tables are "inverted", i.e. the data  in
    each column is  stored together, instead  of in the  row orientation
    used by most rdbms's. The column data is also organized for  optimal
    processing speed. The result is an ultra high performance  database.
    [..]"

[4] "The MacAIMS Data Management System", Robert C. Goldstein, Alois  J.
    Strnad, 1971, MIT LCS Technical Memo 24 (MIT-LCS-TM-024)
    - http://www.lcs.mit.edu/publications/specpub.php?id=23

[5] "The Shape Project"
    - http://linus.socs.uts.edu.au/~shape/shape.html

[6] "Optimizing  Queries  On  Compressed  Bitmaps",  Sihem   Amer-Yahia,
    Theodore Johnson, 2000, The VLDB Journal
    - http://citeseer.nj.nec.com/amer-yahia00optimizing.html

[7] "Boolean logic on compressed bitvectors", Koen Van Damme
    - http://users.pandora.be/koen.vandamme1/c_tools/bitplatforms/article.txt
    and "Compressed Bitvectors"
    - http://users.pandora.be/koen.vandamme1/c_tools/bitplatforms/_index.htm

[8] See the AP5 node on CTO (Cliki Tunes Org, our Wiki) for a short
    explanation of it.
    - http://cliki.tunes.org/ap5