In defense of Prevalence

Massimo Dentico m.dentico@virgilio.it
Thu Feb 12 16:34:02 2004


"Francois-Rene Rideau" <fare@tunes.org>
on Thursday, February 12, 2004 4:18 PM
wrote:

> This is my take on Prevalence, in response to the disparaging comments
> made by RDBMS pundits, as reported by MAD70 on cto.
> http://cliki.tunes.org/Prevalence

So many misconceptions that I don't know where to start! Well, some words
of warning: I am far from having an appropriate education in data
management in general and the relational data model in particular, but
this needs a reply, the best I can manage in my present circumstances.


> Prevalence can indeed  be seen but  as a cheap  brittle implementation
> trick to achieve  persistence of application  data. But that's  really
> missing the whole point and  the change in perspective that  underlies
> it all (be the first to call it a "paradigm shift" and become a pundit). What is
> interesting about  prevalence is  that it  makes explicit  a different
> (and retrospectively obvious) factoring of the persistence problem and
> its solution. (And  of course, this  factoring is precisely  what I've
> been working on in my thesis.)

If you like, we can call the return to an old *discarded* technique a
"paradigm shift", but this does not change the substance of the problem.


> At the heart of any application is the computational model it tries to
> implement:   the  abstract   state  space   and  the   (I/O-annotated)
> transitions between states. E.g. an abstract state space made of  bank
> accounts status  and exchange  rates, and  transitions being financial
> transactions. E.g. a higher-order code repository, with user  commands
> (within contained  areas) from  the console  or wherever,  and various
> auxiliary  inputs.  E.g.  whatever  you can  think  of,  in  nested or
> interacting ways. This computational model is the very essence of  the
> application,  and  anything  else  is  but  means  to  implement  this
> application.

You are thinking as a blind *application* programmer (I thought that one
of the points of Tunes is precisely to get rid of the "application"
concept as an artificial barrier to the free flow of information); the
disparaging comments come from people expert in the field of *data
management*.

Answer these questions: what is more important to you *as a user*, your
applications or your data? Is it not true that you can change algorithms,
applications, data formats, operating systems and hardware, but what
*really* interests you is your data? This is true not only for a single
user but even more so for entire organizations.

As a programmer you value your programs *as data*: you are not interested
in computations as such but in the outcome of those computations AND in
what is capable of producing those computations, the programs encoded in
some way; both are data. You usually don't acquire "computations"; you
acquire computing resources, data and programs.


> Filesystems and RDBMSes provide low-level or mid-level tools to  which
> you  are  asked  to   explicitly,  manually  map  your   application's
> semantics. ...

This is a big misconception:  File Systems (FSs) and  DBMSs  (Relational
or not) are very different solutions to the problem of persistence.

The concept of DBMS arises where that of FS failed. With applications  +
files for persistence you have:

1. no guarantee that an application can access the data of another
application (or of another version of the same application); yes, there
are *some* standards for data formats, but the reality is that data are
often bound to specific applications, and interoperability is a chimera:
CORBA, DDE/OLE/COM/COM+/DCOM solve only part of the problem. This is
directly related to...

2. no guarantee that an application respects (or is even aware of) the
integrity of your data.

Now a DBMS *is* a refactoring of the old, problematic approach
(applications + FSs), NOT the other way around: a DBMS factors out some
important functionality for data management (the ACID properties, for
example - http://en.wikipedia.org/wiki/ACID) that previously you had to
replicate in every program; if, for lack of education, you ignore such
matters, trouble is most likely.
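
If an example helps, here is a minimal sketch of the point in Python with
its bundled SQLite driver (the table and the figures are invented; SQLite
is used only because it is at hand, certainly not because it is a True
RDBMS):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account"
                 " (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
    conn.commit()

    try:
        # One atomic transaction: it commits on success and is rolled
        # back by the DBMS on any error, wherever the program stops.
        with conn:
            conn.execute("UPDATE account SET balance = balance - 60"
                         " WHERE name = 'alice'")
            raise RuntimeError("simulated crash between the two updates")
            conn.execute("UPDATE account SET balance = balance + 60"
                         " WHERE name = 'bob'")
    except RuntimeError:
        pass

    # The half-done transfer has disappeared; no hand-written recovery
    # code, replicated in every application, was needed.
    print(dict(conn.execute(
        "SELECT name, balance FROM account ORDER BY name")))
    # {'alice': 100, 'bob': 0}

Every program that touches these accounts gets the same guarantee for
free, because it is implemented once, in the DBMS.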

In particular, declarative /integrity constraints/ are a key feature of a
True RDBMS that SQL-based DBMSs don't implement correctly (as I
understand it, the prevalence model completely lacks this concept).
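
A second small sketch, with the same tools and again an invented schema,
of what "declarative" buys you:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs on request
    conn.executescript("""
        CREATE TABLE currency (code TEXT PRIMARY KEY);
        CREATE TABLE account (
            id       INTEGER PRIMARY KEY,
            currency TEXT NOT NULL REFERENCES currency(code),
            balance  INTEGER NOT NULL CHECK (balance >= 0)
        );
        INSERT INTO currency VALUES ('EUR'), ('USD');
    """)

    # The rules are stated once, in the database; *every* application,
    # well-behaved or not, is bound by them.
    for row in [("XYZ", 10),    # unknown currency
                ("EUR", -5)]:   # negative balance
        try:
            conn.execute("INSERT INTO account (currency, balance)"
                         " VALUES (?, ?)", row)
        except sqlite3.IntegrityError as exc:
            print("rejected by the DBMS:", exc)

In the prevalence scheme the equivalent rules live (or are forgotten) in
each application's transaction code, one program at a time.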

As  I have  already pointed  out on  the Tunes  review mailing-list,  in
"DataBase Debunking: clarifications":
http://lists.tunes.org/archives/review/2002-June/000172.html

    This is a key point of a DBMS; as Fabian Pascal explains:

      07/21/2001 - On Data Types and Suneido “DBMS”
      http://www.dbdebunk.com/page/page/622722.htm

      [..] Incidentally,  I happen to think that SQL is a bad  language,
      but unfortunately that's what the industry implemented due to IBM.

      My point  was that  in order  to do  data management,  a DBMS must
      support some data model. A data model has the following components

      · data types
      · structure
      · integrity
      · manipulation

      These are  /database/, /not  application/ functions  and /must/ be
      implemented in the DBMS.

      Since Suneido  does not  implement types,  it does  not have  full
      support of a data model and, therefore, it is not a full DBMS.[..]

    Emphasis in the original. Renouncing only one of these components
    means delegating that component to each specific application, with
    possible redundancies, incompatibilities, conflicts, corruption and
    loss of data, difficulties in data access, etc. -- exactly the
    current situation with file-system-based applications (imagine the
    mess on your hard disk, despite any attempt to keep your data
    organized), which is comparable to the pre-DBMS era (at least 30
    years ago).


> ........... Object  persistence  attempts  at  providing  tools   that
> directly  and  implicitly  map  a subset  of  the  constructs  of your
> programming language, so (assuming  your language runtime and  compile
> -time  were properly  hacked) you  have your  usual tools  to include
> persistence in your application. Well, Prevalence promises persistence
> tailored directly  to the  application's computational  model, without
> requiring a hacked language implementation. The programming  language,
> the data  schema, the  filesystem, the  I/O devices  are all  tools to
> achieve the goal  of building a  computing system. Pundits  of various
> domains may want  everyone to express  their problems in  their domain
> algebra. Prevalence skips these middle-men and focuses on the  essence
> of  application domain:  the state  transitions of  its computational
> model.

This seems appropriate:

A Data Modeler's Bag of Tricks
by William J. Lewis
http://www.dbazine.com/lewis1.html

    [...] All Requirements are Data Requirements

    When I started data modeling,  to most software developers it  was a
    strange,  arcane  kind of  black  art, viewed  with  skepticism, and
    requiring intense lobbying and proselytizing to achieve its reluctant
    adoption within  a development  project. These  days, no development
    project gets very far along without a data model--even if  disguised
    (some might say misappropriated) as an object model.

 >>>While many challenges have been overcome, others remain. Contrary to
    the prevailing wisdom, a data model[*] (and its equivalent  database
    implementation) is  not just  a part  of a  computer application, an
    unfortunate necessity to accommodate this pesky "persistence" thing.
    It is  the foundation  upon which  all of  the processing  is built.
    Processing has to happen to something.<<<

    Data  modelers should  take every  opportunity to  connect with  the
    business analysts and software  developers that they work  with, and
    get them  engaged in  data modeling  tasks at  the very beginning of
    every project. Don't wait for them to come to you!

        "Show me your  flowcharts and conceal  your tables, and  I shall
        continue  to be  mystified. Show  me your  tables, and  I won't
        usually need your flowcharts; they'll be obvious."
        Frederick  P.  Brooks  Jr., The  Mythical  Man-month:  Essays on
        Software Engineering

[*] probably here he means a  conceptual or business model: "a model  of
the persistent data of some  /particular/ user or organization"; a  data
model  is  "a  general  theory  of  data  via  which enterprise-specific
conceptual  models are  mapped to  logical models  e.g. relational  data
model", see:

SOMETHING TO CALL ONE'S OWN
by Fabian Pascal
http://www.dbdebunk.com/page/page/622537.htm


> Note how this view of the world is neutral to data representation:  if
> your  application  domain  is  indeed  but  a  data  repository   with
> completely arbitrary unrelated modifications being done on it, then  a
> data schema will indeed  be the best way  to model it; but  even then,
> Prevalence is  a good  implementation technique  to persist  your data
> robustly. In this  narrow case, a  journal of modifications  since day
> one  may  be  a poor  way  to  encode the  state  of  the system,  but
> Prevalence doesn't  mandate the  journal as  being the  only and  main
> representation -- it only provides  journal since last full dump  as a
> way to robustly recover the latest  state, in a way orthogonal to  the
> means to dump and restore memory.

> ................................. And in any case, note that there are
> many potential algebras  to describe your  data; despite what  pundits
> say, a relational model needn't be the right one -- few people  encode
> the structure  of 3D  objects or  music samples  in relational tables.

This merits a separate comment: so you argue that because ".. few people
encode the structure of 3D objects or music samples in relational
tables", the relational data model "needn't be the right one" [sic]?
Well, that's science! What are the better alternatives? Hierarchical or
network structures (graph theory)? When is a data model "the right one"?

The XML Bug
by Fabian Pascal
http://www.dbazine.com/pascal4.html

    [...] Like most trade journalists, who have no understanding of data
    fundamentals, he is  unaware that XML  DBMSs are a  throwback to the
    old  hierarchic  DBMSs,   predominant  decades  ago   and  discarded
    precisely because  they were  complex, inflexible  and difficult for
    application development,  the exact  opposite of  what is  now being
    claimed for their XML reincarnation:

    Note: Here's an excerpt from the manual for IBM's old IMS hierarchic
    DBMS (brought to my attention by Chris Date): "Logically deleting  a
    logical child prevents further access to the logical child using its
    logical parent. Unidirectional logical child segments are assumed to
    be  logically  deleted.  A logical  parent  is  considered logically
    deleted when all  its logical children  are physically deleted.  For
    physically paired logical  relationships, the physical  child paired
    to the  logical child  must also  be physically  deleted before  the
    logical parent is considered logically deleted."

    [...]

    The hierarchic approach underlying XML does, in fact, have a  formal
    foundation:  graph  theory.  But   as  the  XML  1.0   specification
    explicitly states, it does not adhere to the theory. The reason
    is the  same as  that for  which old  hierarchic database management
    (e.g., IBM's IMS) eschewed the theory too: it is extremely complex.

    What is more, the real  world is not thoroughly hierarchic,  but the
    hierarchic approach  can handle  only hierarchies.  This means  that
    hierarchy  and  its complexity  must  be hoisted  upon  any and  all
    database representations, whether justified (e.g. organizational, or
    bill-of-material structures)  or, as  is more  often the  case, not.
    Since the  relational approach  can handle  both non-hierarchic  and
    hierarchic data in a formal and  much simpler way (see Chapter 7  in
    my Practical  Issues in  Database Management),  what exactly  is the
    advantage of the hierarchic approach underlying XML?

    [...]
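
For the record, representing a hierarchy relationally is not exotic at
all. Here is a tiny sketch (Python with SQLite again; the bill-of-materials
is invented, and WITH RECURSIVE is plain SQL supported by current engines):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE part (
            name   TEXT PRIMARY KEY,
            parent TEXT REFERENCES part(name)  -- NULL for the root assembly
        );
        INSERT INTO part VALUES
            ('bicycle', NULL),
            ('wheel',   'bicycle'),
            ('frame',   'bicycle'),
            ('spoke',   'wheel'),
            ('rim',     'wheel');
    """)

    # Every component of 'bicycle', direct or indirect, with its depth.
    rows = conn.execute("""
        WITH RECURSIVE component(name, depth) AS (
            SELECT name, 0 FROM part WHERE name = 'bicycle'
            UNION ALL
            SELECT part.name, component.depth + 1
            FROM part JOIN component ON part.parent = component.name
        )
        SELECT name, depth FROM component ORDER BY depth, name
    """).fetchall()

    for name, depth in rows:
        print("  " * depth + name)

The same table, unchanged, also answers the reverse question ("in which
assemblies is a spoke used?") and plenty of non-hierarchic questions,
which is exactly what the hierarchic structures criticized above cannot
do gracefully.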

About problems with hierarchical and network DBMSs see:
A Relational Model of Data for Large Shared Data Banks
by E. F. Codd

http://www.acm.org/classics/nov95/

in particular "1.2 Data Dependence in Present Systems":
   1.2.1. Ordering Dependence
   1.2.2. Indexing Dependence
   1.2.3. Access Path Dependence
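
A small illustration of 1.2.2 and 1.2.3 above (Python and SQLite once
more, with an invented table): the application's query does not change
when an access path is added, because the index is a purely physical
matter.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee"
                 " (id INTEGER PRIMARY KEY, dept TEXT, name TEXT)")
    conn.executemany("INSERT INTO employee (dept, name) VALUES (?, ?)",
                     [("sales", "Ann"), ("sales", "Bo"), ("lab", "Cy")])

    QUERY = "SELECT name FROM employee WHERE dept = ? ORDER BY name"
    before = conn.execute(QUERY, ("sales",)).fetchall()

    # A DBA adds an index later, purely for performance ...
    conn.execute("CREATE INDEX employee_dept ON employee (dept)")

    # ... and the very same query still works and returns the same rows.
    after = conn.execute(QUERY, ("sales",)).fetchall()
    assert before == after
    print(after)    # [('Ann',), ('Bo',)]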

Also about the subject of modelling:

MODELS, MODELS, EVERYWHERE, NOR ANY TIME TO THINK
by Chris Date
http://www.dbdebunk.com/page/page/622923.htm


> With  prevalence, you  don't have  to fit  your schema  to a  specific
> algebra  for  which  robust persistence  was  implemented,  you choose
> whichever representation algebra  you wish --  and maybe even  several
> different  representations  on each  mirror  so as  to  accomodate for
> different kinds  of queries.  And no,  it doesn't  have to  be an  "In
> Memory" representation; it could be file-based, too. It just has to be
> a one isolated world kept coherent from meddling through anything  but
> journalled transactions.

I suggest reading:

"COMPLEX" DATA TYPES: WHAT OBJECT PROPONENTS DON'T TELL YOU
http://www.pgro.uk7.net/fp1a.htm
(Note: this points to the old site; there is no copy on the new site)

    The Problem

    The data  type concept  is one  of the  concepts least understood by
    database practitioners. ***This is both a cause and a consequence of
    the  failure  by  SQL and  its  commercially  available dialects  to
    implement relational domains,  which are nothing  but data types  of
    arbitrary  complexity.*** It is for this reason  that  proponents of
    object  orientation  can  claim with  impunity  that  the relational
    approach  doesn't   support  so-called   "complex"  data   types  or
    "unstructured" data, and therefore object DBMSs (ODBMS) are superior
    to relational  DBMSs (RDBMS).  This article  explains the  data type
    concept, what DBMS support of data types means, and the  distinction
    between "simple" and "complex" data types.

    [...]

Emphasis mine.
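
To make the idea concrete, a toy sketch (Python again; the Point class
and the table are invented, and Python's sqlite3 adapters are only a
driver-level workaround, not the true relational domain support whose
absence Pascal complains about):

    import sqlite3
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Point:
        x: float
        y: float

    # Tell the driver how to store a Point and how to rebuild one later.
    sqlite3.register_adapter(Point, lambda p: f"{p.x};{p.y}")
    sqlite3.register_converter(
        "point", lambda b: Point(*map(float, b.decode().split(";"))))

    conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
    conn.execute("CREATE TABLE landmark (name TEXT PRIMARY KEY,"
                 " location point)")
    conn.execute("INSERT INTO landmark VALUES (?, ?)",
                 ("origin", Point(0.0, 0.0)))
    conn.execute("INSERT INTO landmark VALUES (?, ?)",
                 ("lab", Point(3.0, 4.0)))

    # A value of a "complex" type sits in an ordinary column of an
    # ordinary relation and comes back as a first-class object.
    for name, location in conn.execute("SELECT name, location FROM landmark"):
        print(name, location)   # e.g. "lab Point(x=3.0, y=4.0)"

Nothing in the relational model forbids values of arbitrary complexity;
what it requires is that they behave as values of some type (domain).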

And also:

Persistence Not Orthogonal to Type
by C.J.Date
(http://www.dbpd.com/vault/9810/date.html)

    [..]

    POTT [Persistence Orthogonal To Type] Violates Data Independence

    [..]

    - The object model says we can put anything we like in the  database
    (any  data  structure  we  can  create  with  the  usual programming
    language mechanisms).

    - The  relational model  effectively says  the same  thing--but then
    goes on to insist that whatever we do put there be presented to  the
    user in pure relational form.

    More precisely,  the relational  model, quite  rightly, says nothing
    about what can be physically stored. It therefore imposes no  limits
    on what data structures  are allowed  at the physical level;  ***the
    only requirement is that  whatever structures are physically  stored
    must be mapped to relations at the logical level and be hidden  from
    the user***. Relational systems make a clear distinction between
    logical  and  physical   (that  is,  between   the  model  and   its
    implementation), while object systems don't.

    One  consequence  of  this  state of  affairs  is  that,  as already
    claimed--but contrary to  conventional wisdom--object systems  might
    very well provide less data independence than relational systems do.
    For example, suppose the  implementation in some object  database of
    the  object  EX  mentioned  earlier  (denoting  the  collection   of
    employees  in a  given department)  is changed  from an  array to  a
    linked  list.  What  are the  implications  for  existing code  that
    accesses that object EX? It breaks.

    [..]

    POTT Causes Additional Complexity

    It should be obvious that  POTT does lead to additional  complexity-
    -and  by "complexity"  here I  mean, primarily,  complexity for  the
    user, although life  does get more  complex for the  system too. For
    example, the  relational model  supports just  one "collection  type
    generator,"  RELATION,  together  with  a  set  of  operators--join,
    project, and so forth--that apply to all "collections" of that  type
    (in other words, to all relations). In contrast, the ODMG  proposals
    support four collection type generators, SET, BAG, LIST, and  ARRAY,
    each with a set  of operators that apply  to all collections of  the
    type in  question. And  I would  argue that  the ODMG  operators are
    simultaneously more complicated and less powerful than the analogous
    relational ones.

    [..]

Again, emphasis mine. "POTT Causes Additional Complexity" reminds me of:

"Epigrams in Programming" by Alan J. Perlis
http://www.cs.yale.edu/homes/perlis-alan/quotes.html

    9. It is better to have 100 functions operate on one data  structure
    than 10 functions on 10 data structures.
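
In the relational case Perlis's epigram is almost literal. For
illustration only, with relations crudely modelled as lists of
attribute-to-value dictionaries (a sketch, of course, not a DBMS):

    def project(relation, attrs):
        """Keep only the named attributes, dropping duplicate tuples."""
        result = []
        for row in relation:
            shrunk = {a: row[a] for a in attrs}
            if shrunk not in result:
                result.append(shrunk)
        return result

    def natural_join(r, s):
        """Join on all attributes the two relations have in common."""
        return [{**a, **b} for a in r for b in s
                if all(a[c] == b[c] for c in set(a) & set(b))]

    # The same two operators, written once, serve *any* relations.
    employee = [{"emp": "Ann", "dept": "sales"},
                {"emp": "Cy",  "dept": "lab"}]
    dept     = [{"dept": "sales", "city": "Rome"},
                {"dept": "lab",   "city": "Pisa"}]

    print(project(employee, ["dept"]))
    # [{'dept': 'sales'}, {'dept': 'lab'}]
    print(project(natural_join(employee, dept), ["emp", "city"]))
    # [{'emp': 'Ann', 'city': 'Rome'}, {'emp': 'Cy', 'city': 'Pisa'}]

With one "collection type generator" there is no SET/BAG/LIST/ARRAY
proliferation of the kind Date describes above, and every new relation
gets the whole algebra for free.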



> But more importantly,  Prevalence makes you  focus on the  dynamics of
> the application. It makes  you think about the  atomic transformations
> of  your  system.  You  may  have  many  representations  and   change
> representations  with  time;  prevalence makes  you  realize  that the
> lingua franca between all  these possible implementations will  be the
> very natural  language of  your system  -- the  language of  its state
> transitions.  And  incidentally,  when  you  make  formal  proofs   of
> declarative properties about a system (as I have done in the past  for
> security invariants),  this is  precisely the  kind of  things you do:
> consider the  possible transitions  of the  system. Prevalence  is all
> about realizing that the persistence  of application data, as well  as
> about anything else in an  application, is better factored around  the
> abstract  computation model  of the  application than  around the  low
> -level operations  of data-storage  (be it  files or  tables) or other
> computing tricks.

WHAT?? A general, mathematical theory of data is certainly NOT related to
"low-level operations of data-storage" nor to "other computing tricks".
You are describing prevalence here, not the relational data model. Of
course you suffer from "The Logical-Physical Confusion":

The Logical-Physical Confusion
by Fabian Pascal
http://www.inconcept.com/JCM/May2002/pascal.html

    [...]  A  confusion  between logical  and  physical  levels of  data
    representation has always plagued the database field. It usually  is
    implicit  in  most  of what  is  said  and done  in  the  field and,
    therefore, not readily discerned by the uneducated eye -- read: most
    of the industry [...]



> As for the  brittleness of using  Prevalence through simple  libraries
> such as Prevayler or Common Lisp Prevalence that rely on your manually
> matching your journal structure  to meaningful operations, well,  mind
> that the coherence can be enforced with a simple discipline, and  that
> most importantly, this discipline can be largely automated -- which is
> what  object  databases  do,  in  a  way.  Actually,  the  essence  of
> Prevalence  being  but  the factoring  of  robust  persistence through
> journalling,  we  can  see all  (or  only  most?) robustly  persistent
> implementations of any model  as using prevalence internally,  only in
> an ad-hoc  way fit  to said  model. Such  software infrastructure  may
> relieve developers from  having to manually  enforce coherence of  the
> persistence layers, but then again,  it imposes upon them to  manually
> enforce  coherence  of  the  mapping  of  their  application  to   the
> persistence layer; this  is but a  displacement, which might  or might
> not be a net gain, but at least isn't the clear-cut  gain-without-cost
> that specific-persistence-method pundits would have us believe.
>
> As for applicability of Prevalence ideas to TUNES, I think that the
> factoring brought by Prevalence is an essential conceptual tool, and a
> nice  opportunity  to  use  reflection.  Reflection  can  be  used  to
> automatically  enforce coherence  of the  journal structure  with the
> application structure, while making  things explicit at one  place and
> then implicit the  rest of the  time. My bet  is that Prevalence  plus
> reflection  can  achieve  cheaply  what  expensive  object   databases
> provide, in a way that is tailored to an application. Reflection is  a
> tool that allows to take  advantage of a good meta-level  factoring of
> software. Prevalence is part of such a factoring.
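
Before replying, let us make the object of discussion concrete. Stripped
to its essentials, the "discipline" in question amounts to something like
the following (a deliberately minimal Python sketch with invented names;
it is not the actual API of Prevayler or of Common Lisp Prevalence):

    import os
    import pickle
    import tempfile

    class Journal:
        """Keep all state in memory; log every command before applying it."""

        def __init__(self, logfile):
            self.state = {}                  # the "prevalent system"
            self.logfile = logfile
            self.log = open(logfile, "ab")

        def execute(self, command, *args):
            pickle.dump((command.__name__, args), self.log)  # journal first,
            self.log.flush()
            return command(self.state, *args)                # then apply

        def replay(self, commands):
            """Rebuild the state after a crash by re-running the journal."""
            with open(self.logfile, "rb") as f:
                while True:
                    try:
                        name, args = pickle.load(f)
                    except EOFError:
                        break
                    commands[name](self.state, *args)

    # One "transaction"; note that *it* carries any integrity rule, or none.
    def deposit(state, account, amount):
        state[account] = state.get(account, 0) + amount

    logpath = os.path.join(tempfile.mkdtemp(), "journal.log")
    system = Journal(logpath)
    system.execute(deposit, "alice", 100)
    system.execute(deposit, "alice", -300)   # nothing forbids going negative

    # After a "crash", a fresh process recovers the state from the journal.
    recovered = Journal(logpath)
    recovered.replay({"deposit": deposit})
    print(recovered.state)    # {'alice': -200}

Nothing in this scheme knows about data types, structure, integrity or a
manipulation language: every command must re-implement whatever rules it
cares about, one application at a time, which is precisely the pre-DBMS
situation described at the beginning of this message.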

Again the logical-physical confusion, and again this certainty that to
use the relational data model you need a "mapping" from an "internal"
representation (at the programming-language level) to an "external"
representation (at the persistent-store level), the (in-)famous
"impedance mismatch". Well, that is just an artifact of current
implementations.

In their book "The Third Manifesto", Hugh Darwen and C.J. Date show
another approach: a complete integration of the relational data model
with a theory of types, including a proper concept of "inheritance" (IMO
an inappropriate choice of term, too confusing; "subtyping" is better).
This approach is embodied in a *pedagogical* programming language,
Tutorial D.

In SIGMOD Record Volume 24, Number 1, March 1995, there is a succinct
preview of the book's content:

http://www.acm.org/sigmod/record/issues/9503/

o directly as postscript:

http://www.acm.org/sigmod/record/issues/9503/manifesto.ps

There is also a commercial product inspired by Tutorial D, Dataphor
(though at present they don't have a True RDBMS, so they automatically
map to available SQL DBMSs).

http://dataphor.com/

I cannot vouch for this product because I have never used it, so take
this as mere evidence of the feasibility of the approach.

Best regards.

--
Massimo Dentico