In defense of Prevalence
Massimo Dentico
m.dentico@virgilio.it
Thu Feb 12 16:34:02 2004
"Francois-Rene Rideau" <fare@tunes.org>
on Thursday, February 12, 2004 4:18 PM
wrote:
> This is my take on Prevalence, in response to the disparaging comments
> made by RDBMS pundits, as reported by MAD70 on cto.
> http://cliki.tunes.org/Prevalence
So many misconceptions that I don't know where to start! Well, a word
of warning: I am far from properly educated about data management in
general and the relational data model in particular, but this needs a
reply, as best I can manage in my present circumstances.
> Prevalence can indeed be seen but as a cheap brittle implementation
> trick to achieve persistence of application data. But that's really
> missing the whole point and the change in perspective that underlies
> it all (be first to call it a "paradigm shift" and become a pundit). What is
> interesting about prevalence is that it makes explicit a different
> (and retrospectively obvious) factoring of the persistence problem and
> its solution. (And of course, this factoring is precisely what I've
> been working on in my thesis.)
If you like, we can call the return to an old, *discarded* technique a
"paradigm shift", but that does not change the substance of the
problem.
> At the heart of any application is the computational model it tries to
> implement: the abstract state space and the (I/O-annotated)
> transitions between states. E.g. an abstract state space made of bank
> accounts status and exchange rates, and transitions being financial
> transactions. E.g. a higher-order code repository, with user commands
> (within contained areas) from the console or wherever, and various
> auxiliary inputs. E.g. whatever you can think of, in nested or
> interacting ways. This computational model is the very essence of the
> application, and anything else is but means to implement this
> application.
You are thinking as a blind *application* programmer (I thought that one
of the points of Tunes was precisely to get rid of the "application"
concept as an artificial barrier to the free flow of information); the
disparaging comments come from people expert in the field of *data
management*.
Answer these questions: what is more important to you *as a user*, your
applications or your data? Is it not true that you can change
algorithms, applications, data formats, operating systems and hardware,
but what *really* matters to you is your data? This is true not only
for a single user but even more so for entire organizations.
As a programmer you value your programs *as data*: you are not
interested in computations as such but in the outcome of those
computations AND in what is capable of producing them, the programs
encoded in some way - both are data. You usually don't acquire
"computations"; you acquire computing resources, data and programs.
> Filesystems and RDBMSes provide low-level or mid-level tools to which
> you are asked to explicitly, manually map your application's
> semantics. ...
This is a big misconception: File Systems (FSs) and DBMSs (Relational
or not) are very different solutions to the problem of persistence.
The concept of a DBMS arose where that of the FS failed. With
applications + files for persistence you have:
1. no guarantee that an application can access the data of another
application (or of another version of the same application); yes, there
are *some* standards for data formats, but the reality is that data are
often bound to specific applications, and interoperability is a
chimera: CORBA, DDE/OLE/COM/COM+/DCOM solve only part of the problem.
This is directly related to...
2. no guarantee that an application respects (or is even aware of) the
integrity of your data.
Now a DBMS *is* a refactoring of the old, problematic approach
(applications + FSs), NOT the other way around: a DBMS factors out some
important functionality for data management (the ACID properties, for
example - http://en.wikipedia.org/wiki/ACID) that previously had to be
replicated in every program; if, for lack of education, you ignore such
matters, trouble is most likely.
In particular, declarative /integrity constraints/ are a key feature of
a True RDBMS that SQL-based DBMSs don't implement correctly (as I
understand it, the prevalence model completely lacks this concept).
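To make "declarative integrity constraint" concrete, here is a minimal
sketch of my own (plain SQL issued through JDBC; the account table, the
constraint name and the in-memory H2 database are merely illustrative
assumptions): the rule is declared once, in the schema, and the DBMS
enforces it against *every* application that touches the data, in
whatever language that application is written.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class DeclarativeConstraint {
        public static void main(String[] args) throws SQLException {
            // Assumes the H2 driver on the classpath; any SQL DBMS with
            // CHECK constraints would do.
            try (Connection con = DriverManager.getConnection("jdbc:h2:mem:bank");
                 Statement st = con.createStatement()) {

                // The integrity rule is *declared* once, in the schema:
                st.execute("CREATE TABLE account (" +
                           "  id      INTEGER PRIMARY KEY," +
                           "  balance DECIMAL(12,2) NOT NULL," +
                           "  CONSTRAINT balance_non_negative CHECK (balance >= 0))");
                st.execute("INSERT INTO account VALUES (1, 100.00)");

                // Every application is stopped by the DBMS itself; no
                // per-program checking code to forget or get wrong.
                try {
                    st.execute("UPDATE account SET balance = -50 WHERE id = 1");
                } catch (SQLException violation) {
                    System.out.println("Rejected by the DBMS: "
                                       + violation.getMessage());
                }
            }
        }
    }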
As I have already pointed out on the Tunes review mailing-list, in
"DataBase Debunking: clarifications":
http://lists.tunes.org/archives/review/2002-June/000172.html
This is a key point of a DBMS; as Fabian Pascal explains:
07/21/2001 - On Data Types and Suneido “DBMS”
http://www.dbdebunk.com/page/page/622722.htm
[..] Incidentally, I happen to think that SQL is a bad language,
but unfortunately that's what the industry implemented due to IBM.
My point was that in order to do data management, a DBMS must
support some data model. A data model has the following components
· data types
· structure
· integrity
· manipulation
These are /database/, /not application/ functions and /must/ be
implemented in the DBMS.
Since Suneido does not implement types, it does not have full
support of a data model and, therefore, it is not a full DBMS.[..]
Emphasis in the original. Renouncing even one of these components
means delegating that component to each specific application, with
possible redundancies, incompatibilities, conflicts, corruption and
loss of data, difficulties in data access, etc.: exactly the
current situation with file-system-based applications (imagine the
mess on your hard disk, despite any attempt to keep your data
organized), which is comparable to the pre-DBMS era (at least 30
years ago).
> ........... Object persistence attempts at providing tools that
> directly and implicitly map a subset of the constructs of your
> programming language, so (assuming your language runtime and
> compile-time were properly hacked) you have your usual tools to include
> persistence in your application. Well, Prevalence promises persistence
> tailored directly to the application's computational model, without
> requiring a hacked language implementation. The programming language,
> the data schema, the filesystem, the I/O devices are all tools to
> achieve the goal of building a computing system. Pundits of various
> domains may want everyone to express their problems in their domain
> algebra. Prevalence skips these middle-men and focuses on the essence
> of application domain: the state transitions of its computational
> model.
This seems appropriate:
A Data Modeler's Bag of Tricks
by William J. Lewis
http://www.dbazine.com/lewis1.html
[...] All Requirements are Data Requirements
When I started data modeling, to most software developers it was a
strange, arcane kind of black art, viewed with skepticism, and
requiring intense lobbying and proselytizing to achieve its reluctant
adoption within a development project. These days, no development
project gets very far along without a data model--even if disguised
(some might say misappropriated) as an object model.
>>>While many challenges have been overcome, others remain. Contrary to
the prevailing wisdom, a data model[*] (and its equivalent database
implementation) is not just a part of a computer application, an
unfortunate necessity to accommodate this pesky "persistence" thing.
It is the foundation upon which all of the processing is built.
Processing has to happen to something.<<<
Data modelers should take every opportunity to connect with the
business analysts and software developers that they work with, and
get them engaged in data modeling tasks at the very beginning of
every project. Don't wait for them to come to you!
"Show me your flowcharts and conceal your tables, and I shall
continue to be mystified. Show me your tables, and I won't
usually need your flowcharts; they'll be obvious."
Frederick P. Brooks Jr., The Mythical Man-month: Essays on
Software Engineering
[*] He probably means here a conceptual or business model: "a model of
the persistent data of some /particular/ user or organization"; a data
model is "a general theory of data via which enterprise-specific
conceptual models are mapped to logical models e.g. relational data
model", see:
SOMETHING TO CALL ONE'S OWN
by Fabian Pascal
http://www.dbdebunk.com/page/page/622537.htm
> Note how this view of the world is neutral to data representation: if
> your application domain is indeed but a data repository with
> completely arbitrary unrelated modifications being done on it, then a
> data schema will indeed be the best way to model it; but even then,
> Prevalence is a good implementation technique to persist your data
> robustly. In this narrow case, a journal of modifications since day
> one may be a poor way to encode the state of the system, but
> Prevalence doesn't mandate the journal as being the only and main
> representation -- it only provides journal since last full dump as a
> way to robustly recover the latest state, in a way orthogonal to the
> means to dump and restore memory.
> ................................. And in any case, note that there are
> many potential algebras to describe your data; despite what pundits
> say, a relational model needn't be the right one -- few people encode
> the structure of 3D objects or music samples in relational tables.
This merits a separate comment: so you argue that, because ".. few
people encode the structure of 3D objects or music samples in
relational tables", the relational data model "needn't be the right
one [sic]"? Well, that's science! What are the better alternatives?
Hierarchical or network structures (graph theory)? When is a data model
"the right one"?
The XML Bug
by Fabian Pascal
http://www.dbazine.com/pascal4.html
[...] Like most trade journalists, who have no understanding of data
fundamentals, he is unaware that XML DBMSs are a throwback to the
old hierarchic DBMSs, predominant decades ago and discarded
precisely because they were complex, inflexible and difficult for
application development, the exact opposite of what is now being
claimed for their XML reincarnation:
Note: Here's an excerpt from the manual for IBM's old IMS hierarchic
DBMS (brought to my attention by Chris Date): "Logically deleting a
logical child prevents further access to the logical child using its
logical parent. Unidirectional logical child segments are assumed to
be logically deleted. A logical parent is considered logically
deleted when all its logical children are physically deleted. For
physically paired logical relationships, the physical child paired
to the logical child must also be physically deleted before the
logical parent is considered logically deleted."
[...]
The hierarchic approach underlying XML does, in fact, have a formal
foundation: graph theory. But as the XML 1.0 specification
explicitly states, it does not adhere to the theory. The reason
is the same as that for which old hierarchic database management
(e.g., IBM's IMS) eschewed the theory too: it is extremely complex.
What is more, the real world is not thoroughly hierarchic, but the
hierarchic approach can handle only hierarchies. This means that
hierarchy and its complexity must be foisted upon any and all
database representations, whether justified (e.g. organizational, or
bill-of-material structures) or, as is more often the case, not.
Since the relational approach can handle both non-hierarchic and
hierarchic data in a formal and much simpler way (see Chapter 7 in
my Practical Issues in Database Management), what exactly is the
advantage of the hierarchic approach underlying XML?
[...]
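Pascal's remark that the relational approach handles hierarchic data
in a simpler way is easy to illustrate; a minimal sketch of my own (the
bill-of-materials table and all names are invented, SQL through JDBC
again, assuming an in-memory H2 database): the parent/child link is an
ordinary attribute, not a physical access path as in IMS, and a query
walks the hierarchy only when asked to.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class RelationalHierarchy {
        public static void main(String[] args) throws SQLException {
            try (Connection con = DriverManager.getConnection("jdbc:h2:mem:bom");
                 Statement st = con.createStatement()) {

                // A bill of materials: the hierarchy is just data.
                st.execute("CREATE TABLE part (" +
                           "  id        INTEGER PRIMARY KEY," +
                           "  name      VARCHAR(40) NOT NULL," +
                           "  parent_id INTEGER REFERENCES part(id))");
                st.execute("INSERT INTO part VALUES (1, 'bicycle', NULL)," +
                           " (2, 'wheel', 1), (3, 'spoke', 2), (4, 'frame', 1)");

                // Walk the hierarchy with a recursive query (SQL:1999);
                // non-hierarchic queries on the same table need no traversal.
                ResultSet rs = st.executeQuery(
                    "WITH RECURSIVE sub(id, name) AS (" +
                    "  SELECT id, name FROM part WHERE name = 'wheel'" +
                    "  UNION ALL" +
                    "  SELECT p.id, p.name FROM part p" +
                    "    JOIN sub s ON p.parent_id = s.id)" +
                    " SELECT name FROM sub");
                while (rs.next())
                    System.out.println(rs.getString("name"));
            }
        }
    }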
About problems with hierarchical and network DBMSs see:
A Relational Model of Data for Large Shared Data Banks
by E. F. Codd
http://www.acm.org/classics/nov95/
in particular "1.2 Data Dependence in Present Systems":
1.2.1. Ordering Dependence
1.2.2. Indexing Dependence
1.2.3. Access Path Dependence
Also about the subject of modelling:
MODELS, MODELS, EVERYWHERE, NOR ANY TIME TO THINK
by Chris Date
http://www.dbdebunk.com/page/page/622923.htm
> With prevalence, you don't have to fit your schema to a specific
> algebra for which robust persistence was implemented, you choose
> whichever representation algebra you wish -- and maybe even several
> different representations on each mirror so as to accommodate
> different kinds of queries. And no, it doesn't have to be an "In
> Memory" representation; it could be file-based, too. It just has to be
> one isolated world kept coherent from meddling through anything but
> journalled transactions.
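For concreteness, the mechanism described above - an in-memory state
mutated only through journalled command objects - looks roughly like
this (a minimal sketch of my own in the style of Prevayler; all names
are invented, and real libraries add journal replay on restart,
snapshots and crash recovery):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    public class TinyPrevayler {

        // A transaction: a deterministic, serializable state transition,
        // written to the journal before it is applied.
        interface Command extends Serializable {
            void executeOn(Map<String, Long> accounts);
        }

        static class Deposit implements Command {
            final String account;
            final long amount;
            Deposit(String account, long amount) {
                this.account = account;
                this.amount = amount;
            }
            public void executeOn(Map<String, Long> accounts) {
                Long old = accounts.get(account);
                accounts.put(account, (old == null ? 0L : old) + amount);
            }
        }

        private final Map<String, Long> accounts = new HashMap<String, Long>();
        private final ObjectOutputStream journal;

        TinyPrevayler(File journalFile) throws IOException {
            journal = new ObjectOutputStream(new FileOutputStream(journalFile));
        }

        // Write-ahead: journal first, apply to the in-memory state second.
        synchronized void execute(Command c) throws IOException {
            journal.writeObject(c);
            journal.flush();
            c.executeOn(accounts);
        }

        public static void main(String[] args) throws IOException {
            TinyPrevayler p = new TinyPrevayler(new File("journal.log"));
            p.execute(new Deposit("alice", 100));
        }
    }

Note that nothing in this scheme even states, let alone enforces, an
integrity constraint: every Command must get it right on its own, in
every application - exactly the pre-DBMS situation described above.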
I suggest reading:
"COMPLEX" DATA TYPES: WHAT OBJECT PROPONENTS DON'T TELL YOU
http://www.pgro.uk7.net/fp1a.htm
(Note: this points to the old site; there is no copy on the new site.)
The Problem
The data type concept is one of the concepts least understood by
database practitioners. ***This is both a cause and a consequence of
the failure by SQL and its commercially available dialects to
implement relational domains, which are nothing but data types of
arbitrary complexity.*** It is for this reason that proponents of
object orientation can claim with impunity that the relational
approach doesn't support so-called "complex" data types or
"unstructured" data, and therefore object DBMSs (ODBMS) are superior
to relational DBMSs (RDBMS). This article explains the data type
concept, what DBMS support of data types means, and the distinction
between "simple" and "complex" data types.
[...]
Emphasis mine.
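A loose illustration of the point, a sketch of my own in Java (all
type and relation names are invented; this is not how any particular
DBMS implements domains): a "complex" value is simply a value of a
user-defined type, nothing in the relational model prevents such a type
from serving as a domain (an attribute type), and a single generic
operator still applies to every relation.

    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.function.Predicate;

    public class DomainsAreTypes {

        // A "complex" user-defined type; as a relational domain it is no
        // different in principle from INTEGER or CHAR.
        record Point(double x, double y) { }
        record Polygon(List<Point> vertices) { }

        // A tuple type of the relation "drawing": any domain can be an
        // attribute type, "complex" ones included.
        record Drawing(int id, String title, Polygon outline) { }

        // One generic operator (relational restriction) works on
        // relations of any tuple type, whatever their domains.
        static <T> Set<T> restrict(Set<T> relation, Predicate<T> p) {
            Set<T> result = new LinkedHashSet<T>();
            for (T t : relation)
                if (p.test(t))
                    result.add(t);
            return result;
        }

        public static void main(String[] args) {
            Set<Drawing> drawing = new LinkedHashSet<Drawing>();
            drawing.add(new Drawing(1, "triangle", new Polygon(List.of(
                new Point(0, 0), new Point(1, 0), new Point(0, 1)))));
            System.out.println(restrict(drawing, d -> d.id() == 1));
        }
    }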
And also:
Persistence Not Orthogonal to Type
by C.J.Date
(http://www.dbpd.com/vault/9810/date.html)
[..]
POTT [Persistence Orthogonal To Type] Violates Data Independence
[..]
- The object model says we can put anything we like in the database
(any data structure we can create with the usual programming
language mechanisms).
- The relational model effectively says the same thing--but then
goes on to insist that whatever we do put there be presented to the
user in pure relational form.
More precisely, the relational model, quite rightly, says nothing
about what can be physically stored. It therefore imposes no limits
on what data structures are allowed at the physical level; ***the
only requirement is that whatever structures are physically stored
must be mapped to relations at the logical level and be hidden from
the user***. Relational systems make a clear distinction between
logical and physical (that is, between the model and its
implementation), while object systems don't.
One consequence of this state of affairs is that, as already
claimed--but contrary to conventional wisdom--object systems might
very well provide less data independence than relational systems do.
For example, suppose the implementation in some object database of
the object EX mentioned earlier (denoting the collection of
employees in a given department) is changed from an array to a
linked list. What are the implications for existing code that
accesses that object EX? It breaks.
[..]
POTT Causes Additional Complexity
It should be obvious that POTT does lead to additional complexity--and
by "complexity" here I mean, primarily, complexity for the
user, although life does get more complex for the system too. For
example, the relational model supports just one "collection type
generator," RELATION, together with a set of operators--join,
project, and so forth--that apply to all "collections" of that type
(in other words, to all relations). In contrast, the ODMG proposals
support four collection type generators, SET, BAG, LIST, and ARRAY,
each with a set of operators that apply to all collections of the
type in question. And I would argue that the ODMG operators are
simultaneously more complicated and less powerful than the analogous
relational ones.
[..]
Again, emphasis mine. "POTT Causes Additional Complexity" reminds me of:
"Epigrams in Programming" by Alan J. Perlis
http://www.cs.yale.edu/homes/perlis-alan/quotes.html
9. It is better to have 100 functions operate on one data structure
than 10 functions on 10 data structures.
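Date's array-to-linked-list example is easy to make concrete; a sketch
of my own (not Date's code): a client bound to the physical structure
breaks when that structure changes, while a client of a logical
interface does not even notice.

    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.LinkedList;
    import java.util.List;

    public class DataIndependence {

        // "Object database" style: clients see the physical structure.
        static class DeptV1 {
            String[] employees = { "Smith", "Jones" };
        }
        // Client code navigates the structure directly, so changing the
        // array to a LinkedList breaks every such client:
        static String firstEmployeeV1(DeptV1 d) {
            return d.employees[0];
        }

        // Relational discipline: store whatever you like physically, but
        // expose only the logical level; clients never see the structure.
        static class DeptV2 {
            private List<String> employees =              // was String[]
                    new LinkedList<String>(Arrays.asList("Smith", "Jones"));
            Iterator<String> employees() {
                return employees.iterator();
            }
        }

        public static void main(String[] args) {
            System.out.println(firstEmployeeV1(new DeptV1()));
            DeptV2 d = new DeptV2();
            for (Iterator<String> it = d.employees(); it.hasNext(); )
                System.out.println(it.next()); // unchanged whatever the
                                               // physical representation
        }
    }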
> But more importantly, Prevalence makes you focus on the dynamics of
> the application. It makes you think about the atomic transformations
> of your system. You may have many representations and change
> representations with time; prevalence makes you realize that the
> lingua franca between all these possible implementations will be the
> very natural language of your system -- the language of its state
> transitions. And incidentally, when you make formal proofs of
> declarative properties about a system (as I have done in the past for
> security invariants), this is precisely the kind of things you do:
> consider the possible transitions of the system. Prevalence is all
> about realizing that the persistence of application data, as well as
> about anything else in an application, is better factored around the
> abstract computation model of the application than around the low-level
> operations of data-storage (be it files or tables) or other
> computing tricks.
WHAT?? A general, mathematical theory of data is certainly NOT related
to "low-level operations of data-storage" nor to "other computing
tricks". You are describing prevalence here, not the relational data
model. Of course you suffer from "The Logical-Physical Confusion":
The Logical-Physical Confusion
by Fabian Pascal
http://www.inconcept.com/JCM/May2002/pascal.html
[...] A confusion between logical and physical levels of data
representation has always plagued the database field. It usually is
implicit in most of what is said and done in the field and,
therefore, not readily discerned by the uneducated eye -- read: most
of the industry [...]
> As for the brittleness of using Prevalence through simple libraries
> such as Prevayler or Common Lisp Prevalence that rely on your manually
> matching your journal structure to meaningful operations, well, mind
> that the coherence can be enforced with a simple discipline, and that
> most importantly, this discipline can be largely automated -- which is
> what object databases do, in a way. Actually, the essence of
> Prevalence being but the factoring of robust persistence through
> journalling, we can see all (or only most?) robustly persistent
> implementations of any model as using prevalence internally, only in
> an ad-hoc way fit to said model. Such software infrastructure may
> relieve developers from having to manually enforce coherence of the
> persistence layers, but then again, it imposes upon them to manually
> enforce coherence of the mapping of their application to the
> persistence layer; this is but a displacement, which might or might
> not be a net gain, but at least isn't the clear-cut gain-without-cost
> that specific-persistence-method pundits would have us believe.
>
> As for the applicability of Prevalence ideas to TUNES, I think that the
> factoring brought by Prevalence is an essential conceptual tool, and a
> nice opportunity to use reflection. Reflection can be used to
> automatically enforce coherence of the journal structure with the
> application structure, while making things explicit at one place and
> then implicit the rest of the time. My bet is that Prevalence plus
> reflection can achieve cheaply what expensive object databases
> provide, in a way that is tailored to an application. Reflection is a
> tool that allows to take advantage of a good meta-level factoring of
> software. Prevalence is part of such a factoring.
Again the logical-physical confusion, and again this certainty that to
use the relational data model you need a "mapping" from an "internal"
representation (at the programming-language level) to an "external"
representation (at the persistence-store level): the (in)famous
"impedance mismatch". Well, that is just an artifact of current
implementations.
In their book "The Third Manifesto", Hugh Darwen and C.J. Date show
another approach: a complete integration of the relational data model
with a theory of types, including a proper concept of "inheritance"
(IMO an unfortunate choice of term, too overloaded; "subtyping" is
better). This approach is embodied in a *pedagogical* programming
language, Tutorial D.
In SIGMOD Record, Volume 24, Number 1, March 1995, there is a succinct
preview of the book's content:
http://www.acm.org/sigmod/record/issues/9503/
o directly as PostScript:
http://www.acm.org/sigmod/record/issues/9503/manifesto.ps
There is also a commercial product inspired by Tutorial D, Dataphor
(but currently they don't have a True RDBMS, so they automatically map
to available SQL DBMSs).
http://dataphor.com/
I cannot vouch for this product because I have never used it, so take
this as mere evidence of the feasibility of the approach.
Best regards.
--
Massimo Dentico