Google Summer of Code

Massimo Dentico m.dentico at virgilio.it
Fri Jun 10 00:59:24 PDT 2005


Note: my response is half a joke. I hope that readers can appreciate
the joking part. As usual, please excuse my poor English.

"Robin Green" <greenrd at greenrd.org> wrote:

> On Fri, Jun 10, 2005 at 02:33:51AM +0200, Massimo Dentico wrote:
>
> In that thread you reprint Date saying:
>
>     "For example, suppose the  implementation in some object  database of
>     the  object  EX  mentioned  earlier  (denoting  the  collection   of
>     employees  in a  given department)  is changed  from an  array to  a
>     linked  list.  What  are the  implications  for  existing code  that
>     accesses that object EX? It breaks."
>
> which makes him look frankly ridiculous.
>
> Yes, you might run in to that problem, of course. But only if you haven't
> programmed to an interface which is common to both array-lists and linked
> lists.

So, you implement a common protocol (interface) for *all* your collection
types and you end up with... what? If it is sufficiently general, this
protocol is no more powerful than the Relational Algebra or the Tuple/Domain
Calculi (this is not exactly true: the RA is deliberately not
Turing-equivalent). You are inventing a square wheel: *this* is frankly
ridiculous.
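To make the point concrete, here is a minimal sketch in Python (the names
restrict/project and the data are invented for illustration): push any
"common collection protocol" toward full generality and you re-derive
relational-algebra-style operators, which work by value over any physical
structure.

    # Hypothetical generic "collection protocol": sufficiently general,
    # it converges on relational algebra operators.

    def restrict(r, pred):
        # sigma: keep the tuples satisfying a predicate
        return {t for t in r if pred(t)}

    def project(r, *cols):
        # pi: keep only the named columns
        return {tuple(t[c] for c in cols) for t in r}

    # The very same code works whether EX is an array (list), a linked
    # structure, or any other iterable: access is by value, not by path.
    ex = [("smith", 1000), ("jones", 1200), ("blake", 900)]
    print(project(restrict(ex, lambda t: t[1] > 950), 0))
    # -> {('smith',), ('jones',)}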


> It is silly to claim that OO does not separate "physical" and "logical",
> because that is precisely what e.g. access control (i.e. "private" and "public"
> etc.) does.

You are silly to claim that OO *does* separate "physical" and "logical";
but of course I don't blame you: it is the outcome of the current
"education" in this field.

Please direct your browser to this URI and educate yourself,
"Does OOP really separate interface from implementation?":

    http://pobox.com/~oleg/ftp/Computation/Subtyping/

Note that the Liskov Substitution Principle (LSP), which I harshly called
"sacred" on another mailing list, was also discussed by Date. When Oleg
Kiselyov wrote that "a Set is not a Bag" in his example, he based his
conclusion on the implicit assumption that OO languages have "pointer
semantics" (I lack a better term at the moment), i.e. that objects possess
a Unique IDentifier (OID) and can be mutated through it. With pure "value
semantics" there is no problem and "a Set is a Bag" (and a Circle is an
Ellipse); that is, the intuitive "subtyping is subsetting" holds, as shown
by Date.
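Here is a minimal sketch of Kiselyov's point in Python (the Bag/CountingSet
names and the audit client are invented): under mutation and shared object
identity, a deduplicating "Set" breaks clients written against the Bag
contract, while with value semantics there is no shared mutable object to
break.

    class Bag:
        # A mutable multiset: put() always adds one occurrence.
        def __init__(self):
            self._items = []
        def put(self, x):
            self._items.append(x)
        def count(self, x):
            return self._items.count(x)

    class CountingSet(Bag):
        # The "a Set is-a Bag" subclass: put() ignores duplicates.
        def put(self, x):
            if self.count(x) == 0:
                super().put(x)

    def audit(bag):
        # Client code written against the Bag contract:
        # after two put(1), count(1) must be 2.
        bag.put(1)
        bag.put(1)
        assert bag.count(1) == 2, "Bag contract violated"

    audit(Bag())                # fine
    try:
        audit(CountingSet())    # LSP fails under mutation
    except AssertionError as e:
        print("broken:", e)

    # With value semantics "adding" returns a *new* value; a set really
    # is a special bag (subtyping as subsetting), and nothing breaks.
    def bag_put(items, x):
        return items + (x,)
    def set_put(items, x):
        return items if x in items else items + (x,)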


> > 1) the "network DB" part: accessing data by "navigation" is more
> >   difficult and less efficient than accessing them by value
>
> Agreed. However, prevalence does not stop you implementing indexing.
> You can implement indexes however you like, unlike with SQL databases
> where you typically end up relying on their dumb, simple implementations.
> Code will surely be reused for this if prevalence becomes really popular.

?? This has nothing to do with indexing. When you establish a *fixed*
access path to data (a web of pointers in your graph of objects),
you are in trouble. See "1.2.3. Access Path Dependence" in "A Relational
Model of Data for Large Shared Data Banks" by E. F. Codd:

    http://www.acm.org/classics/nov95/toc.html

The fact that you construct shortcuts (indexes) does *not* solve the
problem.
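A small Python sketch of the same problem, reusing Date's EX example
(classes and data invented): navigational code hard-codes the access path
and dies with the representation; value-based access names only the data.

    class Node:
        # the "new" implementation of EX: a linked list
        def __init__(self, value, next=None):
            self.value, self.next = value, next

    # Navigational code, written when EX was an array:
    def first_two(ex):
        return [ex[0], ex[1]]   # the access path (indexing) is hard-coded

    ex_array = ["smith", "jones", "blake"]
    print(first_two(ex_array))                     # works

    ex_linked = Node("smith", Node("jones", Node("blake")))
    # print(first_two(ex_linked))                  # TypeError: path gone

    # Value-based access states *what* is wanted, not how to reach it:
    ex_relation = {("smith",), ("jones",), ("blake",)}
    print({t for t in ex_relation if t[0] != "blake"})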

Moreover, the index implementations of SQL DBMSs are not as dumb as you
think.


> Obviously, there will be a "reinventing the wheel" stage, but this is
> a temporary phenomenon, a bit like the effort needed to "clone" a
> proprietary system into open source software. Once it's done, it's done
> (more or less).
>
> But, I hear you cry, why reinvent the wheel in the first place?
>
> It's not for everybody (too much work). But prevalence has its advantages
> ... OO model;

So called "OO" is more hype than substance: the term does not have
a single, clear, agreed upon definition; this is revealing. There
are varieties of OO languages that have important differences (think
about classes vs prototypes, for example).


> cuts out a chunk of slow bloat with its attendant possibility
> for annoying/incomprehensible errors like ERROR -92; easier to customise data
> storage to optimize for storage space or access time;
> you may get serialization for free ... etc. And last but by no means least,
> no more hideous inpenetrable 10 line long SQL clauses.

Amazing: 5 lines and so much fog. I have no time to waste, so
I respond only to the last statement:

a) SQL is a horrible language, not properly designed; Codd himself was
   probably the first to warn of its deficiencies. An RDBMS is not the
   same as an SQL DBMS, and with the new version of the SQL standard this
   is even more true (I have read that they have reintroduced *pointers*);
b) are 100/500/1000 lines of functionally equivalent procedural-navigational
   code more "penetrable" than those 10 lines of SQL? I doubt it.
   Use a good declarative language and those 10 lines reduce to 2 or 3,
   or even fewer; see KDB

       http://kx.com/news/in-the-news/pr-041228-vector.php

   in particular the comparison of SQL and Q at the bottom of the page
   (a small sketch of the same point follows these links):

       http://kx.com/q/e/tpcd.s (SQL code)
       http://kx.com/q/e/tpcd.q (q code)
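For a self-contained illustration of point (b), here is a sketch using
Python's bundled sqlite3 module (table and data invented): the declarative
form states the result; the navigational form spells out loop, accumulator
and grouping by hand -- and this is the *simplest* possible query.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    db.executemany("INSERT INTO orders VALUES (?, ?)",
                   [("acme", 10.0), ("acme", 5.0), ("globex", 7.5)])

    # Declarative: one statement, no access path spelled out.
    totals = db.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    ).fetchall()

    # Procedural-navigational equivalent: loop, accumulator and grouping
    # structure all written by hand.
    by_customer = {}
    for customer, amount in db.execute("SELECT customer, amount FROM orders"):
        by_customer[customer] = by_customer.get(customer, 0.0) + amount

    assert dict(totals) == by_customer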



> > 2) the "*without* MS" part: it put us back to pre-DBMS era, where
> >   each application dictate its own file format
>
> Er, with DBMSs, each app can (and sometimes does) dictate its own table format.

Of course, you fail to distinguish the physical from the logical level.
Nothing could be more wrong: a relational schema is not like a file format.


> Not surprisingly, this is not a big problem in practice. Although mapping
> between object models can be slightly more complicated than mapping between
> relational schemas, I'm not convinced that it's typically
> "difficult or impossible". Can you put forward a specific example?

I was not speaking about schema reconciliation (merging 2 or more different
databases, developed independently, with overlapping content). I was
speaking about applications that access the same database, each possibly
contributing its own data and accessing the data of the other applications.

Now, when 2 or more concurrent Prevalence applications want to share
data, how do they do it? I guess each application speaks with every
other application (RPCs, named pipes, sockets, whatever...) and every
application multiplexes messages to the appropriate object(s) in its
address space. In simple terms, they emulate the Actor model.

Now, here the problem arises (see below): how do you query your objects?
I mean, application A is written without knowledge of application B and
vice versa. How do they exchange data without previously agreeing upon
a common interface (protocol) for each class of interest? And if you don't
know which data (classes) will be of interest to future applications,
what do you do? Do you limit data access? Do you expose all classes?
Do you write wrappers?


> >   NO integrity assurance
>
> Anything that's Turing-complete can do anything that anything else can do.

The usual misuse of Turing-equivalence: so, why not go back to
programming in assembly? No, better, in machine language?


> But what if app A doesn't know about app B's integrity constraints?
>
> Hey, that's what ENCAPSULATION is for! You encapsulate the integrity assurance
> (yes, assurance) INSIDE the classes (or groups of classes, often called
> "components", "packages" etc.), so that it is impossible for
> other encapsulation-respecting classes to break integrity rules.
>
> But I see you covered that below - why then, did you say here there is, and I quote,
> "NO integrity assurance".

So, you have read what I wrote, but you have not understood
and/or appreciated it. I will repeat myself.

NO integrity assurance, period! And because it is not practical, most of
the time YOU DON'T EVEN PERCEIVE THE PROBLEM OF INTEGRITY ASSURANCE!!!

Even SQL users not educated in the fundamentals destroy every hope of
integrity assurance with denormalized schemas, duplicates (tables
without keys -- which are *not* relations), NULLs, and so on.
At that point, expressing integrity constraints is so difficult,
even declaratively, that it is not done at all.
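By contrast, here is a minimal sketch of declarative integrity with
Python's bundled sqlite3 module (the schema is invented for illustration):
keys, NOT NULL, a domain constraint and a referential constraint, declared
once and enforced by the DBMS for every writer.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")
    db.executescript("""
        CREATE TABLE departments (
            name TEXT PRIMARY KEY                    -- key: no duplicates
        );
        CREATE TABLE employees (
            emp_id INTEGER PRIMARY KEY,
            name   TEXT NOT NULL,                    -- no NULLs
            salary REAL NOT NULL CHECK (salary > 0), -- domain constraint
            dept   TEXT NOT NULL REFERENCES departments(name)
        );
    """)
    db.execute("INSERT INTO departments VALUES ('sales')")
    db.execute("INSERT INTO employees VALUES (1, 'smith', 1000.0, 'sales')")

    # Violations are rejected by the DBMS itself; no application code
    # has to re-check these rules.
    for bad in ("INSERT INTO employees VALUES (2, 'jones', -5.0, 'sales')",
                "INSERT INTO employees VALUES (3, 'blake', 900.0, 'nowhere')"):
        try:
            db.execute(bad)
        except sqlite3.IntegrityError as e:
            print("rejected:", e)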

See also below.


> >   The fact that with Prevalence you don't manage "files of data"
> >   in the usual sense is not relevant at all here: you end up in
> >   exactly the same situation as with the CRAP on our hard disks
> >   even these days, OLE/COM/DCOM/CORBA, etc..
>
> Guess what? If I'm a dumb-as-rocks piece of software (but that's redundant,
> almost all software is dumb-as-rocks) and I don't know how to
> interpret your tables - I can't make sense of your database, either.

I know how to query my relations arbitrarily, simply by knowing their
schema (and if I don't know it, I ask the catalog -- i.e., the metadata),
and it is possible to program generic tools, simple enough that even a
motivated end user can construct arbitrary reports.
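A minimal sketch of such a generic tool, again with Python's sqlite3
(SQLite exposes its catalog as sqlite_master and PRAGMA table_info; other
DBMSs have an information_schema): the report code knows nothing about the
schema in advance.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
    db.execute("INSERT INTO employees VALUES ('smith', 'sales', 1000.0)")

    def report(db, table):
        # Ask the catalog for the column names, then print every row.
        cols = [row[1] for row in db.execute(f"PRAGMA table_info({table})")]
        print(" | ".join(cols))
        for row in db.execute(f"SELECT * FROM {table}"):
            print(" | ".join(str(v) for v in row))

    # Works for *any* table the user names; no per-class interface needed.
    for (table,) in db.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"):
        report(db, table)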

How do you query your objects without an agreed-upon protocol?
Do you consult the interface (protocol) of each class? Do you let
your end users do this and write procedural code to print a simple,
not pre-programmed, report? Do you present your objects to the user and
say "look, this is the interface"? Are they happy?

**Are their data safe**? See "ON DECLARATIVE INTEGRITY SUPPORT
AND DATAPHOR", by Fabian Pascal:

    http://www.dbdebunk.com/page/page/622317.htm

    DH: [..] I've been rewriting a portion of a large Smalltalk application
    to use Dataphor, and I've been stunned to see just how much application
    code disappears when you have a DBMS that supports declarative integrity
    constraints. In some classes, over 90% of the methods became unnecessary.

    Fabian Pascal: I don’t know why you are stunned. Anybody who knows and
    understands data fundamentals should expect this. Database definition/schema
    is nothing but integrity constraints and the sum total of integrity
    constraints is the best approximation to what the database means to
    the DBMS. Declare those and applications will only need to deal with
    communication with, and presentation of results to the user.

Of course, users need to understand their data (relating such data
to the "real world") to make sense of them and use them, but this is more
or less a superset of the formal "understanding" that a DBMS has
of the data.


> > The problem is that for integrity assurance without a DBMS,
> > the programs that interact with your data need to "understand"
> > data semantics (meaning).
>
> Actually they don't. Encapsulation.

Encapsulation does *not* solve the problem, as previously stated;
but even if it solved the problem in theory, it is not practical
to assure integrity procedurally.


> > So-called OO "encapsulation" is a way
> > to assure integrity, but is an error-prone, brittle method (pun
> > intended) beacuse procedural in nature. Most of the time is from
> > hard to impossible to check procedurally a complex network of
> > constraints, because is like to write, on a per program base,
> > an ad-hoc constraints solver
>
> So librify it! Write a constraints solver once and reuse it! What
> is the problem here?

Do you understand that, with your "librify it" and "what is the problem",
you are constructing a DBMS, bit by bit?


> The overuse of arguments amounting to "I know of no freely available
> code to do X in Y in a suitable way, therefore Y is inferior" is a
> big problem in computing. Yes, of course, those arguments are
> important for pragmatic reasons for day-to-day software development
> - but they are contingent, historically-specific arguments, as it were.
>
> I see your overall point here is you don't see that the
> benefits justify the cost of all this wheel-reinvention.

Yes, it seems that you understand at least this.

Much ado about nothing; a variation of Philip Greenspun's Tenth Rule
of Programming applies here:

    Any sufficiently complicated C or Fortran program contains an
    ad hoc, informally-specified, bug-ridden, slow implementation of
    half of Common Lisp.

Substitute "Common Lisp" with "true RDBMS" and you are done.


> I disagree, although I don't of course think prevalence is suitable
> for everything.
> I find it a bit annoyingly limited and clunky at present. (But that's
> a contingent, historically-specific, pragmatic point I'm making, not
> a telling blow against prevalence forever more.)

Of course: go ahead and "perfect" it, and we will see whether it even
approaches the most stupid DBMS.


> > (Note that imperative OO languages with declarative constraints,
> > aka Constraint Imperative Programming (CIP) languages, exist;
> > see the Kaleidoscope family: http://cliki.tunes.org/Kaleidoscope )
> >
> > The advantage of the relational data model is that it offers a clean,
> > simple medium to express arbitrary structures, manipulations and
> > integrity, all in a declarative way.
>
> But why should I necessarily bother to distort my object graphs into
> relational structures, just to take advantage of this "clean, simple
> medium"?

I could not care less about your "object graphs"; I care about representing
data from application domains (note that the relational data model
can represent graphs when this makes sense; such representations are
not "distorted" because there is no such thing as a "true representation").


> I think leaving object graphs as object graphs should be the
> default approach, because it is the simplest, from an OO mindset. OO
> with contracts can also express arbitrary structures, manipulations
> and integrity... declaratively if you really want (that's where the
> contracts come in).

Fantasies: languages with "contracts" do not usually transform
your contracts (specifications) *automatically* into procedural code;
you need to write this code *by hand*. Also, in this way you fail
to separate the correctness of the implementation from the checking
of data integrity (so-called business rules), which is an
application-domain concern (the physical/logical confusion again).
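A minimal sketch of what such a "contract" amounts to in a typical OO
language (the class and its invariant are invented): hand-written checking
code, placed and maintained manually inside every mutator, entangled with
the very implementation it is supposed to judge.

    class Account:
        def __init__(self, balance):
            self.balance = balance
            self._check()            # the invariant, checked by hand...

        def withdraw(self, amount):
            self.balance -= amount
            self._check()            # ...and again after every mutator

        def _check(self):
            # the "contract" itself: manual, procedural, per-class code
            assert self.balance >= 0, "negative balance"

    a = Account(100.0)
    a.withdraw(30.0)                 # ok; withdraw(200.0) trips the assert

    # Compare with a declarative constraint such as CHECK (balance >= 0),
    # enforced by the DBMS for every writer with no per-method code.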

This is in contrast even with a typical SQL DBMS, which performs what is
known as automatic programming (an instance of metaprogramming). See this
quote on Feature-Oriented Programming:

    http://cliki.tunes.org/Feature-Oriented%20Programming

    The future of software engineering is in automation. The best
    practical example of automated software engineering is relational
    query optimization. That is, database queries are expressed using
    a declarative specification (SQL) that is mapped to an efficient
    program. The general problem of mapping from a declarative spec
    to a provably efficient program is very difficult -- it is called
    automatic programming. The success of relational databases rests
    on relational algebra -- a set of fundamental operators on tables
    -- and that query evaluation programs are expressions -- compositions
    of relational algebra operators. Optimizers use algebraic identities
    to rewrite, and hence optimize, expressions, which in turn optimizes
    query evaluation programs. This is how databases "solve" the automatic
    programming problem for query evaluation programs.

Note that only query evaluation is examined here, but the same holds
for the physical representation of the data on disk, which can change
based on collected statistics of data usage or on DBA "hints" (tuning).
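A toy sketch of this algebraic rewriting in Python (operators and data
invented): the optimizer replaces an expression with a cheaper equivalent;
here, the classic rewrite of a selection over a Cartesian product into a
join that filters while combining.

    from itertools import product

    def select(rel, pred):                     # sigma
        return {t for t in rel if pred(t)}

    def cross(r, s):                           # Cartesian product
        return {a + b for a, b in product(r, s)}

    emp  = {("smith", "sales"), ("blake", "ops")}
    dept = {("sales", "NY"), ("ops", "LA")}

    # Naive expression: materialize the whole product, then filter.
    naive = select(cross(emp, dept), lambda t: t[1] == t[2])

    # Rewritten expression (an identity the optimizer would apply):
    # filter while combining, never building non-matching pairs.
    rewritten = {a + b for a in emp for b in dept if a[1] == b[0]}

    assert naive == rewritten                  # same answer, cheaper program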


> -- 
> Robin


I will stop here in this branch of the thread; I don't want to waste my
time repeating what is better addressed elsewhere. If you are *really*
interested, then consult the Database Debunkings site:

    http://www.dbdebunk.com

Don't dismiss it after reading a few lines because you "feel" insulted.
It is a coherent whole and needs time to be digested.


Regards.

--
Massimo Dentico







