Google Summer of Code

Robin Green greenrd at greenrd.org
Fri Jun 10 07:54:48 PDT 2005


On Fri, Jun 10, 2005 at 09:59:24AM +0200, Massimo Dentico wrote:
> So, you implement a common protocol (interface) for *all* your collection
> types and you end up with... what? If it is sufficiently general then this
> protocol is no more powerful than the Relational Algebra or the Tuple/Domain
> Calculi (this is not exactly true, RA is not Turing-equivalent on purpose).

That is a crucial difference. Therefore it _is_ more powerful.

> You are inventing a square wheel: *this* is frankly ridiculous.

You see an OO collection as a square wheel, because to you relations are
"natural"; whereas I see a relational schema as a step backwards from OO,
because to me OO is a very "natural" way to do things.

> >It is silly to claim that OO does not separate "physical" and "logical",
> >because that is precisely what e.g. access control (i.e. "private" and 
> >"public"
> >etc.) does.
> 
> You are silly to claim that OO does separate "physical" and "logical",
> but of course I don't blame you, it is the outcome of current "education"
> in this field.
> 
> Please, direct your browser at this URI and educate yourself,
> "Does OOP really separate interface from implementation?":
> 
>    http://pobox.com/~oleg/ftp/Computation/Subtyping/

Summary: When you don't follow the Liskov Substitution Principle, your class
will fail to implement its interfaces correctly. This is a tautology.
The conclusion is: follow the LSP.

(Actually there is a subtlety here: I prefer to say that subclasses should
follow their superclasses' *contracts*, not their superclasses' *behaviour*
- otherwise you end up creating unnecessary extra abstract classes/interfaces,
which may mean API breakage when you subclass a previously-unsubclassed class.)
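
To illustrate the distinction, here is a hedged sketch in Java (the class
and method names are invented for illustration, not taken from any real
API). The contract of addAll is only that every element ends up in the
collection; the subclass changes *how* that happens, yet still substitutes
correctly, so no extra abstract class is needed:

// Hypothetical sketch - names are illustrative only.
class SimpleBag<T> {
    protected final java.util.List<T> items = new java.util.ArrayList<T>();

    // Contract: after the call, contains(x) is true for every x in xs.
    public void addAll(java.util.Collection<? extends T> xs) {
        items.addAll(xs);
    }

    public boolean contains(T x) {
        return items.contains(x);
    }
}

// Different *behaviour* (adds one element at a time, logging each one),
// but the same *contract* - so it substitutes for SimpleBag safely.
class LoggingBag<T> extends SimpleBag<T> {
    @Override
    public void addAll(java.util.Collection<? extends T> xs) {
        for (T x : xs) {
            System.out.println("adding " + x);
            items.add(x);
        }
    }
}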

> Note that the Liskov Substitution Principle (LSP), that I called harshly
> "sacred" on another mailing-list

Ironically, it's not sacred - it's broken regularly in practice, and I suspect
the majority of OO programmers wouldn't recognise the term - but it *should* be
sacred ;)

> , was discussed by Date also. When Oleg
> Kiselyov wrote that "a Set is not a Bag" in his example, he bases his 
> conclusion
> on the implicit assumption that OO languages have "pointers semantics"
> (I lack a better term in this moment), or that Objects possess a Unique
> IDentifier (OID). With a pure "values semantics" there are no problems
> and "a Set is a Bag" (or a Circle is an Ellipse), that is it is true
> the intuitive "subtyping is equivalent to subsetting", as shown by Date.

No, value semantics does not determine whether a Set is a Bag.
Only the semantics of the operations ultimately determines that.

Kiselyov defines a Bag in this article as "an unordered collection of
possibly duplicate items".

But this is insufficiently precise: the definition does not include
any *operations*. Without them, you don't know whether the operation
called "add" will add everything, add only things that match certain
criteria (like "not already in the collection"), or do something crazy
like delete everything.

To talk meaningfully about whether one OO type is a subtype of another
OO type, we must unambiguously specify the operational interface of both
- that is, all those operations which will be present as public methods,
constructors, destructors or attributes (properties) in those classes.
Anything less is handwaving.

So.

I put it to you that whether Set is a subtype of Bag depends fundamentally,
not on value-vs-pointer semantics, but on the definition of the add
operation on Bag. (Although, admittedly, the choice of value semantics or
pointer semantics tends to influence which definition of the add
operation is chosen.)

We can define a reasonable data type which models a mathematical bag,
and whose operational definition makes it a supertype of Set in both
pointer-land and value-land. We can also define a reasonable data type
which also models a mathematical bag, and whose operational definition
makes it *not* a supertype of Set in either pointer-land or value-land.

I will call the former the "permissive model" of a Bag, and the latter
the "strict model":

 PERMISSIVE MODEL:

Operations which modify the bag, or return an updated copy of it *of the
same runtime type as the original bag*, are not defined, in order to allow
subtypes to impose arbitrary preconditions on what may be added, when,
etc. Only "read operations" like size and contains are defined. I call
this "permissive" because this is kind of the ultimate in permissiveness
from the point of view of the subclassers.

This *does not* mean that permissive bags are immutable! Some are, some
are not. It merely means that the definition of mutation operations are
left to the subtypes. In extremis, a subtype might not even *have* an
add method - it might be a fixed-size mutable collection, like an array,
for example. Arguably, an array is a bag. Fixed-size arrays certainly do
not have add operations.
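
To make the permissive model concrete, a minimal sketch in Java (the names
are mine, purely illustrative):

// Hypothetical sketch of the permissive model: only read operations are
// part of the supertype, so subtypes decide whether and how mutation works.
interface ReadableBag<T> {
    int size();
    boolean contains(T x);
    // Deliberately no add or remove here: a fixed-size, array-backed
    // subtype may have no add operation at all.
}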

 STRICT MODEL:

size :: Bag T -> NonNegativeInteger

add :: Bag Self T -> T -> Bag Self T
postcondition: size (add c x) == (size c) + 1
 // Adding an item increases the size by one

This pseudocode notation demands explanation. It is based on Haskell.
"Bag Self T" means a Bag, whose leaf type (i.e. its runtime type) is Self,
which contains zero or more objects of type T. For example, if we had a
StringArrayBag [I chose this to avoid writing out an infinite or recursive
type], that would be a "Bag StringArrayBag String", so Self for that
instance would be StringArrayBag, and T would be String.

So

add :: Bag Self T -> T -> Bag Self T

means that add takes two arguments, a Bag and an object, and returns
a Bag of the *same leaf type* as the original. This is a crucial constraint.
It might sound esoteric, until you realise that all instance mutator methods
in Java and C++ are like this. None of them change the runtime type of the
"this" object [at least, not according to the language's type system - the 
application may choose to discriminate more finely]. It's not at all
esoteric. It's a normal expectation.
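
Here is how the strict model might be sketched in Java, using a recursive
type parameter to play the role of Self (again, hypothetical names, not a
real library):

// Hypothetical sketch of the strict model. Self stands for the leaf
// (runtime) type, as in the pseudocode "Bag Self T".
interface Bag<Self extends Bag<Self, T>, T> {
    int size();
    boolean contains(T x);
    // Postcondition: add(x).size() == size() + 1
    Self add(T x);
}

// A concrete leaf type: here Self = StringArrayBag and T = String.
final class StringArrayBag implements Bag<StringArrayBag, String> {
    private final String[] items;

    StringArrayBag(String[] items) {
        this.items = items;
    }

    public int size() {
        return items.length;
    }

    public boolean contains(String x) {
        for (String s : items) {
            if (s.equals(x)) return true;
        }
        return false;
    }

    public StringArrayBag add(String x) {
        String[] bigger = java.util.Arrays.copyOf(items, items.length + 1);
        bigger[items.length] = x;
        return new StringArrayBag(bigger); // size grows by exactly one
    }
}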

The strict model, in my estimation, is the prevailing definition of Bag in
the programming community. Constrained Bags (and Sets) are a minority
pursuit. Most OO collections are unconstrained (except by member type -
which is a slightly odd thing to say, since member type is itself an
important constraint, but I hope you know what I mean). Unconstrained
Bag types correspond to the Strict model, whilst constrained bags can only
fit into the Permissive model (because of the LSP).

If I am right in that estimation, we can say that Set is not a subtype
of Bag, under their prevailing definitions, period.

But how is this possible, you may ask, when in a value semantics system
like a relational database, Sets are a subset of Bags?

Actually they cannot be, under the Strict model. That is precisely my point.
If a Set s is in the set of Bags, then s is a Bag. But I have just proved that
it cannot be, under the strict model. Therefore, Sets are not a subset
of Bags. Rather, you have Sets which have the same elements as Bags, but which
do not (and cannot) have a conformant add operation, and therefore are not
Bags under the Strict model.
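
A sketch of why (hypothetical names, value semantics throughout): any
reasonable Set discards duplicates, so its add cannot meet the strict
postcondition.

// Even as a pure value, a Set's add cannot satisfy
// size(add(c, x)) == size(c) + 1 when x is already present.
final class StringSet {
    private final java.util.Set<String> items;

    StringSet(java.util.Set<String> items) {
        this.items = items;
    }

    public int size() {
        return items.size();
    }

    public StringSet add(String x) {
        java.util.Set<String> copy = new java.util.HashSet<String>(items);
        copy.add(x);                // no effect if x is already present...
        return new StringSet(copy); // ...so size may stay the same: postcondition broken
    }
}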

In my new programming language's class library, I have both strict model
and permissive model collections. The strict model collections are subtypes
of the permissive model collections (LSP only allows that way round).
Both are useful. If you want to have strict model collections - or indeed,
more generally, *any* values which preserve their type when they are modified
- then the runtime type of a value cannot be derivable from the value's
attributes alone, as it is in the simplistic "subtyping is subsetting" idea.

Sorry for the dense prose. More explanation would require even more text. I tried
to keep it short. ;)

> >> 1) the "network DB" part: accessing data by "navigation" is more
> >>   difficult and less efficient than accessing them by value
> >
> >Agreed. However, prevalence does not stop you implementing indexing.
> >You can implement indexes however you like, unlike with SQL databases
> >where you typically end up relying on their dumb, simple implementations.
> >Code will surely be reused for this if prevalence becomes really popular.
> 
> ?? This has nothing to do with indexing. When you establish a *fixed*
> access path to data (with web of pointers in your graph of objects)
> you are in trouble. See "1.2.3. Access Path Dependence" in "A Relational
> Model of Data for Large Shared Data Banks" by E. Codd:
> 
>    http://www.acm.org/classics/nov95/toc.html
> 
> The fact that you construct shortcuts (indexes) does *not* solve the
> problem.

You are talking about a different problem now - the brittleness of access
paths, rather than their alleged inefficiency. (I should have qualified my
agreement above: they are *sometimes* slower than table lookups, but
sometimes faster.)

Databases are no different. They have "fixed access paths" - queries.
And just as you can librify the SQL, you can librify the access paths.
Copying and pasting anything more complex than identifiers - whether it is
access paths like foo->bar->foo, or SQL like SELECT foo FROM (SELECT foo,
bar FROM ...) - is a recipe for burdening yourself with
searching-and-replacing later. ;)

You might seek to claim that "queries in the pure relational model don't
change, whereas access paths do". But this is self-evidently nonsense.

And even if it were true, it would be irrelevant - librify the access paths,
so that you only have to change them in one place. This is not rocket
science (even though some advocates of "adaptive programming" seem to
think it is a groundbreaking new invention; it is not).
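
Concretely (a hedged sketch with invented classes): put the access path in
one query method and callers never see it.

// Hypothetical domain classes, just enough to illustrate the point.
class Address  { final String city;            Address(String c)   { this.city = c; } }
class Customer { final Address defaultAddress; Customer(Address a) { this.defaultAddress = a; } }
class Order    { final Customer customer;      Order(Customer c)   { this.customer = c; } }

// The access path order -> customer -> defaultAddress lives in exactly one
// place; if the object graph changes shape, only this method changes.
final class OrderQueries {
    static Address shippingAddressFor(Order order) {
        return order.customer.defaultAddress;
    }
}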

> >It's not for everybody (too much work). But prevalence has its advantages
> >... OO model;
> 
> So called "OO" is more hype than substance: the term does not have
> a single, clear, agreed upon definition; this is revealing.

Some people have tried to pass off "object-based" systems (no inheritance)
as "object-oriented". They are snake-oil salespersons. The essence of OO is
reasonably clear within the computer science community, I think:
encapsulation, inheritance and polymorphism.

> There
> are varieties of OO languages that have important differences (think
> about classes vs prototypes, for example).

I cannot comment on this because I don't know enough about prototype-based
languages.

> b) are 100/500/1000 lines of functionally equivalent procedural-navigational
>   code more "penetrable" than those 10 lines of SQL? I doubt it.

Sometimes, yes. However, you would not have 1000 lines replacing every query 
- that is copy-and-paste brain damage.

>   Use a good declarative language and those 10 lines reduce to 2/3
>   or even less, see KDB
> 
>       http://kx.com/news/in-the-news/pr-041228-vector.php

I prefer not to program in something that looks like line noise. ;)

> >> 2) the "*without* MS" part: it put us back to pre-DBMS era, where
> >>   each application dictate its own file format
> >
> >Er, with DBMSs, each app can (and sometimes does) dictate its own table 
> >format.
> 
> Of course, you fail to distinguish the physical and the logical level.
> Nothing could be more wrong: a relational schema is not like a file format.

Yes it is: if you don't know the database schema, you can only treat the data as
meaningless values, which is not very useful.

> >Not surprisingly, this is not a big problem in practice. Although mapping
> >between object models can be slightly more complicated than mapping between
> >relational schemas, I'm not convinced that it's typically
> >"difficult or impossible". Can you put forward a specific example?
> 
> I was not speaking about schema reconciliation (merging 2, or more, 
> different
> data bases, developed independently, with overlapping content). I was 
> speaking
> about applications that access the same data base, eventually contributing
> each its original data and accessing data of other applications.
> 
> Now, when 2 or more concurrent Prevalence applications want to share
> data, how do they do this? I guess: each application speaks with each
> other application (RPCs, named pipes, sockets, whatever..) and every
> application multiplexes messages to the appropriate object(s) in its
> address space. In simple terms, they emulate the Actor model.

Or, if they want to be more efficient, they can share an address space.
This can be made secure and at least as robust as when they do not
share an address space.

Whatever.

> Now, here arises the problem (see below): how do you query your objects?

By calling methods?

> I mean, application A is written without knowledge of application B and
> vice-versa.

So? All they have to know about is the classes that they need to talk to,
not the whole of the other application.

> How do they exchange data without previously agreeing upon
> a common interface (protocol) for each class of interest?

What is the problem with agreeing on a common interface? That's how all
OO programming should be, whether interprocess or not!

> And if you don't
> know what data (classes) will be of interest to future applications,
> what do you do? Do you limit data access? Do you expose all classes?
> Do you write wrappers?

It depends.

Data hiding is a tradeoff. When you hide a data structure, you lose
the ability to access it directly from outside, but you gain the ability
to delete it or modify what it contains without having to worry about
breaking code on the outside. A lot of programmers seem to think that these
tradeoffs are worth it. "10,000 programmers can't be wrong" ;)

> >>   NO integrity assurance
> >
> >Anything that's Turing-complete can do anything that anything else can do.
> 
> The usual misuse of Turing-equivalence: so, why not go back to
> programming in assembly? No, better, in machine language?

I'm not misusing it. Just because something is not convenient now, doesn't
mean it won't be convenient in the future, when better systems are developed.

> I know how to arbitrarily query my relations simply by knowing their schema
> (if I don't know, I ask the catalog -- aka, metadata) and it is possible
> to program generic tools, simple enough that even a motivated end user
> can construct arbitrary reports.

It would not be hard to make such generic tools for Prevayler, either.
Not hard at all.
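
For instance (a rough sketch, not Prevayler's actual API), reflection gives
you the equivalent of the database catalog:

import java.lang.reflect.Field;

// Generic report: walk any object's fields by reflection, much as a
// database report tool walks the catalog.
final class GenericReport {
    static void print(Iterable<?> objects) throws IllegalAccessException {
        for (Object o : objects) {
            System.out.println(o.getClass().getSimpleName() + ":");
            for (Field f : o.getClass().getDeclaredFields()) {
                f.setAccessible(true);
                System.out.println("  " + f.getName() + " = " + f.get(o));
            }
        }
    }
}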

> How do you query your objects without an agreed upon protocol?

Call methods? Examine the attributes?

> Do you consult the interface (protocol) of each class?

Only the classes you are interested in. You have to consult the schema of
a database unless you already know the attribute names you want, so this
is no big deal.

> Do you let
> your end users do this and write procedural code to print a simple,
> not pre-programmed, report?

No, you provide code to help them. They should not have to write loops
and things themselves.

> Do you present your objects to the user and
> say "look, this is the interface".

The objects' interfaces can be as low-level or as high-level as you like.
You can have a method called selectCustomerWithName(String). No problem.
You can even have a SQL-style query language. You can even have a wrapper
that presents the prevalent system as a virtual SQL database to the world.
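
For example (a hypothetical sketch, not any particular library's API):

// A query method over an in-memory object graph.
class Customer {
    final String name;
    Customer(String name) { this.name = name; }
}

final class CustomerDirectory {
    private final java.util.List<Customer> customers =
        new java.util.ArrayList<Customer>();

    void add(Customer c) { customers.add(c); }

    // Callers ask by name; they never navigate the underlying list themselves.
    Customer selectCustomerWithName(String name) {
        for (Customer c : customers) {
            if (c.name.equals(name)) return c;
        }
        return null; // not found
    }
}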

> >So librify it! Write a constraints solver once and reuse it! What
> >is the problem here?
> 
> Do you understand that you are constructing, with your "librify it",
> "what is the problem", a DBMS, bit by bit?

Yes. An OODBMS, which I prefer to an RDBMS.

> Of course, let's go to "perfection" it and we will see if we will
> approach the most stupid DBMS.

Let's place our bets ;)

I predict that OO languages will go one better than RDBMSs. Eventually,
they will not merely "error out" on constraint violations like RDBMSs
can, but they will actually model-check a program to prove that it is
incapable of violating any integrity constraints. (That's my main
research interest.) Beat that!

-- 
Robin