WhyOO (draft)

Francois Rideau rideau@mathp7.jussieu.fr
Tue, 7 Jun 94 8:21:52 MET DST


------------------------------------------------------------------------------
Why do we need Object-Orientedness and other such programming paradigms?
=========================================================================

  99.9% of programming time throughout the world is spent doing the same
basic things again and again.
  Of course, you can't escape asking students to repeat what their elders did,
so that they can understand it and internalize the constraints of computing.
The problem is that student programming represents less than 50% of
programming time, which means that even professionals spend most of their
time writing new versions of earlier works over and over, nothing really
"worth" the time they spend on it -- moreover, genuinely new work is often
done by students or their elder equivalents, researchers, which further
shrinks the share of time professionals spend doing things that are really
new.
  Now, you may think that such a situation creates jobs, and so is
desirable; why bother, then?
  Well, rewriting is a serious problem for everyone. First of all, rewriting
is a waste of time that makes programming delays much longer, and thus very
costly; more costly still is the fact that rewriting is an error-prone
operation, and at any point while rewriting one may introduce errors that are
very difficult to trace and remove; the most costly aspect is that the results
of the programming effort are short-lived, for the rewriting scheme makes your
investment valuable only for a short period -- until you have to reinvest the
same sum to rewrite it again; moreover, even if the produced code is actually
safe and bug-free, the technique used to write it is so unsafe that external
programmers can't possibly trust it. Finally, good programmers (those who like
their activity) don't like spending most of their time on such trivial
problems, and would rather spend it on real, higher, theoretical or human
problems. Actually, there are just as many of these problems as trivial ones
(an infinity), but far fewer get solved, because the trivial ones get all the
programmers' attention.
  Therefore, we can now take it as established that code rewriting is a bad
thing, and that we want the opposite: code *reuse*. But how can we reuse
code without spreading errors in the reused code, without introducing errors
due to misunderstanding of old code, and without falling into code
obsolescence?

  The first and most common way to reuse code is to rely on standard
libraries: you wait for the function you need to be included in the standard
library and use it as the manual describes. Unhappily, standards take a long
time to arrive, and even longer to be implemented the way they are documented.
By that time, you will have needed new non-standard features, and will have
had to implement them yourself or to use non-standard libraries. So this kind
of code reuse isn't the solution. Moreover, it relies heavily on a central
agency to edit revised versions of the standard library. This does not mean
that no effort should be made to build such a library, as a library greatly
helps communication: it's like vocabulary and culture; but with vocabulary
alone, you cannot say anything new. To say new things, you need to combine the
vocabulary, and that's grammar -- the structure of the language. However
extended it may be, a library will never contain everything that everyone may
want to express.
  It should be stressed that computer languages have nothing to do with
finished, static, "perfect" computer programs -- those could have been written
in any language, preferably a portable one (that is, any ANSI-supported
language, i.e. surely "C", even if I'd then prefer FORTH). Computer languages
have to do with programming: with modifying programs and creating new
programs, not just contemplating existing ones. Thus, the qualities of a
programming language do not lie in what can be done with the language (the
language being hopefully Turing-equivalent, with libraries to access all the
hardware), nor in the efficiency of a straightforward implementation (a good
"optimizing" compiler can always be achieved, and/or speed-critical routines
can be included in libraries); they lie in the ease of expressing *new*
concepts and of *modifying* existing routines. With this in mind, one
programming language is better than another if it is easier for a human to
write a new program, to modify an existing program, or of course to reuse
existing code (which is an extension of modifying code).

  The second, simplest way to reuse code is just to copy and paste it, then
modify it to fit one's new particular purpose. But possible bugs, or more
generally missing features, are then spread across the system; that is, when
you want to modify a portion of code that was copied and pasted, you have to
hunt down all the copies and modify them in the same way. This can be a very
long and difficult job, and yet another source of errors. Moreover, the same
algorithm may need many modifications to adapt to different situations. Thus
this method is definitely bad for anything but the reuse of mostly identical
code in another program.
  The next way to reuse code is to have written code that includes run-time
tests for all the different cases you may need in the future, and branches to
the right one. This method proves hard, slow, and awkward to use, and/or
produces inefficient code. Moreover, it is very hard to anticipate one's
future needs.
  A variant that combines both previous methods is to group all those similar
code fragments into a library, so that you can both cover future cases (well,
you can still modify the library if you need to later, but then we see that
this method is no cure for a badly designed *language*) and duplicate code
where you need it to achieve some efficiency, while at the same time confining
code propagation to a known, limited area. This amounts to extending the
system's vocabulary on your own (or between you and your friends). As we have
already seen, this is a good idea; but again, a base of vocabulary, however
big, cannot replace a good grammar; culture can never replace intelligence. It
saves you a lot of work that has already been done by others (but can also be
costly to acquire -- so don't try to acquire every piece of useless culture),
yet it won't speed up the remaining work (and, as we saw, there is always
remaining work). If you want to dig a very long tunnel, unless there's already
one finished or almost finished, you'd better look for efficient digging
machines than for the entrance of a gallery someone once began.

  Then what are "intelligent" ways to produce reusable, easy-to-modify code?
Such a method should allow reusing code without duplicating it, and without
growing it in a way that is both inefficient and incomplete: an algorithm
should be written once and for all, for all the possible applications it may
have. That's __genericity__.
  First, we see that the same algorithm can apply to arbitrarily complex data
structures; but a piece of code can only handle a finitely complex data
structure; thus to write code with full genericity, we need to use code as a
parameter, that is, __second order__. In a low-level language (like "C"),
this is done using function pointers.
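  As a minimal sketch of this idea (written in an ML dialect, one of the
languages cited below; the function and type names are invented for the
example), a generic algorithm takes the code it depends on as a parameter,
and can then be reused on arbitrarily complex data without being rewritten:

    (* A generic algorithm written once: it finds a "best" element of
       a list according to whatever comparison code the caller passes
       in.  The comparison is the "code as parameter" -- second order. *)
    let rec best greater = function
      | [] -> None
      | x :: rest ->
          (match best greater rest with
           | None -> Some x
           | Some y -> Some (if greater x y then x else y))

    (* The same algorithm reused on integers and on a record type,
       without rewriting it: *)
    let greatest_int l = best (fun a b -> a > b) l
    type person = { name : string; age : int }
    let oldest l = best (fun a b -> a.age > b.age) l
    let alphabetically_last l = best (fun a b -> a.name > b.name) l
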
  We soon see problems that arise from this method, and solutions for them.
The first is that whenever we use some structure, we have to explicitly pass
functions along with it to tell the various generic algorithms how to handle
it. Worse, a function that does not itself need some access method of the
structure may have to call other algorithms which in turn will need that
access method; and exactly which method they need may not be known in advance
(because which algorithm will eventually be called is not known, for instance,
in an interactive program). That's why explicitly passing the methods as
parameters is slow, ugly and inefficient; moreover, it is code propagation
(you propagate the list of methods associated with the structure -- if the
list changes, all the code that uses it changes). Thus, you mustn't pass
those methods as parameters *explicitly*. You must pass them implicitly:
when using a structure, the actual data and the methods to use it are embedded
together. Such a structure, including the data and the methods to use it, is
commonly called an *object*; the constant data part and the methods constitute
the *prototype* of the object; objects are commonly grouped into *classes*
made of objects with a common prototype and sharing common data. *This* is the
fundamental technique of /Object/-/Oriented/ programming. Well, some call it
Abstract Data Types (ADTs) and say it's only part of the "OO" paradigm, while
others don't see anything more in "OO". But that's only a question of
dictionary convention. In this paper, I'll call this much ADT, while "OO" will
also include more things. But be aware that the words are not settled, and
that other authors may give the same names to different ideas and vice
versa.
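  To make this concrete, here is a small sketch in the same ML-style syntax
(the counter and its method names are invented for illustration, not a
standard interface): the data and the methods to use it travel together as
one value, so no method list has to be propagated along with the data:

    (* An "object": the data and the methods to use it are embedded
       together, so callers never pass the access methods separately. *)
    type counter = {
      value     : unit -> int;    (* method: read the current state  *)
      increment : unit -> unit;   (* method: modify the hidden state *)
    }

    (* A "class": common code building objects that share the same
       prototype (the same set of methods) over their own private data. *)
    let make_counter start =
      let state = ref start in
      { value     = (fun () -> !state);
        increment = (fun () -> state := !state + 1) }

    (* Generic code mentions only the prototype, never the representation. *)
    let bump_twice c = c.increment (); c.increment (); c.value ()
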
   BTW, the same code-propagation argument explains why side effects are an
especially useful thing, as opposed to strictly functional programming (see
pure ML :); of course, side effects greatly complicate the semantics of
programming, to the point that ill use of side effects can make a program
impossible to understand and/or debug -- that's what not to do, and that
possibility is the price to pay to prevent code propagation. Sharing *mutable*
data (data subject to side effects) between different embeddings (different
*users*), for instance, is something whose semantics still has to be clearly
settled (see below about object sharing).

  The second problem with second order is that if we are to pass functions to
other functions as parameters, we need tools to produce such functions.
Methods can be created dynamically just like "mere" data, which is all the
more frequent as a program needs user interaction. Thus, we need a way to
have functions not only as parameters, but also as results of other functions.
This is *higher order*, and a language which can achieve it has a *reflexive*
semantics. Lisp and ML are such languages; FORTH is too, although standard
FORTH memory management isn't conceived for a heavily dynamic use of this
feature in a persistent environment. In "C" and similar low-level languages,
which don't allow a direct portable implementation of the higher-order
paradigm through the usual function pointers (because low-level code
generation is not available as it is in FORTH), the only way to achieve higher
order is to build an interpreter for a higher-order language such as LISP or
ML (usually much more restricted languages are actually interpreted, because
programmers don't have time to elaborate their own user-customization
language, whereas users don't want to learn a new complicated language for
each different application, and there is currently no standard, user-friendly,
small-scale higher-order language that everyone can adopt -- there are just
plenty of them, either very imperfect or too heavy to include in every
single application).
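  Here is what higher order buys, in the same ML-style sketch (the names and
the tiny interaction are mine, for illustration only): a function built at
run time, from user input, and returned like any other value:

    (* Higher order: a function that *returns* a function.  The
       returned code did not exist before run time; it is built from
       dynamic data, as happens constantly in interactive programs.  *)
    let make_scaler factor = fun x -> x * factor

    let () =
      print_string "scale by? "; flush stdout;
      let factor = int_of_string (read_line ()) in
      let scale = make_scaler factor in      (* new function, built now *)
      List.iter (fun n -> Printf.printf "%d\n" (scale n)) [1; 2; 3]
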
  With respect to typing, higher order means the target universe of the
language is reflexive -- it can talk about itself.
  With respect to object-oriented terminology, higher order consists in having
classes as objects, in turn groupable into *meta-classes*. And we then see
that it _does_ prevent code duplication, even in cases where the code concerns
just one user, as the user may want to consider concurrently two -- or more --
different instantiations of the same class (i.e. two *sub-users* may need to
have distinct but mostly similar object classes). Higher order somehow allows
there to be more than one computing environment: each function has its own
independent environment, which can in turn contain functions.
  As for those who despise higher order and user customizability, I shall
reply that there is *NO* frontier between using and programming. Programming
*is* using the computer, while using a computer *is* programming it. The only
thing you get by having different languages and interfaces for "programmers"
and mere "users" is building plenty of inefficient languages, and stupefying
all computer users with a lot of useless, ill-conceived, similar but different
ones. You also make development cycles longer by building hardly crossable
borders between languages with different general functionalities, and prevent
much useful work from being done by users intelligent enough to write their
own modules, but who don't have the time to write a complete application from
*scratch* (i.e. what is commonly provided to a computer user, including
so-called professional-quality commercial software), let alone make it
suitable for exchanging data with the other software they must use
("compatible").
  Some say that common users are too stupid to program; that's only despising
them. Most of them don't have the *time* or the *mind* to learn all the
subtleties of advanced programming; but they often emulate macros by hand,
and, once shown how to do it, are very eager to write their own
macros/aliases.
  Some fear that giving a "mere" user a powerful programming language opens
the door to piracy and system crashes. Well, if the language library has such
security holes, that's a library misconception; if the language doesn't allow
the design of a secure library, that's a language misconception. Whatever was
misdesigned, it should be redesigned, amended or replaced (as "C" should be).
If you don't want people to cross an ill-placed, cracked wall, you'd better
rebuild a new, better-placed wall than hire an army of guards to shoot at
people trying to get to the other side of the misplaced wall, or unwillingly
trespassing through a crack -- and in the first solution, don't forget to also
hire people to cope with the shortcomings due to the wall's misplacement. And
if what you want most is for nobody to trespass, well, just forbid people from
ever nearing the wall -- don't let them use the computer at all.
  The truth is that any computer user, whether a programming guru or a
novice, is somehow trying to communicate with the machine. The easier the
communication, the quicker, better and larger the work that gets done.

  To end with genericity, here is some material to feed your thoughts about
the need for system-builtin genericity: let's consider multiplexing.
  For instance, Unix (or worse, DOS) user/shell-level programs are ADTs, but
with only one exported operation: the "C" main() function, one per executable
file. As such "OS"es are huge-grained, with ultra-heavy communication
semantics between executable files (or even between processes of the same
executable file), no one can afford one executable per actual exported
operation. Thus you group operations into single executables whose main()
function multiplexes those functionalities.
  Also, communication channels are heavy to open, use and maintain, so you
must explicitly pass all kinds of different data and code through single
channels by manually multiplexing them (and the same goes for keeping many
heavy files versus one manually multiplexed huge file).
  But the system cannot provide builtin multiplexing code for every single
program that will need it. It does provide code for multiplexing the hardware:
memory, disks, serial, parallel and network lines, screen, sound. POSIX
requirements grow with the things a compliant system ought to multiplex, and
new multiplexing programs keep appearing. So the system grows, yet it will
never be enough for user demands as long as *all* possible multiplexing has
not been programmed; and meanwhile applications spend most of their time
manually multiplexing and demultiplexing objects not yet supported by the
system.
  Thus, any software development on common OSes is hugeware: huge in the
hardware resources needed (memory -- RAM or disk -- CPU power, time, etc.),
huge in the resources spent, and, most importantly, huge in programming time.
  The problem is that current OSes provide no genericity of services. Thus
they can never do the job for you. That's why we really NEED *generic* system
multiplexing, and more generally genericity as part of the system. If one
generic multiplexer object were built, with two generic specializations for
serial channels and flat arrays, plus some options for real-time behaviour and
recovery strategy on failure, that would be enough for all the multiplexing
work currently done everywhere.
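  To suggest what such a generic multiplexer might look like, here is a very
rough ML-style sketch (everything in it -- the channel record, the tag
convention, the handler table -- is invented for the example, not an existing
system interface): one multiplexer written once, parameterized by the raw
channel it runs over, instead of one hand-written multiplexing loop per
application:

    (* A generic multiplexer, written once and for all: many logical
       streams share one raw channel by tagging each message.  The raw
       channel is a parameter, so the same code can serve a serial
       line, a flat file, a network socket...                         *)
    type 'a raw_channel = {
      send    : string -> 'a -> unit;    (* tag, then payload *)
      receive : unit -> string * 'a;
    }

    (* Writing side: a virtual output channel is just the raw send
       specialized to one tag.                                        *)
    let virtual_out raw tag = fun payload -> raw.send tag payload

    (* Reading side: one generic dispatch loop, parameterized by a
       table of handlers, replaces every hand-written demultiplexer.  *)
    let dispatch raw handlers =
      while true do
        let (tag, payload) = raw.receive () in
        (match List.assoc_opt tag handlers with
         | Some handle -> handle payload
         | None -> ())                   (* unknown tag: ignored here *)
      done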

  So much for Full Genericity: Abstract Data Types and Higher Order. Now,
while this allows code reuse without code replication -- which is what we
wanted -- it also raises new communication problems: if you reuse objects,
especially objects designed far away in space and/or time (i.e. designed by
other people, or by another, former self), you must ensure that the reuse is
consistent, that an object can rely upon the behaviour of the objects it uses.
This is most dramatic if the used object (e.g. part of a library) comes to
change, and a bug (one you may have been aware of -- a quirk -- and already
worked around in your program) is removed or added. How do we ensure the
consistency of object combinations?
  Current common "OO" languages do not perform many consistency checks. At
most, they include some more or less powerful kind of type checking (the most
powerful being that of well-typed functional languages like CAML or SML), but
you should know that even powerful type checking is not yet secure. For
example, you may well expect more precise behaviour from a comparison function
on an ordered class 'a than just having type 'a->'a->{LT,EQ,GT}, i.e. telling
you that when you compare two elements the result can be "less than", "equal"
or "greater than": you may want the comparison function to be compatible with
the class actually being ordered, that is, x<y & y<z => x<z and so on. Of
course, a typechecking scheme, which is more than useful in any case, is a
deterministic decision system, and as such cannot completely check arbitrary
logical properties such as the one above (see your nearest lectures in logic
or computation theory). That's why, to add such enhanced security, you must
add non-deterministic behaviour to your consistency checker and/or ask for
human help. That's the price of 100% secure object combining (but not 100%
secure programming, as human error is still possible in mis-stating the
requirements for using an object, and the non-deterministic behaviour can
require humans to force the admission of consistency checks the computer
could not prove).
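  To restate the example in code (ML-style sketch, illustrative names): the
type below is everything a typechecker can verify about the comparison, while
the ordering law it should also respect can only be expressed as a separate
logical property -- written here as a runtime check over sample values, which
a formal tool would instead try to prove:

    (* What the typechecker verifies: the comparison has this type.  *)
    type ordering = LT | EQ | GT
    type 'a comparison = 'a -> 'a -> ordering

    (* What it cannot verify: the property that makes the class
       actually ordered, e.g. transitivity of LT.                    *)
    let transitive_on (cmp : 'a comparison) x y z =
      not (cmp x y = LT && cmp y z = LT) || cmp x z = LT

    (* A well-typed but inconsistent comparison slips through the
       type system without complaint:                                *)
    let bogus : int comparison = fun _ _ -> LT
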
  This kind of consistency security through logical, formal properties of
code is called a formal specification method. The future of secure programming
lies there (try enquiring in industry about the cost of testing and/or
debugging software that can endanger the company, or even human lives, if
badly written, and about the insurance funds spent to cover possible failures
-- you'll understand). Life-critical industries already use such modular
formal specification techniques.
  In any case, we see that even when such methods are not applied
automatically by the computer system, the programmer has to apply them
manually, by putting the specification in comments and/or in his understanding
of the code, so he is doing the computer's work.

  Now that the skeleton of the language's requirements is settled, we can
think about the peripheral problems that follow from it.





------------------------------------------------------------------------------
[draft]
- not building an artificial border between programmers and users --> not
only the system programming *language* must be OO, but the whole *system*.
- easy user extensibility --> language-level reflexivity.
- sharing mutable data: how ? --> specifications & explicitly mutable/immutable
(or more or less mutation-prone ?) & time & locking -- transactions.
- objects that *must* be shared: all the hardware resources -- disks & al.
- sharing across time --> persistence
- reaching precision/memory/speed/resource limits: what to do? --> exceptions
- recovering from exceptional situations: how? --> continuations (easy if
 higher order is on)
- tools to search through a library --> must understand all kinds of morphisms
in a logically specified structure.
- sharing across a network -->
 - almost the same: tools for merging code --> that's tricky. Very important
for networks or even data distributed on removable memory (aka floppies) --
each object should have its own merging/recovery method.
- more generally tools for having side effects on the code.




* Structures:
-------------
We consider logical structures: each structure contains some types, and
symbols for typed constants, relations and functions between those types.
We also know some algebraic properties satisfied by those objects; i.e. a
structure is a set of typed objects, together with a set of constant, function
and relation symbols, and the properties they are known to satisfy.

  A structure A is interpreted in another structure B if you can map the
symbols of A onto combinations of symbols of B (with all the properties
preserved). The simplest way to be interpreted is to be included.
  A structure A is a specialization of a structure B if it has the same
symbols, but you know more properties about the represented objects.
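  In ML-style module notation (a sketch; the monoid example is mine, not part
of the text above), such a structure is close to a signature -- types plus
typed symbols -- and an interpretation maps its symbols onto an implementation
in which the stated properties hold:

    (* A logical structure as a signature: some types, plus typed
       symbols for constants and functions between them.  The
       algebraic properties are only stated in comments here; the
       signature fixes the symbols and their types, not the laws.    *)
    module type MONOID = sig
      type t
      val unit    : t              (* constant symbol *)
      val combine : t -> t -> t    (* function symbol; assumed associative,
                                      with unit as neutral element   *)
    end

    (* An interpretation of the structure in another one: each symbol
       is mapped onto (a combination of) symbols of the target, here
       the integers with addition.                                   *)
    module IntSum : MONOID = struct
      type t = int
      let unit = 0
      let combine = ( + )
    end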

* Mutable objects:
------------------
We consider the structure of all the possible states of the object. The
actual state is a specialization of that structure. The changing states
across time constitute a stream of states.
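  A tiny sketch of that view (ML-style, invented names): the successive
states of a mutable object can be modelled as a lazily generated stream
indexed by time:

    (* The successive states of a mutable object, seen as a stream:
       each step produces the next element rather than destroying
       the previous one.                                             *)
    type 'a state_stream = Cons of 'a * (unit -> 'a state_stream)

    let rec states_of step current =
      Cons (current, fun () -> states_of step (step current))

    (* Example: a counter whose stream of states is 0, 1, 2, ...     *)
    let counter_states = states_of (fun n -> n + 1) 0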

* Sharing Data
--------------
  The problem is: what to do if someone modifies an object that others see?
Well, it depends on the object. An object that is to be shared must have been
programmed with special care.
  The simplest case is when the object is atomic, and can be read or modified
atomically. At any given time, the state is well defined, and this state is
what the other sharers see.
  When the object is a rigid structure of atomic objects, well, we assume that
you can lock the parts of the object that must be changed together -- in the
meantime, the object is inaccessible, or only readable -- and when the
modification is done, everyone can access the object as before. That's
transactions.
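  Here is a rough sketch of that discipline (ML-style; it uses the standard
Mutex primitives from the threads library, while the object and its
"total must match its log" invariant are invented for the example): the parts
that must change together are only changed while the lock is held, so sharers
that respect the lock never observe a half-done modification:

    (* The parts that must change together: a running total and the
       log of operations that explains it.                           *)
    type audited = { mutable total : int;
                     mutable log   : int list;
                     lock          : Mutex.t }

    let make () = { total = 0; log = []; lock = Mutex.create () }

    (* A transaction: while the lock is held, other sharers (which
       also take the lock) cannot touch the object; afterwards,
       everyone sees a state in which total and log agree.           *)
    let add obj amount =
      Mutex.lock obj.lock;
      obj.total <- obj.total + amount;
      obj.log   <- amount :: obj.log;
      Mutex.unlock obj.lock
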
  Now, what to do when the object is a very long file (say, text), of which
each user sees a small part (say, a full screen of text), and someone
somewhere adds or deletes some records (say, a sentence)? Will each user's
screen scroll according to the number of records deleted? Or will it stay at
the same spot? The latter behaviour seems more natural. Thus, a file has the
behaviour that whenever a modification is made, all pointers into the file
must change. But consider a file shared by _all_ the users across a network:
now a little modification by someone somewhere will affect everyone! That's
why both the semantics and the implementation of shared objects should be
thought about at length before they are settled.

Problem: recovery
-----------------
What to do when assumptions are broken by higher-priority objects?
e.g. when the user interrupts a real-time process, when he forces a
modification in an otherwise locked file, when the process runs out of
memory, etc.
Imagine a real-time process is interrupted: will it continue where it
stopped, or will it skip what it would have done during the interruption?
Imagine the system runs out of memory: whose memory do you reclaim?
From the biggest process? The smallest? The oldest? The first to ask for
more? If objects keep spawning, thus filling memory (or the CPU), how do you
detect "the one" responsible and destroy it?
If an object locks a common resource, and is then itself blocked by a failure
or some other unwanted latency, should its transaction be cancelled, so that
others can access the resource, or should the whole system wait for that
single transaction to end?

  As for implementation methods, you should always be aware that defining all
those abstractions as the abstractions they are, rather than as hand-coded
emulations of them, allows better optimization by the compiler, a quicker
writing phase for the programmer, neater semantics for the reader/reuser, no
implementation code propagation, etc.
  Partial evaluation should also allow the specialization of code that doesn't
use all of the language's powerful semantics, so that standalone code can be
produced without including the full range of heavy reflexive tools.


------------------------------------------------------------------------------
Summary:
========

* Axioms:
--------
= "No man should do what the computer can do quicker for him (including the
time spent getting the computer to understand what to do)" -- that's why we
need to be able to give orders to the computer, i.e. to program.
= "Do not redo what others already did when you've got more important work" --
that's why we need code reuse.
= "no uncontrolled code propagation" -- that's why we need genericity.
= "security is a must when large systems are being designed" -- that's why we
need strong typechecking and more.
= "no artificial border between programming and using" -- that's why the entire
system should be OO with a unified language system, not just a hidden system
layer.
= "no computer user is an island, entire of itself" -- you'll always have to
connect (through cables, floppies or CD-ROMs or whatever) to external
networks, so the system must be open to external modifications, updates and
such.


  That is, without ADTs and ways of combining ADTs, you spend most of your time
manually multiplexing. Without semantic reflexivity (higher order), you spend
most of your time manually interpreting runtime generated code or manually
compiling higher order code. Without logical specification, you spend most of
your time manually verifying. Without language reflexivity, you spend most of
your time building user interfaces. Without small grain, you spend most of
your time emulating simple objects with complex ones. Without persistence,
you spend most of your time writing disk I/O (or worse, net I/O) routines.
Without transactions, you spend most of your time locking files. Without
code generation from constraints, you spend most of your time writing
redundant functions that could have been deduced from the constraints.
  To conclude, there are essentially two things we are fighting: the lack of
features and power in software, and the artificial barriers that the misdesign
of earlier software builds between computer objects and other computer
objects, between computer objects and human beings, and between human beings
and other human beings.