Structured objects vs. flat ASCII files

Paul Prescod papresco@calum.csclub.uwaterloo.ca
Fri, 09 May 1997 12:01:52 -0400


Mike McDonald wrote:
> >A nice OS needs a standard way to store files. On Unix the standard is
> >ascii with new-lines. This is bad. This makes unnecessary work. This
> >is why we have perl and awk and sed. (Need I say more?)
> 
>   You have a better way of storing unstructured text than as ASCII
> characters separated by newlines?

Well, since when is a mail folder "unstructured text"? It is implicitly
structured text. Also: newlines are ASCII characters.

> The advantage of flat files is they are more flexible than structured
> ones. I can always build structure on top of a flat file but I may not
> be able to flatten a structured one.

Not true. All structured files are linearized in their storage on disk
and in RAM, for instance. You also cannot build structure on top of a
flat file if there is not enough information in it. Have you ever sent
mail and had a ">" sign inserted before the word "From" at the start of
a line? That's the stupid mail system trying to get around the fact that
there is not enough structure in the file to tell a message boundary
from body text. Another example is going through the lispos mailing
list archives and trying to figure out what is original text and what is
someone else's text quoted in a particular message. Since the structure
is thrown away between the mail program and the archive, that
information is not retrievable.
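
To make the "From" problem concrete, here is a minimal Python sketch of
the kind of escaping a flat mailbox file forces on the mail system. The
helper name mboxo_escape and the sample bodies are made up for
illustration. It only shows that two different bodies can end up stored
as the same text, so no reader can tell which one was sent.

```python
def mboxo_escape(body: str) -> str:
    """Prefix '>' to every body line starting with 'From ' (hypothetical helper)."""
    escaped = []
    for line in body.split("\n"):
        # A line starting with "From " marks a new message in the flat file,
        # so the same text inside a body has to be disguised.
        if line.startswith("From "):
            line = ">" + line
        escaped.append(line)
    return "\n".join(escaped)


# Two different bodies: one author typed "From here", the other ">From here".
body_a = "Hello,\nFrom here on it gets worse."
body_b = "Hello,\n>From here on it gets worse."

stored_a = mboxo_escape(body_a)
stored_b = mboxo_escape(body_b)

# Both bodies are stored identically, so the original cannot be recovered.
print(stored_a == stored_b)
```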

Of course a *structured file format* (such as SGML) can capture all of
this information, but it is also harder to use with grep, Perl, awk,
etc. The more structure (information) there is, the more often tools
that only understand lines of text will fall down. So you end up reading
the structured format in, getting rid of the structure (tags) and then
piping the result to these programs.
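
As a rough illustration of "getting rid of the structure", here is a
small Python sketch. It assumes well-formed XML as a stand-in for SGML,
and the element and attribute names (message, subject, body, quote,
author) are invented for the example, not taken from any real mail
format.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for one message; the element names are assumptions.
marked_up = (
    "<message>"
    "<subject>Structured objects</subject>"
    '<body><quote author="Mike">flat files are flexible</quote>'
    " Not true.</body>"
    "</message>"
)

root = ET.fromstring(marked_up)

# Throwing the structure away is short: keep only the text nodes.
pieces = [text.strip() for text in root.itertext() if text.strip()]
flat_text = " ".join(pieces)

# Line-oriented tools see only this string, with no idea who wrote what.
print(flat_text)
```

Once flat_text exists, the fact that part of it was quoted from someone
else is gone; only the marked-up version still knows that.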

> In your example, what if I want to search the entire message? Now
> all of a sudden, your example becomes a lot more complicated than sed,
> awk, and perl. Or in my case, fgrep.

Not at all. I do this sort of thing all of the time in DSSSL, a purely
functional language. A mail message is an object, made up of other
objects. In DSSSL:

(data *mailbox*) would return the text of the entire mailbox (perhaps
lazily).

(data (message-ref 1 *mailbox*)) would return the text of an entire
mail message.

(data (subject (message-ref 1 *mailbox*))) would return the text of the
subject of a particular message. And so forth. You could of course use
"map" and "grep" to search all subject lines this way, but the resulting
object will still be an object with full context information (as in
"what message am I in") rather than a "line of text" in isolation as
with grep or awk.
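
For readers who do not know DSSSL, the same idea can be sketched in
Python. This is only an analogy under stated assumptions: the Message
and Mailbox classes and this data function are hypothetical and merely
mimic the calls above. They are not a real mail API or a DSSSL binding.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    number: int
    subject: str
    body: str


@dataclass
class Mailbox:
    messages: List[Message]


def data(obj) -> str:
    """Flatten any object in the tree to plain text (hypothetical analogue)."""
    if isinstance(obj, str):
        return obj
    if isinstance(obj, Message):
        return obj.subject + "\n" + obj.body
    if isinstance(obj, Mailbox):
        return "\n".join(data(m) for m in obj.messages)
    raise TypeError(f"cannot flatten {type(obj).__name__}")


mailbox = Mailbox([
    Message(1, "Structured objects", "Newlines are ASCII characters."),
    Message(2, "Re: flat files", "I can always build structure on top."),
])

# Search subjects only; the hits are still Message objects, not bare lines.
hits = [m for m in mailbox.messages if "flat" in data(m.subject)]
for m in hits:
    print(m.number, m.subject)
```

A search over data(mailbox) would behave like fgrep over the whole
file, while the search over subjects keeps each hit attached to its
message.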

In short, it is easier to THROW AWAY STRUCTURE (as the data function
does) than to REBUILD IT (which is often impossible).

The primary benefits of standard structured formats such as SGML are
a) interchange and b) recovery: if everything else screws up on your
LispOS system, files in standard formats may still be retrievable. Files
in proprietary or binary formats may not be.

 Paul Prescod