[gclist] Re: Articles expiring too fast!

Henry G. Baker hbaker@netcom.com
Mon, 27 Jan 1997 21:23:53 -0800 (PST)

> I know this is off-topic for GCLIST, but in my opinion the problem with
> usenet news is that the floodfeed architecture is wrong.  It doesn't
> scale.
> -- Steve

I'm not so sure that it _is_ off-topic.  I have wondered ever since my
Symbolics daze why the Symbolics news and mail servers didn't store
stuff just once when it was broadcast.  For example, if a workgroup
shares a file system, and one user broadcasts a message to all of his
associates, that message is _copied_ N times _in the same file server_(!)
and the multiple copies 'blow up' the file system as a result.  Even
worse, if any fraction of these people keep copies of this message
in their 'already read' file, then these multiple copies may stick
around for ages.  I once concluded that some of the file servers at
Symbolics had stuff duplicated 50-100 times over!

In the olden Unix days, when each news article was a separate _file_,
I checked to see if 'cross-posted' articles were stored in multiple
groups--i.e., Unix directories.  Yup!  They didn't even utilize Unix's
capability for storing the same file in multiple directories (hard
links)!  And, of course, the per-file overhead for these small files
is mind-boggling--the minimum allocation for a file on some file
systems may be as large as 32k or 64k bytes.
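The hard-link trick mentioned above is easy to demonstrate.  Here is a
minimal sketch (my illustration, not how any particular news server
actually worked) in which a cross-posted article gets one copy on disk
and one directory entry per group; the group and article names are
made up:

```python
import os
import tempfile

# Hypothetical spool with two newsgroup directories.
spool = tempfile.mkdtemp()
for group in ("comp.lang.lisp", "comp.lang.scheme"):
    os.makedirs(os.path.join(spool, group))

# Store the article once, under the first group.
article = os.path.join(spool, "comp.lang.lisp", "1234")
with open(article, "w") as f:
    f.write("Subject: Re: Articles expiring too fast!\n\nbody\n")

# The cross-post is a hard link, not a copy: one inode, one set of
# disk blocks, two directory entries.
crosspost = os.path.join(spool, "comp.lang.scheme", "1234")
os.link(article, crosspost)

nlink = os.stat(article).st_nlink           # link count is now 2
same = os.path.samefile(article, crosspost) # same inode either way
```

A copying server would have doubled the disk usage (plus per-file
allocation overhead) at the `os.link` step; here only a directory
entry is added.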

And so far, we're only talking about 100% copies of things.  If you
start talking about files that are minor modifications of one
another--e.g., multiple versions of some software system--you get
blow-ups of two orders of magnitude as well.  I once examined the
10-year life history of a very large program that went through
hundreds of versions, and the percentage of change between successive
versions was very low--a few percent at most.  Interestingly, a very
large fraction of the files were touched in each update, but in most
cases only with the most minor changes.
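The arithmetic behind that blow-up is easy to check.  A sketch under
assumed numbers (a 100-line file with one line changed between
versions): storing the base plus a textual delta is far cheaper than
storing each version whole, and byte-identical content can be
deduplicated by hashing it to a storage key.

```python
import difflib
import hashlib

# Two consecutive versions of a hypothetical file, differing by one
# line out of 100 -- the "few percent" case described above.
v1 = ["line %d\n" % i for i in range(100)]
v2 = list(v1)
v2[42] = "line 42, slightly edited\n"

# A unified diff captures the change in a handful of lines...
delta = list(difflib.unified_diff(v1, v2, lineterm="\n"))

# ...so storing v1 plus the delta saves most of the space of a full
# second copy.
savings = 1.0 - len(delta) / len(v2)

# Identical content hashes to the same key, so a content-addressed
# store keeps only one physical copy of it.
key_a = hashlib.sha256("".join(v1).encode()).hexdigest()
key_b = hashlib.sha256("".join(v1).encode()).hexdigest()
```

Real delta-storage systems use binary deltas and smarter chunking, but
the proportions are the point: a few-percent change should cost a
few percent of the space.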

Yes, I know that there are specialized systems with version control, etc.,
but why deal with a system specialized only for source control of programs,
when nearly every kind of document has exactly the same problem?  This
is why some form of bulk heap storage would be truly valuable.

Henry Baker
www/ftp directory: