[gclist] Re: Articles expiring too fast!

Erik Naggum erik@naggum.no
29 Jan 1997 14:14:26 UT


* Henry Baker
| In the olden Unix days, when each news article was a separate _file_, I
| checked to see if 'cross-posted' articles were stored in multiple
| groups--i.e., unix directories.  Yup!  They didn't even utilize Unix's
| capability for storing the same file in multiple directories!

I think this is untrue.  ever since the B news I installed on my much too
small iron in 1988, cross-posting has used Unix hard links.  this has
probably been so since the earliest days.  however, it's hard to manage
multi-gigabyte disks on many Unix boxes, so instead of mounting one disk
for news, you mount many, and since the file system mirrors the newsgroup
hierarchy, a disk usually corresponds to a major hierarchy.  unfortunately,
hard links cannot span file systems, so an article cross-posted across
hierarchies needs one copy per disk, but such cross-posting is infrequent.
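
to illustrate: the whole cost of a cross-post is one link(2) call per extra
newsgroup -- the article body exists once on disk, with one directory entry
per group that carries it.  a minimal sketch, with made-up spool paths:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* the article already exists under its first newsgroup */
        const char *first = "/var/spool/news/comp/lang/lisp/1234";
        /* cross-posted to a second group: same inode, one more name */
        const char *cross = "/var/spool/news/comp/lang/scheme/5678";

        if (link(first, cross) == -1) {
            perror("link");
            return 1;
        }
        return 0;
    }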

| And, of course, the per-file overhead for these small files is
| mind-boggling--the minimum file size on some file systems may be as large
| as 32k or 64k bytes.

remember, you can tune a file system, but you can't tune a fish.  news file
systems differ greatly from user file systems -- they use the smallest
block fragments the system provides.  1K is usual.  512 bytes is not uncommon.

the problem with managing news systems today is that expire and other tools
are asynchronous.  the system already knows everything it needs to know to
do very painless and fast expiration, maintaining an essentially flat use
of space by removing old articles when new ones come in.  in fact, that
list of articles-to-be-removed can be maintained very easily.  since RAM
isn't an object these days (at least not compared to the massive disks they
spend on news), and since news servers have such severe problems if they
crash that system administrators do everything in their power to keep them
running, expiration could be a "simple" question of keeping allocation
information in memory.  (the same problem applies to databases -- as soon
as you don't trust the memory, you face a staggering performance wall.  if
you _can_ trust the memory, you should forget the disks.)  also, since all
messages are uniquely identified by a message-ID, mapping from newsgroups
to message-IDs is a trivial exercise.  this means you can allocate space to
a message-ID instead of to an article in a given newsgroup, and only keep a
mapping from article number in a newsgroup to message-ID.  etc.
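
here is a minimal sketch of that scheme in C -- everything lives in ordinary
memory and all names and sizes are made up: one store keyed by message-ID,
one article-number-to-message-ID map per group, and a FIFO so that accepting
a new article expires the oldest one and keeps disk use essentially flat:

    #include <stdio.h>
    #include <string.h>

    #define MAX_ARTICLES  1024        /* tiny numbers for illustration   */
    #define MAX_PER_GROUP 1024
    #define MSGID_LEN     128

    struct spool {                    /* one store for all groups        */
        char msgid[MAX_ARTICLES][MSGID_LEN]; /* FIFO of message-IDs      */
        long head, tail, count;              /* oldest, next free, size  */
    };

    struct group {
        char name[64];
        long article[MAX_PER_GROUP];  /* article number -> spool slot    */
        long next_number;
    };

    /* drop the oldest article; a real server would free the space named
       by the message-ID and remove it from every group index using it   */
    static void expire_oldest(struct spool *s)
    {
        printf("expiring %s\n", s->msgid[s->head]);
        s->head = (s->head + 1) % MAX_ARTICLES;
        s->count--;
    }

    /* accept one article: store it once under its message-ID, then add
       an article-number-to-message-ID mapping in each group it is posted
       to.  cross-posting costs one mapping entry per group, not a copy. */
    static void accept_article(struct spool *s, const char *msgid,
                               struct group *groups, int ngroups)
    {
        long slot;
        int i;

        if (s->count == MAX_ARTICLES) /* spool full: make room first     */
            expire_oldest(s);

        slot = s->tail;
        snprintf(s->msgid[slot], MSGID_LEN, "%s", msgid);
        s->tail = (s->tail + 1) % MAX_ARTICLES;
        s->count++;

        for (i = 0; i < ngroups; i++)
            groups[i].article[groups[i].next_number++] = slot;
    }

    int main(void)
    {
        static struct spool s;                     /* zero-initialized   */
        static struct group g[2] = { { "comp.lang.lisp" },
                                     { "comp.lang.scheme" } };

        /* one cross-posted article: stored once, indexed in both groups */
        accept_article(&s, "<19970129.gc@example.invalid>", g, 2);
        printf("%s article 0 -> %s\n", g[0].name, s.msgid[g[0].article[0]]);
        printf("%s article 0 -> %s\n", g[1].name, s.msgid[g[1].article[0]]);
        return 0;
    }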

however, all these good ideas run into the same roadblock: the dumb user
agent, which depends on each article being a separate file identified by a
number in the filesystem hierarchy corresponding to the newsgroup
hierarchy.  this is the real problem.  however, this, too, could have been
solved if Unix allowed loadable file systems that could fake it, while the
real storage management was message-ID-based.  in other words, a Network
News File System.  this file system could also be distributed, such that
only message-IDs and perhaps a few headers were exchanged, and servers
would pick up articles as they were referenced.  if they requested articles
through other servers the same way news propagates today, those other
servers could keep copies or know where to get an article.  new newsgroups
could be created without a lot of hassle, and the arbitron data would be
accurate.  (arbitron is the system that gathers USENET readership statistics.)
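
a sketch of the lookup such a Network News File System would do when a dumb
user agent opens /news/comp/lang/lisp/1234 -- no such file system exists, and
both helper functions below are stand-ins for the real per-group index and
the real (possibly remote) retrieval by message-ID:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* stand-in for the per-group index (article number -> message-ID)   */
    static const char *group_to_msgid(const char *group, long number)
    {
        (void) group;
        (void) number;
        return "<19970129.gc@example.invalid>";
    }

    /* stand-in for retrieval by message-ID; a real server would read the
       local spool or ask a peer server that announced the message-ID    */
    static const char *fetch_by_msgid(const char *msgid)
    {
        return msgid;
    }

    /* map /news/comp/lang/lisp/1234 to ("comp.lang.lisp", 1234), then
       resolve the article through its message-ID                        */
    static const char *nnfs_open(const char *path)
    {
        char group[256];
        char *slash, *p;
        long number;

        if (strncmp(path, "/news/", 6) != 0)
            return NULL;
        snprintf(group, sizeof group, "%s", path + 6);

        slash = strrchr(group, '/');  /* last component is the number    */
        if (slash == NULL)
            return NULL;
        *slash = '\0';
        number = atol(slash + 1);

        for (p = group; *p != '\0'; p++)  /* directory path -> group name */
            if (*p == '/')
                *p = '.';

        return fetch_by_msgid(group_to_msgid(group, number));
    }

    int main(void)
    {
        const char *article = nnfs_open("/news/comp/lang/lisp/1234");
        puts(article != NULL ? article : "no such article");
        return 0;
    }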

the real impediment to change is not a lack of good ideas, it's not that
things cannot be done differently, it's not that they cannot be implemented
at one site and spread like wildfire, it's that Unix folks have a closed
mindset.

#\Erik