[gclist] Name that hypothesis

Charles Fiterman cef@geode.geodesic.com
Tue, 10 Dec 1996 15:03:41 +0000


At 02:39 PM 12/10/96 -0500, you wrote:
>I'm really curious about your qsort results.
>
>Most people identify "qsort" with "pick the first (last) element as the pivot", 
>an identification that becomes clear when they tell you that a pre-sorted input 
>is the worst case, and produces O(n^2) behavior (i.e., every partition pulls off 
>just 1 element).
>
>In fact, *even Tony Hoare's original paper on quicksort* recognized this 
>problem, and recommended *against* using the first or last element.  Hoare 
>suggested taking the first, middle, and last element and using whichever was "in 
>the middle" of those three values as the pivot.  Knuth later showed that doing 
>this by sorting those three elements and using the resulting center element as 
>the pivot was "safe", in that it didn't affect the statistical assumptions that 
>give you the "average O(n log n)" case.  If you do it this way, the resulting 
>partitions are optimal for both the initially sorted and initially sorted 
>backwards cases. It also guarantees that neither the first nor the last element 
>will be swapped, which you can use to eliminate most of the index boundary 
>comparisons.  For small keys, this can be half the cost of running the 
>algorithm!
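
For reference, the middle-of-three selection described above comes out roughly like
this (a quick sketch in C for integer keys, not anyone's production qsort; the
function name is just mine):

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Sort a[lo], a[mid], a[hi] in place and use the middle value as the pivot.
   Afterwards a[lo] <= pivot <= a[hi], so the end elements never need to move
   past the pivot and can serve as sentinels for the partition scan. */
static int median_of_three(int a[], int lo, int hi)
{
    int mid = lo + (hi - lo) / 2;
    if (a[mid] < a[lo]) swap(&a[mid], &a[lo]);
    if (a[hi]  < a[lo]) swap(&a[hi],  &a[lo]);
    if (a[hi]  < a[mid]) swap(&a[hi],  &a[mid]);
    return a[mid];
}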

My experiments with middle-of-three showed it was a lot of overhead for nothing;
it rarely prevented pathological behavior. I rejected Knuth's suggestion of
random pivots because it merely makes the bad cases hard to find, and generating
random numbers isn't that quick anyway.

If there is order, a center pivot is clearly best. If there is no order, it clearly
doesn't matter, so a center pivot is not bad. But order can mean things like merging
decks, where a center pivot is a disaster; there my 1/4, 1/8 strategy works well.
That leaves problems like all keys being 7 or 8, which are pathological unless you
are allowed to expand the center partition, but in my tests expanding it cost too
much in cases like random data. Then I noticed that shell sort was nearly optimal
in cases like all keys 7 or 8, so I settled on: after the 1/8 attempt, go to shell
sort. That was tuned to my benchmarks. But the key is that if you can prevent
pathological behavior you won't do that badly. I was running a dozen tests in under
a minute while others were taking hours. If I had simply expanded the center
partition I would still have beaten everything but my own later work, and I
wouldn't have lost by much.
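
To make the shape of it concrete, here is a rough sketch for integer keys. The
function names, the small-range cutoff, and the lopsidedness test are all made up
for illustration; my real code was tuned against my benchmarks and differs in
detail:

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* shell sort with the classic gap sequence n/2, n/4, ..., 1 */
static void shell_sort(int a[], int n)
{
    for (int gap = n / 2; gap > 0; gap /= 2)
        for (int i = gap; i < n; i++) {
            int v = a[i], j = i;
            while (j >= gap && a[j - gap] > v) { a[j] = a[j - gap]; j -= gap; }
            a[j] = v;
        }
}

/* Partition around the value at pivot_index; return the pivot's final slot. */
static int partition(int a[], int lo, int hi, int pivot_index)
{
    swap(&a[pivot_index], &a[hi]);
    int pivot = a[hi], store = lo;
    for (int i = lo; i < hi; i++)
        if (a[i] < pivot) swap(&a[i], &a[store++]);
    swap(&a[store], &a[hi]);
    return store;
}

/* attempt 0: center pivot.  If the split comes out badly lopsided, retry the
   same range with the element 1/4 of the way in, then 1/8 of the way in, and
   after that give up on quicksort for this range and shell sort it. */
static void sort_range(int a[], int lo, int hi, int attempt)
{
    int n = hi - lo + 1;
    if (n < 2)
        return;
    if (attempt > 2 || n <= 16) {    /* 16 is an arbitrary cutoff for the sketch */
        shell_sort(a + lo, n);
        return;
    }
    int offset = (attempt == 0) ? n / 2 : (attempt == 1) ? n / 4 : n / 8;
    int p = partition(a, lo, hi, lo + offset);
    int small = (p - lo < hi - p) ? p - lo : hi - p;
    if (small < n / 16)                       /* "bad" split: escalate */
        sort_range(a, lo, hi, attempt + 1);
    else {
        sort_range(a, lo, p - 1, 0);
        sort_range(a, p + 1, hi, 0);
    }
}

void my_qsort(int a[], int n) { sort_range(a, 0, n - 1, 0); }

The attempt counter is the point: no input can make the sketch keep re-partitioning
a range, because after two retries it hands the range to shell sort.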

>The worst case is still O(N^2), and your "merging decks" example produces this 
>behavior.  (It's still "twice as good" as pure center-pivot - it pulls off the 
>two highest or two lowest elements each time.)

No, it still has to choose a pivot, and the three candidates will still be 1, 100000
and 100000, so the middle one has to be bad. The next time you get the same kind of
rotten choice. You get two elements sorted out per pivot instead of one, because
each key is duplicated.
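
To be concrete about the input I mean (an illustration with two identical decks of
100000 keys laid back to back; real decks would usually differ):

#include <stdio.h>

#define DECK 100000

int main(void)
{
    static int a[2 * DECK];
    for (int i = 0; i < DECK; i++) {
        a[i]        = i + 1;          /* first deck:  1 .. 100000 */
        a[DECK + i] = i + 1;          /* second deck: 1 .. 100000 */
    }
    int lo = 0, hi = 2 * DECK - 1, mid = lo + (hi - lo) / 2;
    /* the three middle-of-three candidates */
    printf("%d %d %d\n", a[lo], a[mid], a[hi]);   /* prints: 1 100000 100000 */
    return 0;
}

The middle of 1, 100000 and 100000 is 100000, the largest key in the range, so each
partition peels off just the two copies of the current maximum.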

>Using your idea of noticing when a partition is bad and then using, not the 
>center element, but the one 1/4 of the way in, combined with continuing to take 
>the "middle of first, nth-of-the-way in, and last", would work very well for 
>this case.  Another alternative is to generalize the "middle of three".  There's 
>an old variant, perhaps even due to Hoare, called SampleSort, which says that 
>you should sample k elements, sort them, and choose their middle element as the 
>pivot.  Statistically, k=3 is already "good enough".  "Deck shuffling" has a 
>bias that makes it fail badly when k is fixed at 3, and the 3 elements are 
>always the first, middle, and last.  One could step up to k=5 when things go 
>bad, or stick to k=3 and use 3 *randomly chosen* elements.  In either case, the 
>worst case remains O(N^2), but it's going to be quite difficult to construct an 
>example.  (I will admit that hardly anyone bothers to do this.  The "shuffling 
>decks" case is pretty unlikely, and when it does occur, the two "decks" are 
>almost certainly of different lengths.  Smaller partitions in this case are of 
>length something like m+2, where m is the difference in the "deck" lengths, so 
>for any reasonable m, the performance isn't too bad.)
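
For reference, the three-randomly-chosen-elements variant would look roughly like
this (a sketch; the function name is mine and rand() is only for illustration):

#include <stdlib.h>

/* Return the middle value of three elements picked at random from a[lo..hi]. */
static int median3_random(const int a[], int lo, int hi)
{
    int n = hi - lo + 1;
    int x = a[lo + rand() % n];
    int y = a[lo + rand() % n];
    int z = a[lo + rand() % n];
    if ((x <= y && y <= z) || (z <= y && y <= x)) return y;
    if ((y <= x && x <= z) || (z <= x && x <= y)) return x;
    return z;
}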

What you are going to do is spend so much time choosing a pivot that you will degrade
the simple cases like random data and sequential data. And whatever you do, there
is still a pathological case. The problem is not to have a pathological case. Ever!
Not to hide it so you don't hit it. That would truly be coding to hit the benchmark.

What I did was restate the problem: write qsort so that no amount of labor will ever
produce a pathological case, and within that constraint go for the benchmark. I
didn't even try Knuth's suggestion of random pivots because I considered it a kind
of cheating.

>I've never seen qsort code that uses a pure center-pivot algorithm.  Anyone who 
>propagates such code simply doesn't know what he's doing.  (That's not to say 
>there isn't *tons* of such code out there.)

Microsoft sells it, or did three years ago, and so did a few Unix vendors. In fact
what Microsoft did was really grungy: before sorting a partition they checked whether
it was already in order. That is really going for the benchmark. They did worse than
anybody and deserved to. They did spectacularly on sorted data, but I did almost
as well, and I continued to do almost as well on backward data while they went into
the multi-hour pits.

>Quicksort is one of the most heavily studied algorithms around.  Optimizations 
>for the many-equal-elements case are known.  (Essentially, pull runs of equal 
>elements "off to the side" as you find them; leave just one "representative" to 
>participate in later sorting passes.)  A less-well-known, but very important 
>optimization has to do with the end case.  Most people just let the recursion - 
>whether written that way or with explicit stack manipulation - run down to some 
>trivial case, often a 1-element partition (or a 3-element partition if they are 
>using the "center of 3" algorithm).  This is bad, since there is a large fixed 
>cost to do a partition run.  It's much better to sort "small" partitions - the 
>size of which depends on all sorts of details, but is typically in the 6-12 
>element range - by some other, low-overhead methods - often bubble sort.  Ah, 
>but in fact one can do better!  Simply ignore "small" partitions as they are 
>formed - detect them *before* recursion/stack push and do *nothing*.  When 
>quicksort is done, do a single bubble-sort over the entire array.  Since the 
>cost of bubble-sort depends on the maximum distance an element must move, and no 
>element can possibly leave the last partition that contained it, and that's 
>certain to be small, you get essentially the same effect as bubble sorting the 
>small partitions directly, but you only pay the fixed costs of doing the bubble 
>sort once, and you save the cost of keeping track of the small partitions.
>
>Shell sort isn't bad, but quicksort is almost always faster, and has rather nice 
>memory access patterns in a VM system (though the access patterns aren't LRU; 
>ideally, you'd want to be able to tell the pager that they are linear).  Except 
>in artificial circumstances, I'm not sure the effort of writing two complete 
>sorts is worth it, since the "bad" cases for quicksort can so easily be made 
>very unlikely.
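
For reference, the ignore-small-partitions trick described above comes out roughly
like this (a sketch; I've used a final insertion-sort pass where you said bubble
sort, which has the same effect on an array where nothing is far from home, and the
cutoff of 10 is made up):

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

#define SMALL 10    /* leave partitions this size or smaller for the final pass */

static void quicksort_coarse(int a[], int lo, int hi)
{
    while (hi - lo + 1 > SMALL) {
        int mid = lo + (hi - lo) / 2;
        /* middle-of-three pivot, as earlier in this message */
        if (a[mid] < a[lo]) swap(&a[mid], &a[lo]);
        if (a[hi]  < a[lo]) swap(&a[hi],  &a[lo]);
        if (a[hi]  < a[mid]) swap(&a[hi],  &a[mid]);
        int pivot = a[mid];
        int i = lo, j = hi;
        while (i <= j) {
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) { swap(&a[i], &a[j]); i++; j--; }
        }
        quicksort_coarse(a, lo, j);   /* recurse on the left piece ...   */
        lo = i;                       /* ... and loop on the right piece */
    }
    /* pieces of SMALL or fewer elements are left unsorted here */
}

static void insertion_pass(int a[], int n)
{
    for (int i = 1; i < n; i++) {
        int v = a[i], j = i;
        while (j > 0 && a[j - 1] > v) { a[j] = a[j - 1]; j--; }
        a[j] = v;
    }
}

void sort(int a[], int n)
{
    quicksort_coarse(a, 0, n - 1);
    insertion_pass(a, n);    /* one cheap pass over the whole array finishes it */
}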

Shell sort also has nice VM patterns; I spent time thinking about that. Yes, qsort
is almost always better, but I only used shell sort when I saw qsort going into
the pits, and it just happens that those cases make shell sort look very good.
And you know, I tried heap sort, insertion sort, etc.; this problem was sort of
a hobby for about a month. Note that shell sort has no pathological cases, so
on the benchmarks it would have won by massive margins against anything but
my qsort.



Charles Fiterman	Geodesic Systems
Phone 312-728-7196	4745 N. Ravenswood Suite 111
Fax   312-728-6096	Chicago Il 60015
cef@geodesic.com

A computer language without garbage collection
  is like a city without garbage collection.