On 6/19/24 16:27, Pádraig Brady wrote:
On 19/06/2024 19:48, Mark E. Shoulson wrote:
Turns out whatever I'm doing doesn't work right when the input is a pipe
(though input redirection from a file is okay), so in addition to being
incomplete, this patch is actually buggy.  Will work on it, but still
hoping to hear opinions.

~mark

On 6/18/24 21:49, Mark E. Shoulson wrote:
Hi. New here, to post a patch which I hope will inspire someone to do
a better job.

I can't really believe that for all these years, we still so very,
very often run sort(1) and pipe it through head(1) or tail(1), because
we only want the top/bottom 20 lines or whatever.  That means we're
sorting a potentially gigantic list of lines but not caring about most
of the information.

I'm attaching a patch that adds `--head` and `--tail` options to
sort(1), so you say `sort --head=20` (which is equivalent to saying
`sort --tail=-20` and vice-versa).  I've worked this into sort(1) VERY
VERY BADLY, I'm sorry to say, but maybe someone with better
familiarity with the code can do it better.  Basically, I'm
special-casing out the situation when --head is active and doing my
own thing with it, and then lying to the rest of the program about
it.  But it works, anyway.

The functionality is useful and similar to the --unique optimization.
We'd have to do better than n^2 though.

Ah, but it isn't n^2.  True, I'm using insertion sort, which is n^2, but it isn't really n, the number of lines, it's only k, the number of lines in the head that you're asking for.  I've explicitly limited the size of the head buffer (somewhat arbitrarily) to a maximum of 256.  Maybe 1000 or 2000 would be more reasonable.  But in any case, if k << n, then O(nk) is basically O(n), and if k isn't much much less than n, then n isn't very big anyway and it hardly matters.

The advantage here, what it has over sort|head, is that if an incoming line isn't smaller (larger) than the top line of the head buffer, we don't have to deal with it *at all*, and we can throw it away after a single comparison (which is the best-case situation: an already-sorted list is O(n)).  So we aren't actually sorting the whole input.

I think it is very reasonable for this feature to limit the size of the head/tail that you can take; if you're interested in something more serious, you can always use true sort|head.  This is an optimization for when you know that you're only taking a small part of the output (actually, a lot of the time people use head -1!)  This just seems like such an obviously useful feature I'm surprised it wasn't there already.

I probably have not properly accounted for the interaction of --head with things like --unique and who knows what else.

~mark


Reply via email to