For a while, I've been using a small program I wrote (with help from a GPL AVL library) to filter duplicate lines from unsorted input. I thought I might see whether this could be added to `uniq` (or provided some other way), but I saw that a nearly identical proposal (https://lists.gnu.org/archive/html/coreutils/2011-11/msg00016.html) had already been put forth and rejected.
I thought it might be worth making the case again, with an expanded rationale, especially as I already have a proof of concept (available below) and am willing to write the code, documentation, translations, etc.

It was said in the replies to the original proposal that it's up to the user to decide whether to run `sort` and pipe its output to `uniq`. But in all the years I've used coreutils, I've never once used `uniq` without `sort`. I've spoken to many others, and their experience comports with mine. This was not because I wanted the output to be sorted; in fact, I specifically didn't. Most times, I want (and even require) that duplicated lines be stripped as soon as the data becomes available, and that the remaining lines stay in their original order. This is especially useful for log files, journals, output from statistical software, etc.

The pervasive `sort | uniq` idiom, of course, besides changing the order of the data, carries the other problem of completely arresting the flow of data (as `sort` has to read all of the data in the pipe before it can emit anything). I view this as a limitation since it counteracts one of the main benefits of using a CLI pipeline, namely that the whole pipeline works in unison and processes data in a streaming fashion.

The most sensible place to add this functionality (which I think many people would enjoy) is as an option for the `uniq` command (like `uniq --stream` or similar).

It was also said in the original replies that this might constitute feature creep, and that `uniq` as it stands now is less than 200 lines of code. I'm sympathetic to this view, especially since adding a tree or hash to `uniq` would considerably increase its size. But maybe that's an OK thing, especially if it brings the functionality of `uniq` more in line with people's expectations of the command. It would also not burden users with increased memory usage; the tree would only be initialized if the user explicitly asked for this option.
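To make the requested behaviour concrete, here is a minimal Python sketch of order-preserving, streaming duplicate removal. It is only an illustration and not the proof of concept: it uses a hash set where the proof of concept relies on an AVL tree, and the names `stream_unique` and `filter_stream` are made up for this example. Note that `uniq --stream` itself is just the proposed spelling; no such option exists today.

```python
from typing import Iterable, Iterator, TextIO


def stream_unique(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line the first time it is seen, in input order."""
    seen = set()  # memory grows with the number of distinct lines
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line  # handed on before any further input is read


def filter_stream(src: TextIO, dst: TextIO) -> None:
    """Copy src to dst, dropping lines that have already appeared."""
    for line in stream_unique(src):
        dst.write(line)
        dst.flush()  # push each new line downstream right away
```

Calling `filter_stream(sys.stdin, sys.stdout)` (after `import sys`) would turn this into a pipeline filter. Unlike `sort | uniq`, it writes each new line as soon as it arrives; the cost is that every distinct line seen so far has to be kept in memory.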
(proof of concept) [https://github.com/tonyfischetti/eweniq] (apologies, this dates from before I knew better and would have chosen a freer host for version control)