Greetings, I'd like to suggest that the team look (again) at implementing an --unsorted option for 'uniq'.
The idea was proposed (and rejected) about 10 years ago (https://lists.gnu.org/archive/html/coreutils/2011-11/msg00016.html). A lot has changed since then.

First, the justification, which was clearly stated in the original post: 'uniq' currently relies on 'sort'. When the input file is small, this is fine. But when the input file is large, this is wasteful: the complexity is O(N log N), whereas if 'uniq' maintained a hash table itself the complexity would be only O(N). I'm wondering whether it is better to relax the requirement that the input to 'uniq' be sorted.

I think it is worth reconsidering based on the following. In my experience, a very common use case is counting the frequency of a small number of buckets from a large input set (for example, counting the number of requests per IP address in a very large web server log file). Currently this is handled with a pipeline like:

  awk '{ print ... }' input-file | sort | uniq -c

To count Z buckets from an input of N lines, this pipeline runs in O(N log N) time and needs enough memory to hold all N lines (otherwise actual performance on large data sets is extremely slow). With the proposed 'uniq --unsorted', the running time is O(N) and only Z lines need to be held in memory (not to mention a much better constant factor). The difference is significant considering the size of today's log files: it could be a factor of 20 when the input has millions of lines.

Another common use case is a streaming "filter", where the input arrives from "tailing" a file. For example:

  tail -f logfile | awk '{ print ... }' | uniq --unsorted

would report new unique lines as they appear in the file. This is not possible with the sort-based approach.

The main objection in the previous discussion, that the implementation is complex, seems a little outdated. The coreutils have been extended over time with advanced features, well beyond what can be implemented with a few simple lines of code on top of the standard library.
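For comparison, the one-pass hash-based counting described above can already be approximated with an awk associative array (a rough sketch of the idea, not a substitute for a proper 'uniq --unsorted'):

```shell
# One-pass, hash-based counting: O(N) time, memory only for the
# Z distinct keys. The trailing sort orders only the Z counts,
# which is cheap since Z is small.
printf '1.2.3.4\n5.6.7.8\n1.2.3.4\n' \
  | awk '{ count[$0]++ } END { for (k in count) print count[k], k }' \
  | sort -rn
```

The output matches 'sort | uniq -c' up to whitespace and ordering; the point is that the per-line cost is a single hash-table update rather than a full sort of the input.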
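As a sketch of the proposed streaming semantics, the classic awk idiom below prints each line only the first time it appears, using a hash table keyed on the whole line, so it works on unsorted and even unbounded ("tail -f") input:

```shell
# Print each distinct line at its first occurrence, in input order;
# seen[] is a hash table holding one entry per distinct line.
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
# prints: a b c (one per line)
```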
There are many good hashing implementations that could be reused, and doing so would not increase the code footprint by more than a few tens of lines. A secondary objection was that it would be impossible to handle the case of infinite input size. That argument is true for many problems, but given today's hardware it is unlikely to be an issue for any practical usage. Could you advise or provide feedback? I'm sure there will be many volunteers (me included) willing to contribute to such an important improvement.

Yair