The documentation of 'uniq' [1] says: "The input need not be sorted, but repeated input lines are detected only if they are adjacent. If you want to discard non-adjacent duplicate lines, perhaps you want to use 'sort -u'."
When I wrote gnulib-tool many years ago, as a shell script, I naturally used - 'sort -u' to remove duplicates, - 'sort' + 'join' to compute set intersections, and - 'sort' + 'join -v' to compute set differences. In a couple of places I only wanted to remove duplicates. Since the coreutils don't offer a way to remove duplicates without sorting, I accepted the sorting as a side effect. This has led to inconsistencies: In some places, gnulib-tool sorts file names, and in other places it doesn't. [2] In order to avoid mistakes of this kind, I propose that coreutils at least gives a hint that removing duplicate lines from an unsorted list of lines is possible, in O(N log N) run time and O(1) space. The way to do so is described in [3]: $ cat data.txt | nl | sort -k 2 | uniq -f 1 | sort -n | sed -e 's/^[^ ]* //' or $ cat -n data.txt | sort --key=2.1 -b -u | sort -n | sed -e 's/^[^ ]* //' Other methods described in [3], such as counters maintained in an 'awk' or 'perl' process, or the 'unique' program that is part of the 'john' package [4], can be ignored, because they need O(N) space and are thus not usable for 40 GB large inputs [5]. * Would you accept a documentation change to the 'uniq' page that explains the alternative with the line numbering trick mentioned above? * What do you think about adding - a 'uniq' option -k / --keep-unsorted that implies unsorted input and removal of occurrences 2,...,n of a line that occurs n times, - a 'uniq' option --keep-unsorted-last that implies unsorted input and removal of occurrences 1,...,n-1 of a line that occurs n times? These options would be implemented by spawning a pipe of 'cat', 'sort', 'sort', 'sed' programs as shown above, optionally resorting to all-in-memory processing if the input's size is below a certain threshold (like 'sort' does). Bruno [1] https://www.gnu.org/software/coreutils/manual/html_node/uniq-invocation.html [2] https://lists.gnu.org/archive/html/bug-gnulib/2024-03/msg00334.html [3] https://unix.stackexchange.com/questions/11939/ [4] https://www.openwall.com/john/ [5] https://stackoverflow.com/questions/30906191/