Re: Extend uniq to support unsorted list based on hashtable

Assaf Gordon Fri, 29 May 2020 21:47:56 -0700

Hello,

On 2020-05-29 10:16 p.m., Yair Lenga wrote:

> Wanted to suggest that the team will look (again) at implementing
> --unsorted option for 'uniq'.
>
> The idea was proposed (and rejected) about 10 years ago
> (https://lists.gnu.org/archive/html/coreutils/2011-11/msg00016.html).
> Lot of things have changed from the past.
>
> [...]
>
> Can you advise/provide feedback. I'm sure that there will be many
> volunteers (me included) to contribute to such important improvement.


"uniq" is standardized by POSIX to work by "comparing adjacent lines"
(from: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html),
hence the requirement to pre-sort the input.


While it could be extended with a completely different hash-based
implementation, I don't think this is likely to happen.
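
To make concrete what "hash-based" means here: remember every line
seen so far and print a line only on its first appearance. The
following is a minimal sketch using an awk associative array. It is
not a proposed 'uniq' option, and the function name "uniq_unsorted"
is made up for illustration.

```shell
# Hypothetical helper: the name is an assumption, not a coreutils
# or datamash command.
# Prints each distinct line once, in order of first appearance,
# without requiring sorted input.
uniq_unsorted() {
  # seen[] is an awk associative array keyed by the whole line ($0).
  # The pattern is true only the first time a given line is read.
  awk '!seen[$0]++' "$@"
}

# Usage: duplicates are dropped even though they are not adjacent.
printf '%s\n' b a b c a | uniq_unsorted
```

The usage line prints b, a, c (one per line). Unlike sort + uniq,
this keeps every distinct line in memory, so memory use grows with
the number of distinct lines in the input.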

As an alternative (and a shameless plug), allow me to point to
GNU Datamash ( https://www.gnu.org/software/datamash/ ).
On one hand, it already has a hash-based implementation to
remove lines with duplicated field values (called "rmdup").
Consider the following contrived example:

  $ (printf "%s\t%s\n" 9 B 3 A ; seq 10 | paste - -) | datamash rmdup 1
  9     B
  3     A
  1     2
  5     6
  7     8

And on the other hand, because 'datamash' is non-standard,
there's less of a problem in adding new functionality (i.e. "bloat" is
not as big a concern as it is for coreutils).

Hope this helps.

regards,
 - assaf
