On 2024-03-26 09:40, Bruno Haible wrote: 
> Other methods described in [3], such as counters maintained in an 'awk'
> or 'perl' process, or the 'unique' program that is part of the 'john' package
> [4], can be ignored, because they need O(N) space and are thus not usable for
> 40 GB large inputs [5].
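For readers following along: the in-memory awk counter method being ruled out here is presumably something like the classic one-liner below, whose `seen` array grows with the number of distinct lines and hence needs O(N) memory.

```shell
# Order-preserving dedup via an in-memory awk counter.
# The 'seen' array holds every distinct line, hence O(N) space.
printf '%s\n' b a b c a | awk '!seen[$0]++'
# prints: b a c
```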

Are approaches that use a file (e.g. a hashed key/value store on disk)
instead of virtual memory also off the table?

Presumably those are acceptable, since you find acceptable a solution
involving sort, whose cost you're willing to suffer twice. :)

>   These options would be implemented by spawning a pipe of 'cat', 'sort',
>   'sort', 'sed' programs as shown above, optionally resorting to all-in-memory
>   processing if the input's size is below a certain threshold (like 'sort'
>   does).
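For concreteness, one plausible shape of such a decorate/sort/undecorate pipeline is sketched below. This is my guess at the idea, not necessarily the exact pipeline referenced above (I use uniq and cut where the quoted sketch mentions sed): tag each line with its input position, let sort's disk-backed external merge do the heavy lifting, then restore input order.

```shell
# Emit the first occurrence of each line, preserving input order,
# leaning on sort's external (disk-spilling) merge for large inputs.
printf '%s\n' b a b c a |
    cat -n |              # tag each line with its input position
    sort -k2 -k1,1n |     # group duplicates; earliest position first in each group
    uniq -f1 |            # keep the first line of each duplicate group
    sort -k1,1n |         # restore original input order
    cut -f2-              # drop the position tag
# prints: b a c
```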

My only remark is that perhaps unsorted uniq should eventually be done
properly from the ground up, with its own temporary-file strategy and
logic for switching to it once memory use exceeds a certain threshold.
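To illustrate what "its own temporary-file strategy" could mean, here is a deliberately crude sketch of hash-partitioning: scatter the position-tagged input into bucket files so that only one bucket's worth of distinct lines must be held in memory at a time. Bucketing by line length modulo 8 stands in for a real hash, and the adaptive switch-over-a-threshold logic is not modeled; `unsorted_uniq` is a name I made up for the sketch.

```shell
# Sketch: order-preserving dedup via hash partitioning to temp files.
unsorted_uniq() {
    tmpdir=$(mktemp -d) || return 1

    # 1. Tag each line with its input position, scatter into buckets.
    #    Line length mod 8 is a crude stand-in for a real hash.
    nl -ba | awk -v dir="$tmpdir" '{
        line = $0
        sub(/^[^\t]*\t/, "", line)      # strip the "  N\t" position tag
        b = length(line) % 8
        print > (dir "/bucket." b)
    }'

    # 2. Dedup each bucket in memory (only a bucket, not the whole
    #    input, must fit), keeping first occurrences, then merge back
    #    into original input order and drop the position tags.
    for f in "$tmpdir"/bucket.*; do
        [ -f "$f" ] || continue
        awk '{
            key = $0
            sub(/^[^\t]*\t/, "", key)
            if (!(key in seen)) { seen[key] = 1; print }
        }' "$f"
    done | sort -k1,1n | cut -f2-

    rm -rf "$tmpdir"
}

printf '%s\n' bb a bb ccc a | unsorted_uniq
# prints: bb a ccc
```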

The proposed pipelined implementation (which seems like a good idea
for prototyping the feature up front, and even for releasing it) can
then be replaced.

The pipelined implementation can be used to develop all the valuable
test cases for the feature, so that the real solution can then be
banged out with one's eyes closed. :)
