On Friday, 19 February 2021 at 00:13:19 UTC, Jon Degenhardt wrote:
On Wednesday, 17 February 2021 at 04:10:24 UTC, tsbockman wrote:
I spent some time experimenting with this problem, and here is the best solution I found, assuming that perfect de-duplication is required. (I'll put the code up on GitHub / dub if anyone wants to have a look.)

It would be interesting to see how the performance compares to tsv-uniq (https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The prebuilt binaries turn on all the optimizations (https://github.com/eBay/tsv-utils/releases).

My program (called line-dedup below) is modestly faster than yours, with the gap gradually widening as files get bigger. Similarly, when not using a memory-mapped scratch file, it is modestly less memory hungry than yours, again with the gap widening as files grow.

In neither case is the difference very exciting though; the real benefit of my algorithm is that it can process files too large for physical memory. It might also handle frequent hash collisions better, and could be upgraded to handle huge numbers of very short lines efficiently.
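The external-memory approach isn't spelled out above, but one common strategy it alludes to (scatter lines into hash-partitioned scratch files small enough to deduplicate in memory, then process each partition independently) can be sketched roughly as follows. This is an illustrative Python sketch, not the actual line-dedup implementation; the partition count is an arbitrary assumption, and unlike tsv-uniq this version groups output by partition rather than preserving the input's overall line order:

```python
import hashlib
import os
import tempfile

def dedup_external(src_path, dest_path, num_partitions=16):
    """Deduplicate lines from src_path into dest_path using scratch
    files, so peak memory is bounded by the largest partition rather
    than by the whole input."""
    # Phase 1: scatter each line into a partition chosen by its hash.
    parts = [tempfile.NamedTemporaryFile("w+", delete=False)
             for _ in range(num_partitions)]
    with open(src_path) as src:
        for line in src:
            h = int.from_bytes(
                hashlib.blake2b(line.encode()).digest()[:8], "little")
            parts[h % num_partitions].write(line)
    # Phase 2: deduplicate each partition in memory. Identical lines
    # always hash to the same partition, so this is globally correct.
    with open(dest_path, "w") as dest:
        for f in parts:
            f.seek(0)
            seen = set()
            for line in f:
                if line not in seen:
                    seen.add(line)
                    dest.write(line)
            f.close()
            os.unlink(f.name)
```

Because each partition holds roughly 1/num_partitions of the input, the partition count can be raised until each `seen` set fits in physical memory, which is what lets this family of algorithms handle files larger than RAM.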

Test 0 (139 MiB, 1537300 unique lines of 2000004):
    sort source | uniq > dest
        Wall time: 2.04 s
        User time: 9.34 s
        System time: 0.23 s
        Max resident set: 385 MiB
    tsv-uniq source > dest
        Wall time: 0.82 s
        User time: 0.73 s
        System time: 0.13 s
        Max resident set: 229 MiB
    line-dedup source dest
        Wall time: 0.58 s
        User time: 0.52 s
        System time: 0.05 s
        Max resident set: 206 MiB

Test 1 (1.4 GiB, 14338280 unique lines of 20000004):
    sort source | uniq > dest
        Wall time: 22.9 s
        User time: 107 s
        System time: 2.16 s
        Max resident set: 3.76 GiB
    tsv-uniq source > dest
        Wall time: 9.94 s
        User time: 9.25 s
        System time: 1.3 s
        Max resident set: 2.34 GiB
    line-dedup source dest
        Wall time: 6.34 s
        User time: 5.88 s
        System time: 0.44 s
        Max resident set: 1.98 GiB

Test 2 (4.2 GiB, 44379959 unique lines of 60000004):
    sort source | uniq > dest
        Wall time: 87.1 s
        User time: 352.3 s
        System time: 7.41 s
        Max resident set: 7.84 GiB
    tsv-uniq source > dest
        Wall time: 42.3 s
        User time: 40.7 s
        System time: 4.82 s
        Max resident set: 7.86 GiB
    line-dedup source dest
        Wall time: 23 s
        User time: 19.6 s
        System time: 1.4 s
        Max resident set: 5.96 GiB

The test files have a fractal distribution, with many lines having a few duplicates, and a few lines having many duplicates.
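The generator for the test files isn't shown, but a long-tailed duplicate mix like the one described (many lines with a few duplicates, a few lines with very many) can be sketched with a power-law sampler. The function name, line format, and skew exponent below are illustrative guesses, not the actual test setup:

```python
import random

def write_skewed_test_file(path, num_lines, num_unique, seed=42):
    """Write num_lines lines drawn from up to num_unique distinct
    values, skewed so a few values repeat many times while most
    appear only a handful of times."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(num_lines):
            # Cubing a uniform sample concentrates it near zero, so
            # low indices are drawn far more often than high ones.
            idx = int(num_unique * rng.random() ** 3) % num_unique
            f.write(f"line-{idx:09d}\n")
```

Sweeping the exponent (3 here) adjusts how heavy the tail is, which makes it easy to test how each tool behaves as the duplicate ratio changes.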
