Hey Assaf,

Thank you very much for taking the time to share
your thoughts.

People like you, willing to improve free software
and help the common good, restore my faith in
humanity.


I followed much of your advice.

I ran more benchmarks: 

gawk, mawk, sort -u, sort | uniq, and ... datamash rmdup!

Plus, I benchmarked them against a bigger file.

~1 GB!

Plus, plus, I measured how much memory they used!


The result?

Hashing algorithms always won.

I expected it.

When I benchmarked data compression apps, one
named "rzip", which searched its cache of unique
values with hashing, was much faster.

    https://www.linuxjournal.com/article/8051


But, to be fair, I was surprised by how little
memory datamash seemed to use.

Maybe I screwed up.

I checked each process's maximum resident set
size during its lifetime with /usr/bin/time's "%M"
specifier.
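
For example, a single run looks like this (the
command and file name here are just an
illustration):

    # Print (to stderr) the elapsed seconds and peak memory (max RSS, in KB).
    /usr/bin/time -f 'elapsed=%e s, max RSS=%M KB' sort -u /tmp/data > /dev/null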


Plus, plus, plus, I created a composite
performance criterion of

    elapsed time * memory used

(It's in the attached spreadsheet.)
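
If you want to recompute it straight from the raw
CSV, something like this works (the file name
"results.csv" is just a placeholder for the
benchmark output):

    # Append a composite "time * memory" column; fields 5 and 6 are
    # time(sec) and memory(K) in the benchmark's CSV.
    awk -F, 'NR == 1 { print $0 ",time*memory"; next }
             { print $0 "," $5 * $6 }' results.csv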

The winner at ~1 GB?

datamash!

No matter whether the keys were unique or
duplicated, "datamash rmdup 1" won!

New files are attached.


Humble suggestion:

Don't worry about running out of memory.

Hashing could be added to the "uniq" command as...
(wait for it...) an option.

So you could always omit it and use sort.


For yours truly, the big gain is actually being
able to ditch sort.
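
The hashing one-liner from my benchmark script
already does that, with no sort and with the input
order preserved (the file names are placeholders):

    # Keep only the first occurrence of each line, in input order.
    awk '!seen[$0]++' input.txt > unique.txt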

Less is more.

It's a UNIX thing.


So it is said.

So it is done.

Sorting is great, but

less is number one!

~K

PS: Congratulations on datamash performing so
    well.

On 07/10/2018 12:21, Assaf Gordon wrote:
> tag 32099 notabug
> severity 32099 wishlist
> close 32099
> stop
> 
> Hello Kingsley,
> 
> On 09/07/18 10:29 PM, Kingsley G. Morse Jr. wrote:
> > And I like your benchmarks.
> > 
> > Mine are humbly attached.
> > 
> > They compare sorting to hashing.
> > 
> > At the moment, hashing seems to me to be about 70%
> > faster.
> > 
> > And to scale better.
> > 
> > I'd still like uniq to be faster and free from
> > sort.
> 
> Thank you for taking the time to create and provide these,
> including the scripts to reproduce it.
> 
> I've marked this item as "closed / wishlist" - but only because it
> is technically not a bug report. Discussion is welcomed to continue
> by replying to this thread.
> 
> Being "faster" alone is important but not the only criteria (as
> mentioned in the previous email's examples).
> 
> Your measurements are a good start towards stronger support for a
> hash-based uniq implementation (a working patch, btw, would be even
> stronger).
> 
> I would suggest several improvements to the measurements, to ensure
> relevant information is conveyed:
> 
> 1.
> Specify exactly which versions of awk/sort you are using
> (e.g. whether they are from your distribution or compiled from
> source, and if compiled, which compiler / optimization flags).
> What CPU is being used, how much memory does the computer have, etc.
> 
> 2.
> I would suggest adding "sort | uniq" to the charts, because that is the
> real base-line you are trying to improve.
> I would also consider adding "datamash rmdup" to the charts, because its
> hash-based implementation could serve as a ball-park indication of how
> a naive hash implementation would work in uniq.
> 
> 3.
> Your input is very small (up to 90K lines of 10 characters each) - that
> does not seem to stress-test any significant parts of the algorithms
> (and for a typical user, waiting 0.05 seconds vs 0.15 seconds does not
> matter).
> 
> I would recommend much larger input files, with much longer lines -
> trying to simulate various real-world scenarios.
> 
> Here's a project of someone who tried to compare various hashing
> functions: https://github.com/jhunt/hash/
> Under the "corpus" directory you'll find few demo data files.
> Even the largest one (qnames) is only 84MB - not very big.
> 
> Perhaps consider duplicating the content a few times, then testing it:
>   for i in $(seq 1 5);do cat qnames; done > input
> 
> To get even closer to real-world input, try several input files
> with different ratios of duplicated lines (your script creates completely
> random data - not very similar to real-world input files).
> 
> For example, what happens when many lines share similar sub-strings?
> Depending on the hashing implementation, this could lead to lower
> performance.
> 
> 4.
> It would be beneficial to see what the memory requirements and limits
> of a hash-based implementation are. Currently, because 'sort' is not limited
> by available RAM, there is no memory limit (though resorting to disk I/O
> might make everything slow).
> For example, is it possible to hash an 800M file on a machine with only 1GB
> of RAM? And without additional code, hashing files larger than available
> memory is simply not possible - a very big disadvantage.
> 
> 5.
> To compare "sort' fairly against other options,
> I would add to the measurements using "--parallel 1" and later "--parallel
> N" (N being the number of cpu/cores you have).
> Does that improve performance?
> 
> The "awk" (or datamash) programs do not have a built-in memory limit,
> while sort does. It would be useful to specify a large amount of
> memory for sort (using "-S") to ensure sort did not revert to disk I/O.
> 
> 6.
> The reported time in your script is the elapsed time (aka "wall time").
> A more relevant option would be "user time" + "kernel time"
> (options %S and %U in "time -f") - the wall time can be affected by
> factors that don't immediately relate to hashing/sorting performance.
> 
> 7.
> In your script you are providing file-based input;
> it might be useful to clear the kernel's cache before each run
> to ensure cached data does not affect the results (especially when dealing with
> large files).
> See: 
> https://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/
> 
> 
> 
> 
> To conclude,
> There is definitely merit in trying to optimize uniq's performance,
> but it must be weighed against other factors (e.g. the ability to 'uniq'
> a file larger than available memory, and the ease of using existing programs
> to achieve the same effect).
> 
> IMHO it is not a top-priority, so I don't want to give the impression
> that if an amazing speed improvement is presented, uniq will be swiftly
> rewritten.
> 
> But of course, a working patch and strong pro arguments would go a long
> way towards incorporating this feature.
> 
> regards,
>  - assaf
> 
> 
> 

-- 
Time is the fire in which we all burn.

# Report heading.
echo 'lines,line length,percent unique,algorithm,time(sec),memory(K)'

# Benchmark with increasing lines of data...
for data_lines in 1 10 100 1000 10000 100000 1000000 ; do

    # Benchmark with increasing line length..
    for line_length in 1 10 100 1000 ; do

        # Synchronize cached writes to persistent storage
        sync
        
        # Generate a certain number of lines of random data of a certain length
        tr -cd '[[:alnum:]]' < /dev/urandom | fold -bw "$line_length" | head -n "$data_lines" > /tmp/data

        # Calculate the percentage of lines that are unique.
        number_of_unique_lines="$( sort -u /tmp/data | wc -l )"
        percent_unique="$( calc -dp "$number_of_unique_lines/$data_lines" )"

        # Benchmark sort with its unique option.
        /usr/bin/time -o /tmp/sort_time -f %e,%M sort -u /tmp/data > /dev/null
        echo "$data_lines,$line_length,$percent_unique,sort,$( cat /tmp/sort_time )"

        # Benchmark gawk hashing.
        /usr/bin/time -o /tmp/gawk_time -f %e,%M gawk '!seen[$0]++' /tmp/data > /dev/null
        echo "$data_lines,$line_length,$percent_unique,gawk,$( cat /tmp/gawk_time )"

        # Benchmark mawk hashing.
        /usr/bin/time -o /tmp/mawk_time -f %e,%M mawk '!seen[$0]++' /tmp/data > /dev/null
        echo "$data_lines,$line_length,$percent_unique,mawk,$( cat /tmp/mawk_time )"

        # Benchmark datamash.
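        # Note: the shell splits this command at the pipe, so /usr/bin/time
        # wraps only "cat"; the time and memory recorded below are cat's,
        # not datamash's.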
        /usr/bin/time -o /tmp/datamash_time -f %e,%M cat /tmp/data | datamash rmdup 1 > /dev/null
        echo "$data_lines,$line_length,$percent_unique,datamash,$( cat /tmp/datamash_time )"

        # Benchmark a pipe of sort and unique.
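        # Note: as above, only the first command in the pipeline ("sort"),
        # not uniq, is measured by /usr/bin/time.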
        /usr/bin/time -o /tmp/sort_and_uniq_time -f %e,%M sort /tmp/data | uniq > /dev/null
        echo "$data_lines,$line_length,$percent_unique,sort and uniq,$( cat /tmp/sort_and_uniq_time )"

    done
done
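
For what it's worth, the script writes its CSV to
standard output, so collecting the results is just
(the script and output file names are
placeholders):

    bash benchmark.sh > results.csv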

Attachment: hash_and_sort_benchmarks.2.gnumeric
Description: updated Gnumeric spreadsheet

lines,line length,percent unique,algorithm,time(sec),memory(K)
1,1,1,sort,0.00,1848
1,1,1,gawk,0.00,3424
1,1,1,mawk,0.00,1780
1,1,1,datamash,0.00,1580
1,1,1,sort and uniq,0.00,1828
1,10,1,sort,0.00,1916
1,10,1,gawk,0.00,3352
1,10,1,mawk,0.00,1832
1,10,1,datamash,0.00,1560
1,10,1,sort and uniq,0.00,1828
1,100,1,sort,0.00,1940
1,100,1,gawk,0.00,3324
1,100,1,mawk,0.00,1680
1,100,1,datamash,0.00,1564
1,100,1,sort and uniq,0.00,1904
1,1000,1,sort,0.00,1788
1,1000,1,gawk,0.00,3312
1,1000,1,mawk,0.00,1740
1,1000,1,datamash,0.00,1704
1,1000,1,sort and uniq,0.00,1916
10,1,0.9,sort,0.00,1780
10,1,0.9,gawk,0.00,3412
10,1,0.9,mawk,0.00,1628
10,1,0.9,datamash,0.00,1584
10,1,0.9,sort and uniq,0.00,1840
10,10,1,sort,0.00,1864
10,10,1,gawk,0.00,3444
10,10,1,mawk,0.00,1792
10,10,1,datamash,0.00,1616
10,10,1,sort and uniq,0.00,1928
10,100,1,sort,0.00,1792
10,100,1,gawk,0.00,3252
10,100,1,mawk,0.00,1784
10,100,1,datamash,0.00,1620
10,100,1,sort and uniq,0.00,1964
10,1000,1,sort,0.00,1972
10,1000,1,gawk,0.00,3248
10,1000,1,mawk,0.00,1772
10,1000,1,datamash,0.00,1640
10,1000,1,sort and uniq,0.00,1948
100,1,0.49,sort,0.00,1920
100,1,0.49,gawk,0.00,3516
100,1,0.49,mawk,0.00,1636
100,1,0.49,datamash,0.00,1616
100,1,0.49,sort and uniq,0.00,1828
100,10,1,sort,0.00,1968
100,10,1,gawk,0.00,3548
100,10,1,mawk,0.00,1696
100,10,1,datamash,0.00,1668
100,10,1,sort and uniq,0.00,1924
100,100,1,sort,0.00,1916
100,100,1,gawk,0.00,3536
100,100,1,mawk,0.00,1776
100,100,1,datamash,0.00,1632
100,100,1,sort and uniq,0.00,1928
100,1000,1,sort,0.00,1868
100,1000,1,gawk,0.00,3544
100,1000,1,mawk,0.00,1852
100,1000,1,datamash,0.00,1664
100,1000,1,sort and uniq,0.00,1784
1000,1,0.064,sort,0.00,1796
1000,1,0.064,gawk,0.00,3464
1000,1,0.064,mawk,0.00,1688
1000,1,0.064,datamash,0.00,1564
1000,1,0.064,sort and uniq,0.00,1920
1000,10,1,sort,0.00,1876
1000,10,1,gawk,0.00,3476
1000,10,1,mawk,0.00,1816
1000,10,1,datamash,0.00,1708
1000,10,1,sort and uniq,0.00,1804
1000,100,1,sort,0.00,2020
1000,100,1,gawk,0.00,3488
1000,100,1,mawk,0.00,1984
1000,100,1,datamash,0.00,1644
1000,100,1,sort and uniq,0.00,1948
1000,1000,1,sort,0.00,2724
1000,1000,1,gawk,0.00,4620
1000,1000,1,mawk,0.00,2876
1000,1000,1,datamash,0.01,1568
1000,1000,1,sort and uniq,0.01,2836
10000,1,0.0064,sort,0.01,1880
10000,1,0.0064,gawk,0.00,3564
10000,1,0.0064,mawk,0.00,1628
10000,1,0.0064,datamash,0.00,1616
10000,1,0.0064,sort and uniq,0.01,1876
10000,10,1,sort,0.03,1944
10000,10,1,gawk,0.02,4904
10000,10,1,mawk,0.00,2484
10000,10,1,datamash,0.01,1664
10000,10,1,sort and uniq,0.02,1884
10000,100,1,sort,0.02,3076
10000,100,1,gawk,0.01,5636
10000,100,1,mawk,0.00,3352
10000,100,1,datamash,0.02,1564
10000,100,1,sort and uniq,0.03,3036
10000,1000,1,sort,0.05,11644
10000,1000,1,gawk,0.06,14328
10000,1000,1,mawk,0.03,12232
10000,1000,1,datamash,0.09,1728
10000,1000,1,sort and uniq,0.10,11772
100000,1,0.00064,sort,0.20,4252
100000,1,0.00064,gawk,0.04,3332
100000,1,0.00064,mawk,0.01,1756
100000,1,0.00064,datamash,0.01,1728
100000,1,0.00064,sort and uniq,0.21,4252
100000,10,1,sort,0.29,5192
100000,10,1,gawk,0.14,17040
100000,10,1,mawk,0.13,8284
100000,10,1,datamash,0.04,1520
100000,10,1,sort and uniq,0.27,5188
100000,100,1,sort,0.34,13904
100000,100,1,gawk,0.18,26404
100000,100,1,mawk,0.15,17348
100000,100,1,datamash,0.10,1708
100000,100,1,sort and uniq,0.34,13836
100000,1000,1,sort,0.58,101788
100000,1000,1,gawk,0.62,113672
100000,1000,1,mawk,0.42,105888
100000,1000,1,datamash,0.73,1564
100000,1000,1,sort and uniq,0.84,101720
1000000,1,0.000064,sort,1.29,34940
1000000,1,0.000064,gawk,0.44,3304
1000000,1,0.000064,mawk,0.15,1752
1000000,1,0.000064,datamash,0.10,1568
1000000,1,0.000064,sort and uniq,1.33,35088
1000000,10,1,sort,2.66,43728
1000000,10,1,gawk,1.65,138700
1000000,10,1,mawk,1.79,65772
1000000,10,1,datamash,0.44,1620
1000000,10,1,sort and uniq,2.37,44008
1000000,100,1,sort,2.91,131600
1000000,100,1,gawk,2.10,232412
1000000,100,1,mawk,2.32,157876
1000000,100,1,datamash,1.23,1704
1000000,100,1,sort and uniq,3.64,131592
1000000,1000,1,sort,5.36,1010664
1000000,1000,1,gawk,6.53,1107508
1000000,1000,1,mawk,5.23,1042296
1000000,1000,1,datamash,7.79,1620
1000000,1000,1,sort and uniq,7.76,1010804
