Hey Assaf, Thank you very much for taking the time to share your thoughts.
People like you, willing to improve free software and help the common good, restore my faith in humanity. I followed much of your advice. I ran more benchmarks: gawk, mawk, sort -u, sort | uniq and ... datamash rmdup! Plus, I benchmarked them against a bigger file. ~1 GB! Plus, plus, I measured how much memory they used! The result? Hashing algorithms always won. I expected it. When I benchmarked data compression apps, one named "rzip" that searched its cache of unique values with hashing was much faster. https://www.linuxjournal.com/article/8051 But, to be fair, I was surprised by how little memory datamash seemed to use. Maybe I screwed up. I checked max resident set size of the process during its lifetime with /usr/bin/time's "%M" specifier. Plus, plus, plus, I created a composite performance criteria of elapsed time * memory used (It's in the attached spread sheet.) That winner at ~1 GB? datamash! No matter whether the keys were unique or duplicated, "datamash rmdup 1" won! New files are attached. Humble suggestion: Don't worry about running out of memory. Hashing could be added to the "uniq" command as... (wait for it...) an option. So you could always omit it and use sort. For yours truly, the big gain is actually being able to ditch sort. Less is more. It's a UNIX thing. So it is said. So it is done. Sorting is great, but less is number one! ~K PS: Congratulations on datamash performing so well. On 07/10/2018 12:21, Assaf Gordon wrote: > tag 32099 notabug > severity 32099 wishlist > close 32099 > stop > > Hello Kingsley, > > On 09/07/18 10:29 PM, Kingsley G. Morse Jr. wrote: > > And I like your benchmarks. > > > > Mine are humbly attached. > > > > They compare sorting to hashing. > > > > At the moment, hashing seems to me to be about 70% > > faster. > > > > And to scale better. > > > > I'd still like uniq to be faster and free from > > sort. > > Thank you for taking the time to create and provide these, > including the scripts to reproduce it. > > I've marked this item as "closed / wishlist" - but only because it > is technically not a bug report. Discussion is welcomed to continue > by replying to this thread. > > Being "faster" alone is important but not the only criteria (as > mentioned in the previous email's examples). > > Your measurements are a good start towards stronger support to > hash-based uniq implementation (a working patch, btw, would be even > stronger). > > I would suggest several improvements to the measurements, to ensure > relevant information is conveyed: > > 1. > Specify exactly which versions of awk/sort you are using > (e.g. which versions, whether from your distribution or compiled from > source, if compiled than which compiler / optimization flags). > What CPUs is being used, how much memory does the computer have, etc. > > 2. > I would suggest adding "sort | uniq" to the charts, because that is the > real base-line you are trying to improve. > I would also consider adding "datamash rmdup" to the charts, because its > hash-based implementation could serve as a ball-park indication of how > a naive hash implementation would work in uniq. > > 3. > You input is very small (up to 90K lines of 10 characters each) - that > does not seem to stress-test any significant parts of the algorithms > (and for a typical user, waiting 0.05 seconds vs 0.15 seconds does not > matter). > > I would recommend much larger input files, with much longer lines - > trying to simulate various real-world scenarios. > > Here's a project of someone who tried to compare various hashing > functions: https://github.com/jhunt/hash/ > Under the "corpus" directory you'll find few demo data files. > Even the largest one (qnames) is only 84MB - not very big. > > Perhaps consider duplicating the content few times, then test it: > for i in $(seq 1 5);do cat qnames; done > input > > To get even closer to real-world input, try several input files > with different ratios of duplicated lines (your script create completely > random data - not very similar to real-world input files). > > For example, what happens with many lines share similar sub-strings? > depending on the hashing implementation, this could lead to lower > performance. > > 4. > It would be beneficial to see what's the memory requirements and limits > of a hash-based implementation. Currently, because 'sort' is not limited > by available RAM, there is no memory limit (though resorting to disk I/O > might make everything slow). > For example, is it possible to hash an 800M file on a machine with only 1GB > of RAM ? and without additional code, hashing files larger than available > memory is simply not possible - a very big disadvantage. > > 5. > To compare "sort' fairly against other options, > I would add to the measurements using "--parallel 1" and later "--parallel > N" (N being the number of cpu/cores you have). > Does that improve performance? > > The "awk" (or datamash) programs do not have a built-in memory limit, > while sort does. It would be useful to specify a large amount of > memory for sort (using "-S") to ensure sort did not revert to disk I/O. > > 6. > The reported time in your script is the elapsed time (aka "wall time"). > A more relevant option would be "user time" + "kernel time" > (optiosn %S and %U in "time -f") - the wall time can be affected by > factors that don't immediately relate to hashing/sorting performance. > > 7. > In your script you are providing a file-based input, > it might be useful to clear the kernel's cache before each run > to ensure these do not affect the results (especially when dealing with > large files). > See: > https://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/ > > > > > To conclude, > There is definitely merit in trying to optimize uniq's performance, > but it must be weighed against other factors (e.g. the ability to 'uniq' > a file larger than available memory, and the ease of using existing programs > to achieve the same effect). > > IMHO it is not a top-priority, so I don't want to give the impression > that if an amazing speed improvement is presented, uniq will be swiftly > rewritten. > > But of course, a working patch and strong pro arguments would go a long > way towards incorporating this feature. > > regards, > - assaf > > > -- Time is the fire in which we all burn.
# Report heading. echo 'lines,line length,percent unique,algorithm,time(sec),memory(K)' # Benchmark with increasing lines of data... for data_lines in 1 10 100 1000 10000 100000 1000000 ; do # Benchmark with increasing line length.. for line_length in 1 10 100 1000 ; do # Synchronize cached writes to persistent storage sync # Generate a certain number of lines of random data of a certain length tr -cd '[[:alnum:]]' < /dev/urandom | fold -bw "$line_length" | head -n "$data_lines" > /tmp/data # Calculate the percentage of lines that are unique. number_of_unique_lines="$( sort -u /tmp/data | wc -l )" percent_unique="$( calc -dp "$number_of_unique_lines/$data_lines" )" # Benchmark sort with its unique option. /usr/bin/time -o /tmp/sort_time -f %e,%M sort -u /tmp/data > /dev/null echo "$data_lines,$line_length,$percent_unique,sort,$( cat /tmp/sort_time )" # Benchmark gawk hashing. /usr/bin/time -o /tmp/gawk_time -f %e,%M gawk '!seen[$0]++' /tmp/data > /dev/null echo "$data_lines,$line_length,$percent_unique,gawk,$( cat /tmp/gawk_time )" # Benchmark mawk hashing. /usr/bin/time -o /tmp/mawk_time -f %e,%M mawk '!seen[$0]++' /tmp/data > /dev/null echo "$data_lines,$line_length,$percent_unique,mawk,$( cat /tmp/mawk_time )" # Benchmark datamash. /usr/bin/time -o /tmp/datamash_time -f %e,%M cat /tmp/data | datamash rmdup 1 > /dev/null echo "$data_lines,$line_length,$percent_unique,datamash,$( cat /tmp/datamash_time )" # Benchmark a pipe of sort and unique. /usr/bin/time -o /tmp/sort_and_uniq_time -f %e,%M sort /tmp/data | uniq > /dev/null echo "$data_lines,$line_length,$percent_unique,sort and uniq,$( cat /tmp/sort_and_uniq_time )" done done
hash_and_sort_benchmarks.2.gnumeric
Description: updated gnumeric spread sheet
lines,line length,percent unique,algorithm,time(sec),memory(K) 1,1,1,sort,0.00,1848 1,1,1,gawk,0.00,3424 1,1,1,mawk,0.00,1780 1,1,1,datamash,0.00,1580 1,1,1,sort and uniq,0.00,1828 1,10,1,sort,0.00,1916 1,10,1,gawk,0.00,3352 1,10,1,mawk,0.00,1832 1,10,1,datamash,0.00,1560 1,10,1,sort and uniq,0.00,1828 1,100,1,sort,0.00,1940 1,100,1,gawk,0.00,3324 1,100,1,mawk,0.00,1680 1,100,1,datamash,0.00,1564 1,100,1,sort and uniq,0.00,1904 1,1000,1,sort,0.00,1788 1,1000,1,gawk,0.00,3312 1,1000,1,mawk,0.00,1740 1,1000,1,datamash,0.00,1704 1,1000,1,sort and uniq,0.00,1916 10,1,0.9,sort,0.00,1780 10,1,0.9,gawk,0.00,3412 10,1,0.9,mawk,0.00,1628 10,1,0.9,datamash,0.00,1584 10,1,0.9,sort and uniq,0.00,1840 10,10,1,sort,0.00,1864 10,10,1,gawk,0.00,3444 10,10,1,mawk,0.00,1792 10,10,1,datamash,0.00,1616 10,10,1,sort and uniq,0.00,1928 10,100,1,sort,0.00,1792 10,100,1,gawk,0.00,3252 10,100,1,mawk,0.00,1784 10,100,1,datamash,0.00,1620 10,100,1,sort and uniq,0.00,1964 10,1000,1,sort,0.00,1972 10,1000,1,gawk,0.00,3248 10,1000,1,mawk,0.00,1772 10,1000,1,datamash,0.00,1640 10,1000,1,sort and uniq,0.00,1948 100,1,0.49,sort,0.00,1920 100,1,0.49,gawk,0.00,3516 100,1,0.49,mawk,0.00,1636 100,1,0.49,datamash,0.00,1616 100,1,0.49,sort and uniq,0.00,1828 100,10,1,sort,0.00,1968 100,10,1,gawk,0.00,3548 100,10,1,mawk,0.00,1696 100,10,1,datamash,0.00,1668 100,10,1,sort and uniq,0.00,1924 100,100,1,sort,0.00,1916 100,100,1,gawk,0.00,3536 100,100,1,mawk,0.00,1776 100,100,1,datamash,0.00,1632 100,100,1,sort and uniq,0.00,1928 100,1000,1,sort,0.00,1868 100,1000,1,gawk,0.00,3544 100,1000,1,mawk,0.00,1852 100,1000,1,datamash,0.00,1664 100,1000,1,sort and uniq,0.00,1784 1000,1,0.064,sort,0.00,1796 1000,1,0.064,gawk,0.00,3464 1000,1,0.064,mawk,0.00,1688 1000,1,0.064,datamash,0.00,1564 1000,1,0.064,sort and uniq,0.00,1920 1000,10,1,sort,0.00,1876 1000,10,1,gawk,0.00,3476 1000,10,1,mawk,0.00,1816 1000,10,1,datamash,0.00,1708 1000,10,1,sort and uniq,0.00,1804 1000,100,1,sort,0.00,2020 1000,100,1,gawk,0.00,3488 1000,100,1,mawk,0.00,1984 1000,100,1,datamash,0.00,1644 1000,100,1,sort and uniq,0.00,1948 1000,1000,1,sort,0.00,2724 1000,1000,1,gawk,0.00,4620 1000,1000,1,mawk,0.00,2876 1000,1000,1,datamash,0.01,1568 1000,1000,1,sort and uniq,0.01,2836 10000,1,0.0064,sort,0.01,1880 10000,1,0.0064,gawk,0.00,3564 10000,1,0.0064,mawk,0.00,1628 10000,1,0.0064,datamash,0.00,1616 10000,1,0.0064,sort and uniq,0.01,1876 10000,10,1,sort,0.03,1944 10000,10,1,gawk,0.02,4904 10000,10,1,mawk,0.00,2484 10000,10,1,datamash,0.01,1664 10000,10,1,sort and uniq,0.02,1884 10000,100,1,sort,0.02,3076 10000,100,1,gawk,0.01,5636 10000,100,1,mawk,0.00,3352 10000,100,1,datamash,0.02,1564 10000,100,1,sort and uniq,0.03,3036 10000,1000,1,sort,0.05,11644 10000,1000,1,gawk,0.06,14328 10000,1000,1,mawk,0.03,12232 10000,1000,1,datamash,0.09,1728 10000,1000,1,sort and uniq,0.10,11772 100000,1,0.00064,sort,0.20,4252 100000,1,0.00064,gawk,0.04,3332 100000,1,0.00064,mawk,0.01,1756 100000,1,0.00064,datamash,0.01,1728 100000,1,0.00064,sort and uniq,0.21,4252 100000,10,1,sort,0.29,5192 100000,10,1,gawk,0.14,17040 100000,10,1,mawk,0.13,8284 100000,10,1,datamash,0.04,1520 100000,10,1,sort and uniq,0.27,5188 100000,100,1,sort,0.34,13904 100000,100,1,gawk,0.18,26404 100000,100,1,mawk,0.15,17348 100000,100,1,datamash,0.10,1708 100000,100,1,sort and uniq,0.34,13836 100000,1000,1,sort,0.58,101788 100000,1000,1,gawk,0.62,113672 100000,1000,1,mawk,0.42,105888 100000,1000,1,datamash,0.73,1564 100000,1000,1,sort and uniq,0.84,101720 1000000,1,0.000064,sort,1.29,34940 1000000,1,0.000064,gawk,0.44,3304 1000000,1,0.000064,mawk,0.15,1752 1000000,1,0.000064,datamash,0.10,1568 1000000,1,0.000064,sort and uniq,1.33,35088 1000000,10,1,sort,2.66,43728 1000000,10,1,gawk,1.65,138700 1000000,10,1,mawk,1.79,65772 1000000,10,1,datamash,0.44,1620 1000000,10,1,sort and uniq,2.37,44008 1000000,100,1,sort,2.91,131600 1000000,100,1,gawk,2.10,232412 1000000,100,1,mawk,2.32,157876 1000000,100,1,datamash,1.23,1704 1000000,100,1,sort and uniq,3.64,131592 1000000,1000,1,sort,5.36,1010664 1000000,1000,1,gawk,6.53,1107508 1000000,1000,1,mawk,5.23,1042296 1000000,1000,1,datamash,7.79,1620 1000000,1000,1,sort and uniq,7.76,1010804