Hey Berny,

I like your suggestion of testing whether hashing interferes with any other option.
I was glad to see the POSIX standard doesn't explicitly require sorted input or output.
If someone writes the patch, I should be able to test it, at least on up to 1 GB of input.

So,
Kingsley

On 07/12/2018 08:43, Bernhard Voelker wrote:
> On 07/10/2018 08:21 PM, Assaf Gordon wrote:
> > I would suggest several improvements to the measurements, to ensure
> > relevant information is conveyed:
> >
> > 1.
> > [...]
>
> great summary.  At the risk of repeating already-mentioned aspects:
>
> 7. Consider all the existing options, i.e. processing modes, of 'uniq' as well.
>    This means it has to be considered (upfront!) whether introducing an alternate
>    way to do things - in this case hashing - only serves the trivial case, or
>    whether it would slow down or even contradict the processing with the other
>    options, currently:
>
>      -c, --count              prefix lines by the number of occurrences
>      -d, --repeated           only print duplicate lines, one for each group
>      -D                       print all duplicate lines
>          --all-repeated[=METHOD]  like -D, but allow separating groups
>                                     with an empty line;
>                                     METHOD={none(default),prepend,separate}
>      -f, --skip-fields=N      avoid comparing the first N fields
>          --group[=METHOD]     show all items, separating groups with an empty line;
>                                 METHOD={separate(default),prepend,append,both}
>      -i, --ignore-case        ignore differences in case when comparing
>      -s, --skip-chars=N       avoid comparing the first N characters
>      -u, --unique             only print unique lines
>      -z, --zero-terminated    line delimiter is NUL, not newline
>      -w, --check-chars=N      compare no more than N characters in lines
>
>    Without deeper thinking about it, especially the combinations with the
>    --group, --repeated and --unique options might be problematic.
>
> 8. Standards.  POSIX says:
>    http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html
>
>      The uniq utility shall read an input file comparing adjacent lines,
>      and write one copy of each input line on the output.  The second and
>      succeeding copies of repeated adjacent input lines shall not be written.
>
>    This makes an assumption about both the input and the output.
>
>    Regarding the input, it means that the processing only has to remember
>    the previous line in order to decide whether to print or discard it in
>    the output.  Therefore, arbitrary input size is fine.
>    Hashing will most probably have issues with arbitrary input size;
>    I am not talking about 1 GB files, but _much_ larger files (yes, we
>    are in 2018, where 1 GB is like nothing).
>
>    Regarding the output, it means that the output is implicitly in sort
>    order as well.  Like the input, uniq can discard the already written
>    data from memory because it is sure that it doesn't need to consider
>    it anymore.
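To make the semantic and memory concerns above concrete, here is a rough Python
sketch of the two strategies (purely illustrative, not a proposed patch; the
function names and the tiny test input are mine):

def uniq_adjacent(lines):
    # POSIX 'uniq' semantics: drop repeated *adjacent* lines only.
    # Memory: just the previous line, so input size is unbounded.
    prev = object()              # sentinel that never equals a real line
    for line in lines:
        if line != prev:
            yield line
        prev = line

def uniq_hashed(lines):
    # Global dedup via a hash set: keep the first occurrence of each
    # distinct line.  Memory grows with the number of distinct lines.
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

data = ["a", "a", "b", "a"]
print(list(uniq_adjacent(data)))   # ['a', 'b', 'a']
print(list(uniq_hashed(data)))     # ['a', 'b']   <- differs on unsorted input

On sorted input the two agree, but on unsorted input they already disagree for
the plain case, and -c, -d, -u and --group are all defined in terms of adjacent
runs, so a hashing mode would presumably have to reproduce that grouping or be
limited to cases where the results happen to coincide.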
> That said, a successful optimization idea does not only have to show
> that it is faster or needs fewer resources in _some_ cases, but must
> also prove that it does not tear down many cases, including extreme
> ones.  The current implementation might be as-is for a good reason.
> If it turns out that the optimization idea breaks even a single one of
> the above use cases, then the dilemma is whether keeping 2 implementations
> is warranted (maintenance!), and whether it is possible to detect the
> extreme cases early enough to switch from the default to the other
> processing strategy.
>
> https://en.wikipedia.org/wiki/Perfect_is_the_enemy_of_good
> or D. Knuth:
>   "premature optimization is the root of all evil
>    (or at least most of it) in programming"
>
> As Assaf said, a patch with a proof of concept would be helpful ... even
> if it's just helpful to prove that the current implementation is fine.
>
> Have a nice day,
> Berny

-- 
Time is the fire in which we all burn.