Hey Berny,

I like your suggestion of testing whether hashing interferes with any other option.
I was glad to see the POSIX standard doesn't explicitly require sorted input or output.
If someone writes the patch, I should be able to test it, at least on up to 1 GB of input.

So,
Kingsley

On 07/12/2018 08:43, Bernhard Voelker wrote:
> On 07/10/2018 08:21 PM, Assaf Gordon wrote:
> > I would suggest several improvements to the measurements, to ensure
> > relevant information is conveyed:
> >
> > 1.
> > [...]
>
> great summary.  At the risk of repeating already-mentioned aspects:
>
> 7. Consider all the existing options, i.e. processing modes, of 'uniq' as well.
>    This means it has to be considered (upfront!) whether introducing an alternate
>    way to do things - in this case hashing - only serves the trivial case, or
>    whether it would slow down or even contradict the processing with the other
>    options, currently:
>
>      -c, --count              prefix lines by the number of occurrences
>      -d, --repeated           only print duplicate lines, one for each group
>      -D                       print all duplicate lines
>          --all-repeated[=METHOD]  like -D, but allow separating groups
>                                     with an empty line;
>                                     METHOD={none(default),prepend,separate}
>      -f, --skip-fields=N      avoid comparing the first N fields
>          --group[=METHOD]     show all items, separating groups with an empty line;
>                                 METHOD={separate(default),prepend,append,both}
>      -i, --ignore-case        ignore differences in case when comparing
>      -s, --skip-chars=N       avoid comparing the first N characters
>      -u, --unique             only print unique lines
>      -z, --zero-terminated    line delimiter is NUL, not newline
>      -w, --check-chars=N      compare no more than N characters in lines
>
>    Without deeper thinking about it, especially the combinations with the
>    --group, --repeated and --unique options might be problematic.
>
> 8. Standards.  POSIX says:
>    http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html
>
>      The uniq utility shall read an input file comparing adjacent lines,
>      and write one copy of each input line on the output.  The second and
>      succeeding copies of repeated adjacent input lines shall not be written.
>
>    This makes an assumption about both the input and the output.
>
>    Regarding the input, it means that the processing only has to remember
>    the previous line in order to decide whether to print or discard it in
>    the output.  Therefore, arbitrary input size is fine.
>    Hashing will most probably have issues with arbitrary input size;
>    I am not talking about 1 GB files, but _much_ larger files (yes, we
>    are in 2018, where 1 GB is like nothing).
>
>    Regarding the output, it means that the output is implicitly in sort
>    order as well.  Like the input, uniq can discard the already written
>    data from memory because it is sure that it doesn't need to consider
>    it anymore.
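To make the semantic and memory concerns above concrete, here is a rough Python
sketch of the two strategies (purely illustrative, not a proposed patch; the
function names and the tiny test input are mine):

def uniq_adjacent(lines):
    # POSIX 'uniq' semantics: drop repeated *adjacent* lines only.
    # Memory: just the previous line, so input size is unbounded.
    prev = object()              # sentinel that never equals a real line
    for line in lines:
        if line != prev:
            yield line
        prev = line

def uniq_hashed(lines):
    # Global dedup via a hash set: keep the first occurrence of each
    # distinct line.  Memory grows with the number of distinct lines.
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

data = ["a", "a", "b", "a"]
print(list(uniq_adjacent(data)))   # ['a', 'b', 'a']
print(list(uniq_hashed(data)))     # ['a', 'b']   <- differs on unsorted input

On sorted input the two agree, but on unsorted input they already disagree for
the plain case, and -c, -d, -u and --group are all defined in terms of adjacent
runs, so a hashing mode would presumably have to reproduce that grouping or be
limited to cases where the results happen to coincide.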
> That said, a successful optimization idea does not only have to show
> that it is faster or needs fewer resources in _some_ cases, but must
> also prove that it does not tear down many cases, including extreme
> ones.  The current implementation might be as-is for a good reason.
> If it turns out that the optimization idea breaks even a single one of
> the above use cases, then the dilemma is whether keeping 2 implementations
> is warranted (maintenance!), and whether it is possible to detect the
> extreme cases early enough to switch from the default to the other
> processing strategy.
>
> https://en.wikipedia.org/wiki/Perfect_is_the_enemy_of_good
> or D. Knuth:
>   "premature optimization is the root of all evil
>    (or at least most of it) in programming"
>
> As Assaf said, a patch with a proof of concept would be helpful ... even
> if it's just helpful to prove that the current implementation is fine.
>
> Have a nice day,
> Berny

-- 
Time is the fire in which we all burn.