Cojocaru Alexandru wrote:
> Subject: [PATCH] cut: use only one data structure
>
> The current implementation of cut uses a bit array,
> an array of `struct range_pair's, and (when --output-delimiter
> is specified) a hash_table.  The new implementation will use
> only an array of `struct range_pair's.
> The old implementation is inefficient for the following reasons:
> 1. When -b with a big num is specified, it allocates a lot of useless
>    memory for `printable_field'.
> 2. When --output-delimiter is specified, it allocates 31 buckets,
>    even if only a few ranges are specified.
...
> -/* Given the list of field or byte range specifications FIELDSTR, set
> -   MAX_RANGE_ENDPOINT and allocate and initialize the PRINTABLE_FIELD
> -   array.  If there is a right-open-ended range, set EOL_RANGE_START
> -   to its starting index.  FIELDSTR should be composed of one or more
> -   numbers or ranges of numbers, separated by blanks or commas.
> -   Incomplete ranges may be given: '-m' means '1-m'; 'n-' means 'n'
> -   through end of line.  Return true if FIELDSTR contains at least
> -   one field specification, false otherwise. */
> -
> -/* FIXME-someday: What if the user wants to cut out the 1,000,000-th
> -   field of some huge input file?  This function shouldn't have to
> -   allocate a table of a million bits just so we can test every
> -   field < 10^6 with an array dereference.  Instead, consider using
> -   an adaptive approach: if the range of selected fields is too large,
> -   but only a few fields/byte-offsets are actually selected, use a
> -   hash table.  If the range of selected fields is too large, and
> -   too many are selected, then resort to using the range-pairs (the
> -   'rp' array) directly.  */
Thanks for the patch.  This is large enough that you'll have to file
a copyright assignment; for details, see the "Copyright assignment"
section in the file named HACKING.

Have you considered performance in the common case?  I suspect that a
byte or field number larger than 1000 is not common.  That is why, in
the FIXME comment above, I suggested using an adaptive approach.  I had
the feeling (I don't remember whether I profiled it) that testing a bit
per input field would be more efficient than an in-range test.

If you construct test cases and gather timing data, please do so in a
reproducible manner and include the details when you report back, so we
can compare on different types of systems.  Cache behavior matters a
lot these days.
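To make the trade-off concrete, here is a minimal sketch of the two
membership tests being compared: a bit-per-field lookup versus an
in-range test via binary search over a sorted, merged array of
lo/hi pairs.  The function names are illustrative, not coreutils API;
only `struct range_pair` mirrors the lo/hi pair used in cut.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the lo/hi pair cut.c keeps per selected range.  */
struct range_pair { size_t lo; size_t hi; };

/* In-range test: binary search over N sorted, non-overlapping pairs.
   O(log N) per field, but with a data-dependent branch per step.  */
bool
is_printable_rp (struct range_pair const *rp, size_t n, size_t field)
{
  size_t lo = 0, hi = n;
  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (field < rp[mid].lo)
        hi = mid;
      else if (field > rp[mid].hi)
        lo = mid + 1;
      else
        return true;
    }
  return false;
}

/* Bit-array test: one bit per field up to MAX, the largest closed
   endpoint.  O(1) per field, but the table costs MAX/8 bytes even
   when only a handful of fields are selected.  */
bool
is_printable_bits (unsigned char const *bits, size_t max, size_t field)
{
  return field <= max && ((bits[field / 8] >> (field % 8)) & 1);
}
```

For a selection like `-f2-4,7`, both tests return the same answers;
the question raised above is purely which one is cheaper per input
field once cache effects are taken into account.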