Cojocaru Alexandru wrote:
> Subject: [PATCH] cut: use only one data structure
>
> The current implementation of cut uses a bit array,
> an array of `struct range_pair's, and (when --output-delimiter
> is specified) a hash_table.  The new implementation will use
> only an array of `struct range_pair's.
> The old implementation is inefficient for the following reasons:
> 1. When -b with a big num is specified, it allocates a lot of useless
>    memory for `printable_field'.
> 2. When --output-delimiter is specified, it allocates 31 buckets,
>    even if only a few ranges are specified.
...
> -/* Given the list of field or byte range specifications FIELDSTR, set
> -   MAX_RANGE_ENDPOINT and allocate and initialize the PRINTABLE_FIELD
> -   array.  If there is a right-open-ended range, set EOL_RANGE_START
> -   to its starting index.  FIELDSTR should be composed of one or more
> -   numbers or ranges of numbers, separated by blanks or commas.
> -   Incomplete ranges may be given: '-m' means '1-m'; 'n-' means 'n'
> -   through end of line.  Return true if FIELDSTR contains at least
> -   one field specification, false otherwise. */
> -
> -/* FIXME-someday: What if the user wants to cut out the 1,000,000-th
> -   field of some huge input file?  This function shouldn't have to
> -   allocate a table of a million bits just so we can test every
> -   field < 10^6 with an array dereference.  Instead, consider using
> -   an adaptive approach: if the range of selected fields is too large,
> -   but only a few fields/byte-offsets are actually selected, use a
> -   hash table.  If the range of selected fields is too large, and
> -   too many are selected, then resort to using the range-pairs (the
> -   'rp' array) directly.  */
Thanks for the patch.  This is large enough that you'll have to file
a copyright assignment; for details, see the "Copyright assignment"
section in the file named HACKING.

Have you considered performance in the common case?  I suspect that a
byte or field number larger than 1000 is not common.  That is why, in
the FIXME comment above, I suggested using an adaptive approach.  I had
the feeling (I don't remember whether I profiled it) that testing a bit
per input field would be more efficient than an in-range test.

If you construct test cases and gather timing data, please do so in a
reproducible manner and include the details when you report back, so we
can compare on different types of systems.  Cache behavior matters a
lot these days.
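To make the trade-off concrete, here is a minimal sketch of the two
membership tests being compared: a bit-per-field lookup versus an
in-range test via binary search over a sorted, merged array of
lo/hi pairs.  The function names are illustrative, not coreutils API;
only `struct range_pair` mirrors the lo/hi pair used in cut.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the lo/hi pair cut.c keeps per selected range.  */
struct range_pair { size_t lo; size_t hi; };

/* In-range test: binary search over N sorted, non-overlapping pairs.
   O(log N) per field, but with a data-dependent branch per step.  */
bool
is_printable_rp (struct range_pair const *rp, size_t n, size_t field)
{
  size_t lo = 0, hi = n;
  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (field < rp[mid].lo)
        hi = mid;
      else if (field > rp[mid].hi)
        lo = mid + 1;
      else
        return true;
    }
  return false;
}

/* Bit-array test: one bit per field up to MAX, the largest closed
   endpoint.  O(1) per field, but the table costs MAX/8 bytes even
   when only a handful of fields are selected.  */
bool
is_printable_bits (unsigned char const *bits, size_t max, size_t field)
{
  return field <= max && ((bits[field / 8] >> (field % 8)) & 1);
}
```

For a selection like `-f2-4,7`, both tests return the same answers;
the question raised above is purely which one is cheaper per input
field once cache effects are taken into account.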