Re: RFC: split(1) and content-defined chunking, e.g. --bytes h/N & --line-bytes=h/N

Leonid Evdokimov Wed, 15 Jan 2025 04:49:24 -0800

On Wed, Jan 15, 2025 at 3:20 PM Pádraig Brady <[email protected]> wrote:
> Might coreutils csplit be a better place for this,
> given the split is dependent on the content?


I can argue for both sides :-)

I've picked split over csplit because CDC treats its input as a stream
of bytes and targets a specific output size, like split does. The largest
semantic unit in split is a single-byte record delimiter. csplit is more
line- and regexp-oriented, so it does less byte-level processing.

Meanwhile, I totally agree that there is a certain overlap between the
goals of split and csplit, and there might be some desire for
hash-based behavior in csplit as well. However, I'd rather consider
adding code to support patterns like these:

$ split ... --separator ',\n' ... # multi-byte string, usual JSONL separator

$ split ... --separator '</subdoc>' ... # another somewhat common multi-byte string

$ split ... --separator <(grep --byte-offset ...) ... # grep is good at regexps :-)

So the power of grep might be used to specify _potential_ cut points
and some CDC hash might be used to pick a subset of those cuts.
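
To illustrate the idea (a rough sketch only, not a proposed implementation):
candidate offsets, such as those `grep --byte-offset` would report, are
filtered by hashing a few bytes before each candidate and keeping about one
in N. The names pick_cuts, split_at, window and n are hypothetical, and
BLAKE2b is just a stand-in for whatever hash a real CDC mode would use.

```python
import hashlib
import re


def pick_cuts(data: bytes, candidates: list[int], n: int, window: int = 16) -> list[int]:
    """Keep roughly one in `n` candidate cut offsets, chosen by content hash.

    Hypothetical helper: `window` and the choice of BLAKE2b are assumptions
    made for illustration, not part of any existing split(1) behavior.
    """
    kept = []
    for off in candidates:
        # Hash only the bytes just before the candidate, so the decision
        # depends on local content and not on the absolute offset.
        ctx = data[max(0, off - window):off]
        h = int.from_bytes(hashlib.blake2b(ctx, digest_size=8).digest(), "big")
        if h % n == 0:
            kept.append(off)
    return kept


def split_at(data: bytes, cuts: list[int]) -> list[bytes]:
    """Split `data` at the given ascending offsets."""
    chunks = []
    prev = 0
    for c in cuts:
        chunks.append(data[prev:c])
        prev = c
    chunks.append(data[prev:])
    return chunks


# Stand-in for grep: every offset right after a ',\n' separator is a
# potential cut point.
data = b"".join(b'{"id": %d},\n' % i for i in range(200))
candidates = [m.end() for m in re.finditer(rb",\n", data)]
cuts = pick_cuts(data, candidates, n=4)
chunks = split_at(data, cuts)
```

Because each decision looks only at the preceding `window` bytes, inserting
data near the start of the input shifts later cut points but does not change
which separators are chosen, which is the property CDC is after.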

-- 
WBRBW, Leonid Evdokimov, https://darkk.net.ru tel:+79816800702
PGP: 6691 DE6B 4CCD C1C1 76A0  0D4A E1F2 A980 7F50 FAB2
