On Wed, Jan 15, 2025 at 3:20 PM Pádraig Brady <p...@draigbrady.com> wrote:
> Might coreutils csplit be a better place for this,
> given the split is dependent on the content?

I can argue for both sides :-)

I picked split over csplit because CDC treats its input as a stream of
bytes and targets a specific output size, as split does. The largest
semantic unit in split is a single-byte record delimiter. csplit is more
line- and regexp-oriented, so it does less byte-level processing.

That said, I totally agree that the goals of split and csplit overlap to
some extent, and there might be some desire for hash-based behavior in
csplit as well. However, I'd rather consider adding code to support
patterns like these:

$ split ... --separator ',\n' ...       # multi-byte string, usual JSONL separator
$ split ... --separator '</subdoc>' ... # another somewhat common multi-byte string
$ split ... --separator <(grep --byte-offset ...) ... # grep is good at regexps :-)

So the power of grep might be used to specify _potential_ cut points, and
some CDC hash might be used to pick a subset of those cuts.

--
WBRBW, Leonid Evdokimov, https://darkk.net.ru tel:+79816800702
PGP: 6691 DE6B 4CCD C1C1 76A0 0D4A E1F2 A980 7F50 FAB2
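
P.S. To illustrate the "potential cut points filtered by a hash" idea, here is a rough Python sketch. It is not the proposed split code. The names `pick_cuts` and `split_at`, the 16-byte window, the BLAKE2b hash, and the `mask_bits` parameter are all assumptions made for illustration; a real implementation would more likely use a rolling hash and enforce minimum and maximum chunk sizes.

```python
import hashlib


def pick_cuts(data: bytes, candidates, window: int = 16, mask_bits: int = 2):
    """Keep the candidate offsets whose preceding window hashes to a value
    with the low `mask_bits` bits equal to zero (about 1 in 2**mask_bits).

    `candidates` are byte offsets, e.g. produced by something like
    `grep --byte-offset`; this function only selects a subset of them.
    """
    mask = (1 << mask_bits) - 1
    cuts = []
    for off in sorted(set(candidates)):
        if off <= 0 or off >= len(data):
            continue  # a cut at the very start or end yields an empty chunk
        win = data[max(0, off - window):off]
        digest = hashlib.blake2b(win, digest_size=8).digest()
        if int.from_bytes(digest, "big") & mask == 0:
            cuts.append(off)
    return cuts


def split_at(data: bytes, cuts):
    """Split `data` at the given sorted byte offsets."""
    chunks, prev = [], 0
    for cut in cuts:
        chunks.append(data[prev:cut])
        prev = cut
    chunks.append(data[prev:])
    return chunks


if __name__ == "__main__":
    # Hypothetical input: 20-byte records, candidate cuts after each newline.
    data = b"".join(b"record-%04d-payload\n" % i for i in range(50))
    candidates = [i + 1 for i, b in enumerate(data) if b == ord("\n")]
    cuts = pick_cuts(data, candidates)
    print(len(candidates), "candidates,", len(cuts), "cuts kept")
```

Because the decision at each candidate depends only on the bytes just before it, prepending data to the input shifts the selected cuts without changing which records they fall after, as long as the window lies entirely within the original data.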