Re: RFC: split(1) and content-defined chunking, e.g. --bytes h/N & --line-bytes=h/N

Pádraig Brady Mon, 20 Jan 2025 08:03:46 -0800

On 20/01/2025 12:51, Leonid Evdokimov wrote:

On Wed, Jan 15, 2025 at 3:20 PM Pádraig Brady <p...@draigbrady.com> wrote:

This might indeed be general enough for coreutils.


What do you think about "API stability guarantees"?

I think that the aforementioned --hash-seed is a bit of a bold move
and, probably, the code should follow guarantees of `shuf`. As far as
I see, it's closer to "the same binary on the same platform behaves in
the same way given the same input, but not more". E.g. --random-source
may feed ISAAC and it behaves differently on 32-bit and 64-bit
platforms to the best of my understanding.

That also gives me freedom to pick slightly different CDC algorithms
depending on the desired chunking parameters. E.g. Gear-based CDC is
twice as fast as BUZ hash-based one, but it depends on a narrow
sliding window.
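
For readers following the thread, the "narrow sliding window" of a Gear-style hash can be illustrated with a minimal sketch in C. This is not the code proposed for split(1); the names gear_table, gear_init, gear_hash and gear_next_cut are hypothetical, and the table is filled from a splitmix64-style generator purely for illustration. With a 64-bit hash and a one-bit shift per byte, each byte is shifted out after 64 steps, so the hash depends only on the last 64 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 256-entry table of pseudo-random 64-bit values.
   Assumption: filled from a seed with a splitmix64-style mixer;
   a real implementation might ship a fixed precomputed table.  */
static uint64_t gear_table[256];

static void
gear_init (uint64_t seed)
{
  for (int i = 0; i < 256; i++)
    {
      uint64_t z = (seed += 0x9e3779b97f4a7c15ULL);
      z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
      z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
      gear_table[i] = z ^ (z >> 31);
    }
}

/* Gear hash of buf[0..len).  Each step shifts the state left by one
   bit, so a byte stops influencing the result 64 steps later.  */
static uint64_t
gear_hash (const unsigned char *buf, size_t len)
{
  uint64_t h = 0;
  for (size_t i = 0; i < len; i++)
    h = (h << 1) + gear_table[buf[i]];
  return h;
}

/* Return the length of the first chunk of buf[0..len): cut after the
   first byte at which the top BITS bits of the hash are all zero, or
   return LEN if no such byte exists.  BITS must be in 1..63.  For
   random-looking input the expected chunk size is roughly 2^BITS.  */
static size_t
gear_next_cut (const unsigned char *buf, size_t len, unsigned int bits)
{
  uint64_t h = 0;
  for (size_t i = 0; i < len; i++)
    {
      h = (h << 1) + gear_table[buf[i]];
      if ((h >> (64 - bits)) == 0)
        return i + 1;
    }
  return len;
}
```

The sketch tests the top bits because the low bits of a Gear hash depend on only the last few bytes, which would make the window narrower still.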


Well, we don't document any platform-dependent behavior for --random-source.
I'm not sure whether its output is platform dependent, but if it is,
we should document that.

It would be good to keep a consistent --random-source interface if possible.
Note we documented a seeded interface through --random-source at:
https://www.gnu.org/s/coreutils/manual/html_node/Random-sources.html

Hash Judgement function also has at least three options:
1) bitmask-based is the fastest one, but it needs the chunk size to be
exactly a power of 2


For a 2x speedup we might at least advise, or default to, using a power of 2.

2) fastrange() by Daniel Lemire, is ~2 times slower, but it needs
access to 64x64-to-128 multiplication that might be absent on some
platforms


We already have platform dependent code in coreutils.
Look at the use of pclmul in cksum and how it's separated at build/run time.

3) old-fashioned $(hash % mod) is ~20 times slower, but it has no requirements


Ideally, performance wouldn't differ significantly based solely on
the chunk size. If the bitmask-based option is cross platform, then having
to round the chunk size to the nearest power of 2 sounds like
a reasonable constraint, I think.
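
To make the three judgement options and the rounding constraint concrete, here is a rough sketch in C. It is an illustration under stated assumptions, not the proposed implementation: the function names cut_bitmask, cut_fastrange, cut_modulo and round_pow2 are hypothetical, a cut is declared when the mapped hash equals zero, and the fastrange variant is compiled only where the compiler provides a 128-bit integer type (detected via __SIZEOF_INT128__ on GCC and Clang).

```c
#include <stdbool.h>
#include <stdint.h>

/* 1) Bitmask: AVG must be a power of 2.  */
static bool
cut_bitmask (uint64_t hash, uint64_t avg)
{
  return (hash & (avg - 1)) == 0;
}

/* 2) fastrange: map HASH into [0, AVG) by taking the high 64 bits of
   the 128-bit product HASH * AVG.  Needs a 64x64-to-128 multiply.  */
#ifdef __SIZEOF_INT128__
static bool
cut_fastrange (uint64_t hash, uint64_t avg)
{
  return (uint64_t) (((unsigned __int128) hash * avg) >> 64) == 0;
}
#endif

/* 3) Modulo: no requirements on AVG other than being nonzero.  */
static bool
cut_modulo (uint64_t hash, uint64_t avg)
{
  return hash % avg == 0;
}

/* Round N to the nearest power of 2, with ties rounding up, so that
   the bitmask test can be used for any requested chunk size.
   Returns 1 for N <= 1.  */
static uint64_t
round_pow2 (uint64_t n)
{
  if (n <= 1)
    return 1;
  uint64_t lo = 1;
  while (lo <= n / 2)
    lo *= 2;                    /* largest power of 2 that is <= n */
  if (lo > UINT64_MAX / 2)
    return lo;                  /* 2 * lo would overflow */
  return (n - lo < 2 * lo - n) ? lo : 2 * lo;
}
```

With rounding applied up front, for example round_pow2 (1000) giving 1024, only the bitmask path is needed and the result does not depend on platform support for wide multiplication.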

cheers,
Pádraig
