Csepp <raingl...@riseup.net> writes:
> I have a question / suggestion about the distributed substitutes
> project: would downloads be split into uniformly sized chunks or could
> the sizes vary?

For the proposal that uses ERIS (https://issues.guix.gnu.org/52555) the
chunks are uniformly sized (32KiB).

> Specifically, in an extreme case where an update introduced a single
> extra byte at the beginning of a file, would that result in completely
> new chunks?

Yes, that would be the case. ERIS uses fixed block sizes, and in such
extreme cases every chunk would be new, which means very bad
de-duplication.

The reason for using fixed block sizes is security/privacy. When using
variable-sized blocks, the sizes are observable by a potential censor
and are also a function of the content itself. This leaks information
about the transferred content. I believe there are documented cases of
HTTPS connections being blocked/censored based on size of requests
[citation needed]. This is something ERIS tries to prevent.

That being said, I think there is still room for optimizing the
de-duplication even with fixed-size blocks.

> An alternative I've been thinking about is this:
> find the store references in a file and split it along these references,
> optionally apply further chunking to the non-reference blobs.
>
> It's probably best to do this at the NAR level??

I like the idea! If I understand correctly, we would split whenever a
store reference appears. When a single store reference changes (this
probably happens quite often), only the preceding block changes.

I think there is also a way to do something similar while preserving
fixed-size blocks: maintain a lookup table for all store references
appearing in a store item. When serializing, this lookup table goes to
the front (or back) with appropriate padding so that it is block
aligned. All store references in the remaining serialization are
replaced by a reference to the lookup table.
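
To make the lookup-table idea a bit more concrete, here is a rough
Python sketch. It is only an illustration under several assumptions and
not part of ERIS or of the patches: the names `encode` and `block_ids`,
the 4-byte entry count, the zero padding and the 32-digit index
placeholder are all made up, a store reference is assumed to be the
32-character hash after /gnu/store/, and SHA-256 merely stands in for
whatever identifies a block.

```python
import hashlib
import re

BLOCK_SIZE = 32 * 1024  # the 32KiB block size mentioned above

# Assumption: a store reference is the 32-character hash part that
# follows /gnu/store/ in the serialized store item.
STORE_REF = re.compile(rb"/gnu/store/([0-9a-z]{32})")


def encode(data: bytes) -> bytes:
    """Hypothetical encoding: block-aligned lookup table, then the body."""
    table = []  # distinct hashes, in order of first appearance

    def to_index(match):
        ref = match.group(1)
        if ref not in table:
            table.append(ref)
        # Replace the hash with a fixed-width index into the table.
        return b"/gnu/store/" + b"%032d" % table.index(ref)

    body = STORE_REF.sub(to_index, data)
    header = len(table).to_bytes(4, "big") + b"".join(table)
    padding = -len(header) % BLOCK_SIZE  # pad header to a block boundary
    return header + b"\0" * padding + body


def block_ids(encoded: bytes) -> list:
    """Split into fixed-size blocks and hash each one (SHA-256 stand-in)."""
    return [
        hashlib.sha256(encoded[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(encoded), BLOCK_SIZE)
    ]


# Two versions of a store item that differ only in one store reference.
old_ref = b"/gnu/store/" + b"a" * 32 + b"-glibc-2.33"
new_ref = b"/gnu/store/" + b"b" * 32 + b"-glibc-2.33"
payload = b"x" * 100_000

old = block_ids(encode(old_ref + payload + old_ref))
new = block_ids(encode(new_ref + payload + new_ref))

# Indices of the blocks that differ between the two versions.
changed = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
print(changed)  # → [0]
```

In this toy example only block 0 (the lookup table) differs; the four
body blocks are byte-identical and would be de-duplicated. A real
encoding would also need an unambiguous placeholder and a matching
decoder.
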
Now when a store reference changes, only the lookup table changes; the
remaining content stays the same and is de-duplicated.

A similar idea for also allowing de-duplication when individual files
change:
https://codeberg.org/eris/eer/src/branch/eris-fs/eer/eris-fs/index.md

Also check out the Guix `wip-digests` branch. There are some related
interesting ideas there.

I'm working on rebasing and updating the decentralized substitute
patches. Sorry for the slowness. They would at first only address
block-wise transfer with a naive encoding that does not do very good
de-duplication. As outlined, I think de-duplication can be added later,
and I think it's great to start thinking about it and experimenting
with ideas.

-pukkamustard