Hello pukkamustard!

pukkamustard <pukkamust...@posteo.net> skribis:

> I looked into block boundaries with a "sliding hash" (re-compute a
> short hash for every byte read and choose boundaries when the hash is
> zero). This would allow a higher degree of de-duplication, but I
> found this to be a bit "finicky" (and myself too impatient to tune
> and tweak this :).
>
> I settled on fixed block sizes, making the encoding faster and
> preventing information leaks based on block size.

Yeah, sounds reasonable. (I evaluated the benefits of this and other
approaches years ago, FWIW:
<https://hal.inria.fr/hal-00187069/en>.)

> Another idea to increase de-duplication: when encoding a directory,
> align files to the ERIS block size. This would allow de-duplication
> of files across encoded images/directories.

I guess that’d work, indeed.

>> Do I get it right that the encoder currently keeps blocks in memory?
>
> By default when using `(eris-encode content)`, yes. The blocks are
> stored in an alist.
>
> But the encoder is implemented as an SRFI-171 transducer that eagerly
> emits (reduces) encoded blocks. So one could do this:
>
>   (eris-encode content #:block-reducer my-backend)
>
> Where `my-backend` is a SRFI-171 reducer that takes care of the
> blocks as soon as they are ready. The IPFS example implements a
> reducer that stores blocks to IPFS. By default `eris-encode` just
> uses `rcons` from `(srfi srfi-171)`.

Ah, I see, that’s great! I’m not familiar with the transducer API so I
always have to think twice (or more) about what’s going on; the
flexibility it gives here is really nice.

> The encoding transducer is stateful, but it only keeps references to
> blocks in memory, at most log(n) of them at any moment, where n is
> the number of blocks to encode.
>
> The decoding interface currently looks like this:
>
>   (eris-decode->bytevector eris-urn
>     (lambda (ref) (get-block-from-my-backend ref)))

OK.

>> Do you have plans to provide an interface to the storage backend so
>> one can easily switch between in-memory, Datashards, IPFS, etc.?
>
> Currently the interface is a bit "low-level": provide a SRFI-171
> reducer. This can definitely be improved and I'd be happy for ideas
> on how to make this more ergonomic.

Maybe that’s all we need after all. What would be nice is a couple of
examples, like a high-level procedure or CLI that can insert into or
fetch from either (say) a local GDBM database or IPFS. That would
illustrate integration with backends as well as the high-level API.

Thanks!

Ludo’.
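
P.S. To make that suggestion a bit more concrete, here is a rough,
untested sketch of the insert side: a backend-agnostic SRFI-171
reducer. I'm assuming the encoder calls the reducer with one encoded
block at a time (I haven't checked the exact calling convention), and
`store-block!` is a made-up one-argument procedure standing in for a
GDBM, IPFS, or in-memory store:

  ;; Return a SRFI-171 reducer that hands each block to STORE-BLOCK!.
  ;; Per SRFI-171, a reducer is callable with zero arguments (seed),
  ;; one argument (finalization), or two arguments (accumulation step).
  (define (storing-reducer store-block!)
    (case-lambda
      (() #f)                ;seed: nothing to accumulate
      ((acc) acc)            ;finalization: nothing to do
      ((acc block)           ;step: persist the block right away
       (store-block! block)
       acc)))

One would then write, following the interface you quoted:

  (eris-encode content #:block-reducer (storing-reducer my-store!))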
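
The fetch side would mirror it, reusing the decoding interface you
showed; `get-block` is again a hypothetical backend lookup mapping a
reference to its block:

  (define (fetch-bytevector eris-urn get-block)
    ;; GET-BLOCK could query GDBM, IPFS, a hash table, etc.
    (eris-decode->bytevector eris-urn
      (lambda (ref) (get-block ref))))

A high-level procedure or CLI could then simply pick the pair of
`store-block!` and `get-block` procedures according to the backend the
user asked for.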
> I looked into block boundaries with a "sliding hash" (re-compute a > short > hash for every byte read and choose boundaries when hash is > zero). This > would allow a higher degree of de-duplication, but I found this to be > a > bit "finicky" (and myself too impatient to tune and tweak this :). > > I settled on fixed block sizes, making the encoding faster and > preventing > information leaks based on block size. Yeah, sounds reasonable. (I evaluated the benefits of this and other approaches years ago, FWIW: <https://hal.inria.fr/hal-00187069/en>.) > An other idea to increase de-duplication: When encoding a directory, > align files to the ERIS block size. This would allows de-duplication > of > files across encoded images/directories. I guess that’d work, indeed. >> Do I get it right that the encoder currently keeps blocks in memory? > > By default when using `(eris-encode content)`, yes. The blocks are > stored into an alist. > > But the encoder is implemented as an SRFI-171 transducer that eagerly > emits (reduces) encoded blocks. So one could do this: > > (eris-encode content #:block-reducer my-backend) > > Where `my-backend` is a SRFI-171 reducer that takes care of the blocks > as soon as they are ready. The IPFS example implements a reducer that > stores blocks to IPFS. By default `eris-encode` just uses `rcons` from > `(srfi srfi-171)`. Ah, I see, that’s great! I’m not familiar with the transducer API so I always have to think twice (or more) about what’s going on; the flexibility it gives here is really nice. > The encoding transducer is state-full. But it only keeps references to > blocks in memory and at most log(n) at any moment, where n is the > number of blocks to encode. > > The decoding interface currently looks likes this: > > (eris-decode->bytevector eris-urn > (lambda (ref) (get-block-from-my-backend ref))) OK. >> Do you have plans to provide an interface to the storage backend so >> one >> can easily switch between in-memory, Datashards, IPFS, etc.? > > Currently the interface is a bit "low-level" - provide a SRFI-171 > reducer. This can definitely be improved and I'd be happy for ideas on > how to make this more ergonomic. Maybe that’s all we need after all. Maybe what would be nice is a couple of examples, like a high-level procedure or CLI that can insert or fetch from either (say) a local GDBM database or IPFS. That would illustrate integration with backends as well as the high-level API. Thanks! Ludo’.