On Tue, 2010-08-24 at 13:42 +0100, Ed W wrote:
> Hi
>
> > The idea is to have dbox and mdbox support saving attachments (or MIME
> > parts in general) to separate files, which with some magic gives a
> > possibility to do single instance attachment storage. Comments welcome.
>
> This is a really interesting idea. I have previously given it some
> thought. My 2p
>
> 1) Being able to ask "the server" if it has an attachment matching a
> specific hash would be useful for a bunch of other reasons.

If you have a hash 351641b73feb7cf7e87e5a8c3ca9a37d7b21e525, you can see
if it exists with:

ls /attachments/35/16/hashes/351641b73feb7cf7e87e5a8c3ca9a37d7b21e525
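Or in a bash script, assuming the two directory levels are simply the
first hex digits of the hash (as in the path above):

# same check as the ls above, with the directories derived from the hash
hash=351641b73feb7cf7e87e5a8c3ca9a37d7b21e525
test -e "/attachments/${hash:0:2}/${hash:2:2}/hashes/$hash" &&
    echo "attachment already stored"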
> This result
> needs to be (cryptographically) unique and hence the hash needs to be a
> good hash (MD5/SHA or better) of the complete attachment,

Currently it uses SHA1, but this can be changed anytime. I didn't bother
to make it configurable. The hash's security isn't a huge issue, since
it does a byte-by-byte comparison anyway.

> ideally after decoding

The hash is calculated after decoding the base64, if the attachment is
saved decoded, and that happens if it can be re-encoded exactly as it
was.

> 2) It might be useful to be able to find attachments with a specific
> hash regardless of whether the attachment has been spat out separately
> (think of a use case where we want to be able to spot a 2KB footer gif
> which on its own isn't worth worrying about, but some offline scan
> later discovers 90% of emails contain this gif and we wish to split it
> off as a policy decision).

I guess that would be possible, but it would require reading and parsing
all of the mail files. That could take a while. The finding part
wouldn't be all that much work, but separating attachments out of
already saved mails is kind of annoying.

> 3) Storing attachments by hash may be interesting for use with
> specialist filesystems, eg an interesting direction that dbox could take
> might be to store the headers and message text in some (compressed?)
> format with high linear read rates and most attachments in some
> key/value storage system?

The attachment I/O is done via a filesystem API, so this could easily be
done by just writing an FS API backend for a key-value database.

> 4) Many modern IMAP clients are starting to download attachments on
> demand. Need to be able to supply only parts of the email efficiently
> without needing to pull in the blobs. Stated another way, it's
> desirable not to peek inside the blobs to be able to fetch arbitrary
> MIME parts

This is already done .. in theory, anyway. I'm not sure yet if some
prefetching code causes the attachments to be read unnecessarily.
Should test it.

> 5) It's going to be easy to break signed emails... Need to be careful

Yeah, I wasn't planning on breaking them.

> 6) In many cases this isn't a performance win... It's still a *great*
> feature, but two disk seeks outweigh a lot of linear read speed.

Sure, it's not a performance win. But that's not what it was meant
for. :) Still, if only >1MB (or so) attachments were stored separately,
that should get rid of the worst offenders without impacting performance
much.

> 7) When something gets corrupted... It's worth pondering about how we
> can audit and find unreferenced "blobs" later?

Dovecot logs an error when it finds something unexpected, but there's
not a whole lot it can do at that point. And finding such broken
attachments .. well, I guess this will already do it:

doveadm fetch -A body all > /dev/null
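Unreferenced blobs are a different question. With the hash-guid plus
hashes/hash hard link layout I describe below, a hashes/ file whose
hash-guid counterpart has been deleted should drop back to a link count
of 1 (assuming nothing else links to it), so a rough, untested sketch
for listing such leftovers would be:

find /attachments -path '*/hashes/*' -type f -links 1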
> Some of the use cases I have for these features (just in case you
> care...). We have a feature which is a bit like the opposite of one of
> these services for sending big attachments. When users' email arrives we
> remove all attachments that meet our criteria and replace them with
> links to the files. This requires being able to give users a coded link
> which can later be decoded to refer to a specific attachment. If this
> change offered us additional ways to find attachments by hash or
> whatever then it would be extremely useful

I'm not sure if this change will help much. If the attachment changes
(especially in size) there will be problems..

> Another feature we offer is a client application which compresses and
> reduces bandwidth when sending/receiving emails. We currently don't try
> and hash bits of email, but it's an idea I have been mulling over for
> IMAP users where we typically see the data sent via SMTP, then uploaded
> to the IMAP "sent items", then often downloaded again when the client
> polls the sent items for new messages (durr). Being able to see if we
> have binary content which matches a specific hash could be extremely
> interesting

Related to that, I've been thinking of a transparent caching Dovecot
proxy.

> I'm not sure if with your current proposal I can do 100% of the above?
> For example it's not clear if 4) is still possible? Also without a
> "guaranteed" hash we can't use the hash as a lookup key in a key/value
> storage system (which implies another mapping of keys to keys is
> required).

Yeah, an attachment-instance-key -> attachment-key -> attachment data
lookup would be the only safe way to do this.

> Can we do an (efficient) offline scan of messages looking for
> duplicated hash keys (ie can the server calculate hashes for all
> attachment parts ahead of time)

Well .. the way it works is that you have files:

hash-guid
hash2-guid2
hashes/hash
hashes/hash2

If two attachments have the same hash but different content, you'll end
up with:

hash-guid1
hash-guid2
hashes/hash

where hash-guid1 and hash-guid2 are different files, and only one of
them is hard linked to hashes/hash.

To find duplicates, you can stat() all the files and find the ones that
have an identical hash but a different inode.
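For example, a rough shell sketch (untested; it assumes the file names
really use a "-" between the hash and the GUID, a test(1) that supports
-ef, and the example directory from above):

# list hash-guid files that share a hash with hashes/<hash> but aren't
# hard linked to it, i.e. identical hash but different inode
cd /attachments/35/16 || exit 1
for f in *-*; do
    hash=${f%%-*}    # the part of the file name before the first "-"
    if [ -e "hashes/$hash" ] && ! [ "$f" -ef "hashes/$hash" ]; then
        echo "same hash, different inode: $f"
    fi
done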