Timo Sirainen wrote:
On Mon, 2010-07-19 at 18:30 +0100, William Blunn wrote:
Consider storing the recovery filter stack in the dbox metadata rather
than the attachment file.
This has a couple of upshots:
1. If one person receives a message with an attachment which is encoded
with base64 at say 19 cells (76 bytes) per line, and then re-sends the
same file as an attachment to someone else but their MUA encodes base64
at say 18 cells (72 bytes) per line, the attachment file can contain
exactly the same data, allowing for deduplication even in this case.
I thought about that also, but it would require calculating and using a
hash of the decoded message (but not the compressed message). Could get
complex.
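
To illustrate the point about hashing, here is a minimal Python sketch (not Dovecot code). The helper `encode_with_width` and the sample payload are hypothetical, and SHA-1 is only a stand-in for whatever hash would actually be used. It shows that the same attachment wrapped at 76 and at 72 characters per line hashes differently as raw MIME text, but identically once decoded:

```python
import base64
import hashlib


def encode_with_width(payload: bytes, width: int) -> bytes:
    """Base64-encode payload, wrapping at `width` characters with CRLF line ends."""
    b64 = base64.b64encode(payload)
    return b"".join(b64[i:i + width] + b"\r\n" for i in range(0, len(b64), width))


# Stand-in for an attachment body (assumption: any byte string will do).
payload = bytes(range(256)) * 4

raw_76 = encode_with_width(payload, 76)  # 19 cells per line
raw_72 = encode_with_width(payload, 72)  # 18 cells per line

# Hashes of the raw encoded streams differ because the line breaks differ.
raw_hashes_differ = hashlib.sha1(raw_76).digest() != hashlib.sha1(raw_72).digest()

# b64decode skips the CRLFs by default, so both decode to the same bytes.
decoded_hashes_match = (
    hashlib.sha1(base64.b64decode(raw_76)).digest()
    == hashlib.sha1(base64.b64decode(raw_72)).digest()
)
```

So deduplicating across differently wrapped encodings would need the hash taken over the decoded bytes, as stated above.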
BTW I am not attempting to suggest a complete system for de-duplication,
but rather to suggest a means by which file contents could be made
identical, so that "something else" could de-duplicate them by some
other means.
I would be interested to know what the hash you mention is needed for.
Also I would be interested to know why the hash of the fragment of the
original message stream (regardless of base64 decodability) would not
be sufficient.
And if it isn't...
if (base64_smart_decode(&raw_data, &decoded_data, &chars_per_line) == SUCCESS) {
    // store decoded_data to attachment file
    // recovery_filter = "base64_" .concat. chars_per_line
} else {
    // store raw_data to attachment file
    // recovery_filter = nothing
}
// make hash of attachment file
// store pointer to dbox metadata including recovery_filter
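
The pseudocode above could be fleshed out roughly as follows. This is a hedged Python sketch, not Dovecot's implementation: the function names, the `base64_<width>` filter string, the CRLF line endings and the use of SHA-1 are all assumptions made for illustration. The "smart" decode only succeeds when re-encoding at the detected line width reproduces the raw bytes exactly, so the original message can always be recovered.

```python
import base64
import binascii
import hashlib


def base64_reencode(decoded: bytes, width: int) -> bytes:
    """Re-encode as base64, `width` characters per line, CRLF line endings."""
    b64 = base64.b64encode(decoded)
    return b"".join(b64[i:i + width] + b"\r\n" for i in range(0, len(b64), width))


def base64_smart_decode(raw: bytes):
    """Return (decoded, chars_per_line), or None if raw is not exactly recoverable."""
    lines = raw.split(b"\r\n")
    if len(lines) < 2 or lines[-1] != b"":
        return None  # no trailing CRLF, so not the layout assumed here
    width = len(lines[0])
    if width == 0:
        return None
    try:
        decoded = base64.b64decode(b"".join(lines[:-1]), validate=True)
    except binascii.Error:
        return None
    # Accept only if the recovery filter would regenerate the input byte for byte.
    if base64_reencode(decoded, width) != raw:
        return None
    return decoded, width


def store_attachment(raw: bytes):
    """Return (bytes for the attachment file, recovery_filter, hash of the file)."""
    result = base64_smart_decode(raw)
    if result is not None:
        stored, width = result
        recovery_filter = f"base64_{width}"
    else:
        stored, recovery_filter = raw, ""
    return stored, recovery_filter, hashlib.sha1(stored).hexdigest()


def recover(stored: bytes, recovery_filter: str) -> bytes:
    """Rebuild the original message fragment from the file and the metadata."""
    if recovery_filter.startswith("base64_"):
        return base64_reencode(stored, int(recovery_filter[len("base64_"):]))
    return stored
```

With this arrangement, two messages carrying the same attachment at 72 and 76 characters per line would produce identical attachment files (and hashes), differing only in the `recovery_filter` kept in the metadata.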
2. Assuming we have configured Dovecot to decode base64 but not to
compress, then the file in which we store the attachment data contains
literally the exact same byte stream as if the attachment were saved out
from the MUA. I don't know what practical use this might be, but it
/sounds/ cool :-) Perhaps a suitable filesystem or backup-system could
deduplicate both a file *and* its instance as a message attachment.
I was thinking about adding some small header to the dbox file, so they
wouldn't be completely identical.
Though that is kind of the point. If everything in the small header can
go somewhere else then the small header can go away and we can store the
attachment very literally.
What kind of things are you thinking of putting in the small header?
Bill