Re: Best Practice: emails and file-attachments

John Haxby Wed, 16 Aug 2006 02:39:45 -0700

lude wrote:

Hi John,


thanks for the detailed answer.

You wrote:

If you're indexing a
multipart/alternative bodypart then index all the MIME headers, but only
index the content of the *first* bodypart.


Does this mean you index just the first file-attachment?

What do you advise if you have to index multipart bodies (== more than one file-attachment)?
One lucene-document for each part (==file)?
How do you handle the queries?

MIME has no concept of "attachment"; that's something the user agent programs have a concept of -- you "attach" a file to a message. The file might be a picture, a Word document, a compressed tar archive -- as far as MIME is concerned they're all the same (well, apart from the Content-* headers that describe what's "attached"). The MIME type for a message with "attachments" is "multipart". There are several subtypes though. If you're typing a plain text message (whose MIME type is text/plain, a message like this one) and you attach a JPEG image to it, you'll be sending a message whose type is multipart/mixed; the first part will have type text/plain and the second image/jpeg. In Google Mail under "more options" you can "show original" to see the complete MIME message and you'll see the different parts separated by a boundary.
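To make the multipart/mixed structure concrete, here is a minimal sketch using Python's standard email package (not Lucene, and not anything from the original thread). The helper names, the subject line and the fake image bytes are all made up for illustration.

```python
from email.message import EmailMessage


def build_mixed_message() -> EmailMessage:
    """Build a plain text message and "attach" an image to it."""
    msg = EmailMessage()
    msg["Subject"] = "holiday photo"  # illustrative header value
    msg.set_content("Here is the picture.\n")  # a text/plain message so far
    # Placeholder bytes standing in for a real JPEG file (assumption).
    msg.add_attachment(
        b"\xff\xd8\xff\xe0 not a real image",
        maintype="image",
        subtype="jpeg",
        filename="photo.jpg",
    )
    return msg


def part_types(msg: EmailMessage) -> list:
    """List the MIME type of the message and of every bodypart inside it."""
    return [part.get_content_type() for part in msg.walk()]


if __name__ == "__main__":
    message = build_mixed_message()
    print(part_types(message))
    # The serialised message shows the parts separated by a boundary line.
    print(message.as_string())
```

Adding the "attachment" is what turns the text/plain message into a multipart/mixed container; the text and the image become ordinary bodyparts inside it.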

OK. Now I'm in a position to answer your question. Often, when you send an HTML formatted message the content of the message is sent twice: once as text/plain and once as text/html (or multipart/related if it has pictures and stuff). The two parts are alternatives: apart from the formatting (and pictures) there's no difference between them, and you can read either. The best fidelity of the alternatives (and there can be more than two) is last, the poorest fidelity first, but the intent of the sender is that you can read any of them. This is a multipart/alternative bodypart. Because all parts of a multipart/alternative carry the same text, you can index any of them, so index the first as that's going to be the easiest to process (it's almost always going to be text/plain).
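A rough sketch of the "index only the first alternative" rule, again using Python's standard email package rather than any Lucene API. The function names are hypothetical, and the sketch only gathers content; indexing the MIME headers of every alternative is left out.

```python
from email.message import EmailMessage


def build_alternative_message() -> EmailMessage:
    """Build a message sent as both text/plain and text/html."""
    msg = EmailMessage()
    msg.set_content("xyzzy in plain text\n")
    # Adding an alternative converts the message to multipart/alternative,
    # with the plain text (poorest fidelity) first and the HTML last.
    msg.add_alternative("<p>xyzzy in <b>HTML</b></p>\n", subtype="html")
    return msg


def text_to_index(part) -> str:
    """Collect indexable text, taking only the first alternative."""
    if part.get_content_type() == "multipart/alternative":
        alternatives = part.get_payload()
        return text_to_index(alternatives[0]) if alternatives else ""
    if part.is_multipart():
        # Any other container: gather text from every bodypart inside it.
        return "\n".join(text_to_index(p) for p in part.get_payload())
    if part.get_content_maintype() == "text":
        return part.get_content()
    return ""  # non-text leaf parts need their own extractors


if __name__ == "__main__":
    print(text_to_index(build_alternative_message()))
```

The recursion matters because a multipart/alternative can itself sit inside a multipart/mixed, and an alternative can in turn be a multipart/related.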

I've skipped loads. You need to read the RFCs. Start with RFC2045 (http://www.rfc.net/rfc2045.html) and keep going. If you get stuck with the details of how messages are constructed, go back and read RFC2822 first, or at least skim it (it's quite long). Note that RFC2045 references RFC822 in its abstract; wherever you see references to RFC821 and RFC822 you can read them as references to RFC2821 and RFC2822 respectively -- the newer ones are a little more precise when they need to be and have rather more explanation of awkward cases that you need to know about.

Someone earlier (and I'm sorry, I deleted the message before realising I should reply) said something about attached files really being in an attached .tar.gz file. Well, yes and no. An attached compressed tar archive is a bodypart like any other and will need to be indexed like any other. That will involve breaking it open and indexing the files that it contains. It's not really any different to indexing an OpenOffice document (which is actually a zip file).
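As an illustration of "breaking it open", the following sketch reads the files inside a compressed tar archive held in memory, using Python's standard tarfile module. The helper names are hypothetical, and treating every member as UTF-8 text is a simplifying assumption; a real indexer would pick an extractor per file type.

```python
import io
import tarfile


def make_tar_gz(files: dict) -> bytes:
    """Build a small .tar.gz in memory (stands in for a decoded bodypart)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def texts_in_tar_gz(payload: bytes) -> dict:
    """Map each regular file in the archive to its decoded text."""
    texts = {}
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            extracted = tar.extractfile(member)
            if extracted is None:
                continue
            # Assumption: content is text; undecodable bytes are replaced.
            texts[member.name] = extracted.read().decode("utf-8", errors="replace")
    return texts


if __name__ == "__main__":
    archive = make_tar_gz({"notes.txt": b"something about xyzzy"})
    print(texts_in_tar_gz(archive))
```

Nothing is written to disk here, which avoids the path problems that come with extracting an untrusted archive.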

You also mentioned indexing each bodypart ("attachment") separately. Why? When I'm searching, am I going to look for the word "xyzzy" in the first bodypart? What if it was a multipart/alternative and Thunderbird (in my case) suppressed the first bodypart and "xyzzy" is something that couldn't be rendered in the (first) text/plain alternative? To my mind, there is no use case where it makes sense to search a particular bodypart. There *might* be a case for searching the "prime" bodypart and "attachments" but when you read the MIME spec you'll realise that detecting what the user sees as an attachment is not easy: it gets even harder when you discover that different mail user agents have different and legal (and sometimes reasonable) ways of deciding whether to treat something as in-line or as an attachment. To be honest, people don't remember whether something was an attachment. They think "I remember reading about xyzzy in a mail message" and go off looking for that. They often can't tell and remember even less that the "xyzzy" was in something that you decided was an attachment. And if your rules for deciding whether you have something that's intended to be viewed as an attachment or in-line are different to the rules that the user's mail reader is using then you'll have Awkward Bugs to explain. You'll read about "Content-Disposition" in the RFCs, but don't believe that it's a foolproof way of deciding whether or not something is an attachment: the lack of a Content-Disposition header doesn't mean "inline" or "attachment", and Microsoft, bless, have weird rules all of their own for deciding whether to display something in-line or not.
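A small sketch of both points, using Python's standard email package: a bodypart may simply have no Content-Disposition header, and a single searchable document per message finds "xyzzy" wherever it was. The function names and the dictionary layout are illustrative assumptions, not a Lucene schema. For brevity this sketch collects every text part, so it does not apply the first-alternative rule for multipart/alternative.

```python
from email.message import EmailMessage


def build_message() -> EmailMessage:
    """A text message with a text file "attached"."""
    msg = EmailMessage()
    msg["Subject"] = "notes"
    msg.set_content("See the attached notes.\n")
    msg.add_attachment("All about xyzzy.\n", filename="notes.txt")
    return msg


def leaf_dispositions(msg: EmailMessage) -> list:
    """Content-Disposition of each non-container part; None if absent."""
    return [
        part.get_content_disposition()
        for part in msg.walk()
        if not part.is_multipart()
    ]


def message_document(msg: EmailMessage) -> dict:
    """Fold all text bodyparts into one document for the whole message."""
    texts = [
        part.get_content()
        for part in msg.walk()
        if not part.is_multipart() and part.get_content_maintype() == "text"
    ]
    return {"subject": str(msg["Subject"]), "body": "\n".join(texts)}


if __name__ == "__main__":
    message = build_message()
    print(leaf_dispositions(message))  # the first part has no header at all
    print("xyzzy" in message_document(message)["body"])
```

The missing header on the first part says nothing about how a mail reader will display it, which is why the sketch does not try to classify parts.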


jch
