lude wrote:
Hi John,

thanks for the detailed answer.

You wrote:
If you're indexing a
multipart/alternative bodypart then index all the MIME headers, but only
index the content of the *first* bodypart.

Does this mean you index just the first file-attachment?
What do you advice, if you have to index mulitpart bodys (== more then one
file-attachment)?
One lucene-document for each part (==file)?
How do you handle the queries?
MIME has no concept of "attachment", that's something that the user agent programs have a concept of -- you "attach" a file to a message. The file might be a picture, a word document, a compressed tar archive -- as far MIME is concerned they're all the same (well, apart from the content-* headers that describe what's "attached"). The MIME type for a message with "attachments" is "multipart". There are several subtypes though. If you're typing a plain text message (whose MIME type is text/plain, a message like this one) and you attach a jpeg image to it you'll be sending a message whose type is multipart/mixed; the first part will have type text/plain and the second image/jpeg. In Google Mail under "more options" you can "show original" to see the complete MIME message and you'll see the different parts separated by a boundary.

OK. Now I'm in a position to answer your question. Often, when you send an HTML formatted message the content of the message is sent twice: once as text/plain and once as text/html (or multipart/related if it has pictures and stuff). The two parts are alternatives, apart from the formatting (and pictures) there's no difference between the two parts, you can read either. The best fidelity of the alternatives (and there can be more than two) is last, the poorest fidelity first, but the intent of the sender is that you can read any of them. This is a multipart/alternative bodypart. Because all parts of the multipart/alternative have the same text then you can index any of them, so index the first as that's going to be the easiest to process (it's almost always going to be text/plain).

I've skipped loads. You need to read the RFCs. Start with RFC2045 (http://www.rfc.net/rfc2045.html) and keep going. If you get stuck with the details of how messages are constructed, go back and read RFC2822 first, or at least skim it (it's quite long). Note that RFC2045 references RFC822 in its abstract, where ever you see references to RFC821 and RFC822 you can read them as references to RFC2821 and RFC2822 respectively -- the newer ones are a little more precise when they need to be and have rather more explanation of awkward cases that you need to know about.

Someone earlier (and I'm sorry, I deleleted the message before realising i should reply) said something about attached files really being in an attached .tar.gz file. Well, yes and no. An attached compressed tar archive is a bodypart like any other and will need to be indexed like any other. That will involve breaking it open and indexing the files that it contains. It's not really any different to indexing an open office document (which is actually a zip file).

You also mentioned indexing each bodypart ("attachment") separately. Why? When I'm searching, am I going to look for the word "xyzzy" in the first bodypart? What if it was a multipart/alternative and Thunderbird (in my case) suppressed the first bodypart and "xyzzy" is something that couldn't be rendered in the (first) text/plain alternative? To my mind, there is no use case where it makes sense to search a particular bodypart. There *might* be a case for searching the "prime" bodypart and "attachments" but when you read the MIME spec you'll realise that detecting what the user sees as an attachment is not easy: it gets even harder when you discover that different mail user agents have different and legal (and sometimes reasonable) ways of deciding whether to treat something as in-line or as an attachment. To be honest, people don't remember whether something was an attachment. They think "I remember reading about xyzzy in a mail message" and go off looking for that. They often can't tell and remember even less that the "xyzzy" was in something that you decided was an attachment. And if your rules for deciding whether you have something that's intended to be viewed as an attachment or in-line are different to the rules that the user's mail reader is using then you'll have Awkward Bugs to explain. You'll read about "Content-Disposition" in the RFCs, but don't believe that it's a foolproof way of deciding whether or not something is an attachment, lack of a content-disposition header doesn't mean "inline" or "attachment" and Microsoft, bless, have weird rules all of their own for deciding whether to display something in-line or not.

jch

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to