lude wrote:
Hi John,
thanks for the detailed answer.
You wrote:
If you're indexing a
multipart/alternative bodypart then index all the MIME headers, but only
index the content of the *first* bodypart.
Does this mean you index just the first file-attachment?
What do you advise if you have to index multipart bodies (== more than
one file-attachment)?
One lucene-document for each part (==file)?
How do you handle the queries?
MIME has no concept of "attachment", that's something that the user
agent programs have a concept of -- you "attach" a file to a message.
The file might be a picture, a word document, a compressed tar archive
-- as far as MIME is concerned they're all the same (well, apart from the
content-* headers that describe what's "attached"). The MIME type for
a message with "attachments" is "multipart". There are several
subtypes though. If you're typing a plain text message (whose MIME
type is text/plain, a message like this one) and you attach a jpeg image
to it you'll be sending a message whose type is multipart/mixed; the
first part will have type text/plain and the second image/jpeg. In
Google Mail under "more options" you can "show original" to see the
complete MIME message and you'll see the different parts separated by a
boundary.
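
To make that concrete, here is a minimal sketch using Python's standard
email package (my choice of language purely for illustration). The
addresses, subject and file name are made up, and the "jpeg" bytes are a
placeholder rather than a real image. It builds a text/plain message,
attaches an image to it, and shows that the result is multipart/mixed
with a text/plain part followed by an image/jpeg part.

```python
from email.message import EmailMessage


def build_mixed_message() -> EmailMessage:
    """Build a text/plain message and attach a (placeholder) jpeg to it."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"   # made-up address
    msg["To"] = "reader@example.com"     # made-up address
    msg["Subject"] = "A picture for you"
    # At this point the message is an ordinary text/plain message.
    msg.set_content("Here is the picture I mentioned.\n")
    # Attaching something turns the whole message into multipart/mixed.
    msg.add_attachment(
        b"\xff\xd8\xff\xe0 placeholder, not a real image",
        maintype="image",
        subtype="jpeg",
        filename="picture.jpg",
    )
    return msg


msg = build_mixed_message()
part_types = [part.get_content_type() for part in msg.iter_parts()]
raw = msg.as_string()          # serialising generates the boundary
boundary = msg.get_boundary()

print(msg.get_content_type())  # multipart/mixed
print(part_types)              # ['text/plain', 'image/jpeg']
print(raw)                     # the parts, separated by "--" + boundary
```

Printing raw gives roughly what "show original" shows you: the headers,
then each part introduced by a line starting with "--" and the boundary.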
OK. Now I'm in a position to answer your question. Often, when you
send an HTML formatted message the content of the message is sent twice:
once as text/plain and once as text/html (or multipart/related if it has
pictures and stuff). The two parts are alternatives, apart from the
formatting (and pictures) there's no difference between the two parts,
you can read either. The best fidelity of the alternatives (and there
can be more than two) is last, the poorest fidelity first, but the
intent of the sender is that you can read any of them. This is a
multipart/alternative bodypart. Because all parts of the
multipart/alternative carry the same text, you can index any of them,
so index the first as that's going to be the easiest to process (it's
almost always going to be text/plain).
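
A rough sketch of that rule, again with Python's standard email package
(indexable_leaves is a name I've made up, and this is one way to do it
under the assumptions above, not the only one). It walks the message
tree, descends into only the first part of any multipart/alternative,
and takes every part of any other multipart. The headers of the message
are collected separately so they can all be indexed.

```python
from email.message import EmailMessage


def indexable_leaves(part):
    """Yield the leaf bodyparts to index, first alternative only."""
    if part.get_content_type() == "multipart/alternative":
        alternatives = list(part.iter_parts())
        if alternatives:
            # All alternatives carry the same text; the first is the
            # simplest to process (almost always text/plain).
            yield from indexable_leaves(alternatives[0])
    elif part.is_multipart():
        # multipart/mixed and friends: every sub-part matters.
        for sub in part.iter_parts():
            yield from indexable_leaves(sub)
    else:
        yield part


# A made-up message sent as both text/plain and text/html.
msg = EmailMessage()
msg["Subject"] = "xyzzy"
msg.set_content("Nothing happens when you say xyzzy.\n")
msg.add_alternative(
    "<p>Nothing happens when you say <b>xyzzy</b>.</p>", subtype="html"
)

headers = dict(msg.items())            # index all of these
leaves = list(indexable_leaves(msg))   # index the content of these
leaf_types = [leaf.get_content_type() for leaf in leaves]
print(leaf_types)  # ['text/plain']
```

The text/html alternative is never touched, which saves you from
stripping markup just to get at the same words.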
I've skipped loads. You need to read the RFCs. Start with RFC2045
(http://www.rfc.net/rfc2045.html) and keep going. If you get stuck with
the details of how messages are constructed, go back and read RFC2822
first, or at least skim it (it's quite long). Note that RFC2045
references RFC822 in its abstract; wherever you see references to
RFC821 and RFC822 you can read them as references to RFC2821 and RFC2822
respectively -- the newer ones are a little more precise when they need
to be and have rather more explanation of awkward cases that you need to
know about.
Someone earlier (and I'm sorry, I deleted the message before realising
I should reply) said something about attached files really being in an
attached .tar.gz file. Well, yes and no. An attached compressed tar
archive is a bodypart like any other and will need to be indexed like
any other. That will involve breaking it open and indexing the files
that it contains. It's not really any different to indexing an
OpenOffice document (which is actually a zip file).
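
As a sketch of the "breaking it open" step, assuming you already have
the decoded bytes of the bodypart in hand: Python's standard tarfile
module will read a compressed tar archive straight from memory. The
function names and the sample file are made up for illustration; what
you do with each extracted file (hand it to the right text extractor,
then to the indexer) is up to you.

```python
import io
import tarfile


def iter_tar_members(data: bytes):
    """Yield (name, bytes) for each regular file in a tar archive."""
    # Mode "r:*" lets tarfile detect the compression (gzip etc.) itself.
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member).read()


def make_sample_archive() -> bytes:
    """Build a tiny .tar.gz in memory so the sketch can be run as-is."""
    payload = b"You say xyzzy and nothing happens."
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="notes.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


members = list(iter_tar_members(make_sample_archive()))
print(members)  # [('notes.txt', b'You say xyzzy and nothing happens.')]
```

A zip-based document can be opened the same way with the standard
zipfile module.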
You also mentioned indexing each bodypart ("attachment") separately.
Why? When I'm searching, am I going to look for the word "xyzzy" in
the first bodypart? What if it was a multipart/alternative and
Thunderbird (in my case) suppressed the first bodypart and "xyzzy" is
something that couldn't be rendered in the (first) text/plain
alternative? To my mind, there is no use case where it makes sense to
search a particular bodypart. There *might* be a case for searching the
"prime" bodypart and "attachments" but when you read the MIME spec
you'll realise that detecting what the user sees as an attachment is not
easy: it gets even harder when you discover that different mail user
agents have different and legal (and sometimes reasonable) ways of
deciding whether to treat something as in-line or as an attachment. To
be honest, people don't remember whether something was an attachment.
They think "I remember reading about xyzzy in a mail message" and go off
looking for that. They often can't tell, and even less often remember,
that the "xyzzy" was in something that you decided was an attachment. And
if your rules for deciding whether you have something that's intended to
be viewed as an attachment or in-line are different to the rules that
the user's mail reader is using then you'll have Awkward Bugs to
explain. You'll read about "Content-Disposition" in the RFCs, but
don't believe that it's a foolproof way of deciding whether or not
something is an attachment: lack of a Content-Disposition header doesn't
mean "inline" or "attachment", and Microsoft, bless, have weird rules all
of their own for deciding whether to display something in-line or not.
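
To see why the header alone won't settle it, here is a small sketch
(stated_disposition is a made-up helper, and the message is invented).
It reports only what each part says about itself, and has to return
"unspecified" when the part says nothing, which is exactly the case
where mail readers go their own way.

```python
from email.message import EmailMessage


def stated_disposition(part) -> str:
    """Return what the part itself says; 'unspecified' if it says nothing."""
    disposition = part.get_content_disposition()  # None when header absent
    return disposition if disposition is not None else "unspecified"


msg = EmailMessage()
msg.set_content("See the attached notes.\n")
msg.add_attachment("xyzzy\n", filename="notes.txt")

dispositions = [stated_disposition(part) for part in msg.iter_parts()]
print(dispositions)  # ['unspecified', 'attachment']
```

The first part carries no Content-Disposition at all; any guess about
how it will be displayed is your guess, not something the message tells
you.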
jch