List Mail User wrote:
> 
>       How about the case of "http=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2F"
> inside of HTML?   i.e. http://www.cnn.com/2003/ - from a "phishing spam",
> the full line was:
> 
> =3Chttp=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2FWORLD=2Fafrica=2F07=2F20=2Fkenya=2Ecrash=2Findex=2Ehtml=3E
> 

I thought these were only interpreted in quoted-printable, which SA can
handle anyway. So I'm talking about the decoded body (not rawbody). An MUA
that interprets '=' this way even in HTML mail would be seriously
broken. I guess some MUAs will still guess that a message is in QP even
if the header says it's plain text, but this should be handled by
the decoder anyway.
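
To illustrate what I mean by the decoder taking care of it, here is a minimal sketch in Python (not SA's actual code) that runs the quoted line through the standard library's quoted-printable decoder. The helper name decode_qp_line is made up for this example; only quopri.decodestring is a real library call.

```python
import quopri

# The raw line from the quoted message, as it appears before QP decoding.
raw_line = (
    b"=3Chttp=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2FWORLD=2Fafrica"
    b"=2F07=2F20=2Fkenya=2Ecrash=2Findex=2Ehtml=3E"
)


def decode_qp_line(line: bytes) -> str:
    """Decode one quoted-printable line (hypothetical helper for this sketch)."""
    return quopri.decodestring(line).decode("ascii")


decoded = decode_qp_line(raw_line)
print(decoded)
```

Once the body has been decoded like this, a body rule only ever sees the plain URI, so the "=3A=2F=2F" form should matter only to rules that look at the raw body.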

> which itself was a continuation of a previous line.  If you allow for more
> than just ASCII or UTF-8, there are quite a few "words" that can be built
> from the first six letters of the alphabet - and a much greater amount if
> we include "elite-speak".  The above example need not have been a "phish"
> using cnn.com, but just as easily could have been a spamvertised domain or
> have been valid non-spam HTML.  Unfortunately the case of MUAs accepting
> non-standard (re. illegal) HTML constructs is the most common case (e.g.
> Outlook and OE as well as many more MUAs which *need* to be able to read
> the same emails under MS Win*).  And still more cases of URIs exist, which
> are not parsed by SA, but can have constructs like these with embedded
> domain names (e.g. "Message-ID:" lines).  Life would be much easier if all
> URIs were contained within '<' and '>' (as at least one "standard" requires).
> 
>       The problem is that sometimes '=' is a word break, and sometimes
> it is used as either a continuation or meta-character.  Find a rule with
> a very good rate at disambiguating these cases (for example, an '=' as the
> final character on a line can probably almost never be ignored), and file
> a Bugzilla;  I'm sure the developers would at least look at whatever you
> come up with.  Remember to also handle '%', '#' and '$' while you're at
> it:-)
> 
> 

Well, one can find rules for the case of http://..., but it's hard to
find any when there is no scheme part, because, as you said before, it is
hard to guess how "foo REMOVESPACE & REMOVESPACE something.example"
would be interpreted by the MUA (in http://foo REMOVESPACE &..., there
is no ambiguity, as "foo&" is clearly part of the URI).

BTW, I have some rules to tag mail when the host part of the URI
contains '&' (I don't see why the hostname part should contain this). I
wonder if I can just tag when the host part contains any character other
than \d\w\.-_:@. This would obviously also flag encoded URIs. Is there
enough ham where the hostname part is encoded?
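
For what it's worth, the check I have in mind could be sketched like this in Python. This is only an illustration under my own assumptions, not an SA rule: the function name host_has_unusual_chars and both patterns are invented here, and "host part" is taken to mean everything between "scheme://" and the next '/', '?' or '#'.

```python
import re

# Capture the part between "scheme://" and the next '/', '?' or '#'.
AUTHORITY_RE = re.compile(r"^[a-z][a-z0-9+.\-]*://([^/?#]*)", re.IGNORECASE)

# Anything outside ASCII word characters (letters, digits, '_'), '.', '-', ':' and '@'.
UNUSUAL_RE = re.compile(r"[^\w.\-:@]", re.ASCII)


def host_has_unusual_chars(uri):
    """Return True/False for URIs with a scheme, None when there is no scheme part."""
    match = AUTHORITY_RE.match(uri)
    if match is None:
        return None  # no scheme: the ambiguous case, left undecided here
    return UNUSUAL_RE.search(match.group(1)) is not None


for candidate in (
    "http://www.cnn.com/2003/",
    "http://foo&something.example/",
    "http://www%2Ecnn%2Ecom/",
    "www.cnn.com/2003/",
):
    print(candidate, host_has_unusual_chars(candidate))
```

Returning None for the no-scheme case is deliberate: that is exactly where it is hard to guess what the MUA will do, so the sketch does not pretend to decide it.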
