List Mail User wrote:
>
> How about the case of "http=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2F"
> inside of HTML? i.e. http://www.cnn.com/2003/ - from a "phishing spam",
> the full line was:
>
> =3Chttp=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2FWORLD=2Fafrica=2F07=2F20=2Fkenya=2Ecrash=2Findex=2Ehtml=3E
>
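For reference, decoding that line as quoted-printable gives back the plain URI. Here is a minimal Python sketch using the standard `quopri` module; the names `RAW_LINE` and `decode_qp` are only for illustration, and this is not how SA's own decoder is written:

```python
import quopri

# The quoted line from the spam, as it appears before any decoding.
RAW_LINE = (
    b"=3Chttp=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2FWORLD=2Fafrica"
    b"=2F07=2F20=2Fkenya=2Ecrash=2Findex=2Ehtml=3E"
)


def decode_qp(raw: bytes) -> str:
    """Decode a quoted-printable byte string and return it as text."""
    return quopri.decodestring(raw).decode("ascii", errors="replace")


decoded = decode_qp(RAW_LINE)
print(decoded)
```

Once the body has gone through a step like this, a URI rule only ever sees the ordinary `<http://...>` form.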
I thought these were only interpreted in quoted-printable, which SA can handle anyway. So I'm talking about the decoded body (not rawbody). A MUA that interprets '=' this way even in HTML mail would be seriously broken. I guess some MUAs will still guess that a message is in QP even if the header says it's plain text, but this should be handled by the decoder anyway.

> which itself was a continuation of a previous line. If you allow for more
> than just ASCII or UTF-8, there are quite a few "words" that can be built
> from the first six letters of the alphabet - and a much greater amount if
> we include "elite-speak". The above example need not have been a "phish"
> using cnn.com, but just as easily could have been a spamvertised domain or
> have been valid non-spam HTML. Unfortunately the case of MUAs accepting
> non-standard (re. illegal) HTML constructs is the most common case (e.g.
> Outlook and OE as well as many more MUAs which *need* to be able to read
> the same emails under MS Win*). And still more cases of URIs exist, which
> are not parsed by SA, but can have constructs like these with embedded
> domain names (e.g. "Message-ID:" lines). Life would be much easier if all
> URIs were contained within '<' and '>' (as at least one "standard" requires).
>
> The problem is that sometimes '=' is a word break, and sometimes
> it is used as either a continuation or meta-character. Find a rule with
> a very good rate at disambiguating these cases (for example, an '=' as the
> final character on a line can probably almost never be ignored), and file
> a Bugzilla; I'm sure the developers would at least look at whatever you
> come up with. Remember to also handle '%', '#' and '$' while you're at
> it:-)
>

Well, one can find rules for the case of http://..., but it's hard to find good ones when there is no scheme part.
That is because, as you said before, it is hard to guess how "foo REMOVESPACE & REMOVESPACE something.example" would be interpreted by the MUA (in http://foo REMOVESPACE &..., there is no ambiguity, as "foo&" is clearly part of the URI).

BTW, I have some rules to tag mail when the host part of the URI contains '&' (I don't see why the hostname part should contain this). I wonder if I can just tag whenever the host part contains any character other than \d\w\.-_:@? This would obviously reject encoded URIs. Is there enough ham where the hostname part is encoded?
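To make the idea concrete, here is a rough sketch of that host-part check in Python. It is not an actual SA rule; `UNEXPECTED` and `suspicious_host` are hypothetical names, and it only covers URIs that carry a scheme. One detail to watch: inside a regex character class, `.-_` written in that order is read as a range, so the hyphen is placed last here.

```python
import re
from urllib.parse import urlsplit

# Characters accepted in the host part: ASCII word characters (these
# already cover digits and '_'), plus '.', ':', '@' and '-'.
# The '-' is last so that it is a literal and not a range operator.
UNEXPECTED = re.compile(r"[^\w.:@-]", re.ASCII)


def suspicious_host(uri: str) -> bool:
    """Return True if the host part has a character outside the accepted set."""
    # netloc is the part between '//' and the next '/', '?' or '#'.
    netloc = urlsplit(uri).netloc
    return bool(UNEXPECTED.search(netloc))


for uri in (
    "http://www.cnn.com/2003/",
    "http://foo&something.example/",
    "http://%77%77%77.example.com/",
):
    print(uri, suspicious_host(uri))
```

As expected, the percent-encoded host is flagged as well, so whether this is usable still depends on how much ham has an encoded hostname part. A string with no scheme yields an empty host part and is never flagged, which is the hard case mentioned above.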