On Tue, 9 Dec 2003 07:13:11 -0500, Scott Sprunger <[EMAIL PROTECTED]>
posted to spamassassin-talk:
> I wanted to test a theory so I've been trying to come up with a
> rule that will catch encoded strings in the subject of a message.
> So far I've tried the rules below, but none of them are hitting.
> Any suggestions?
>
> rawbody T_SBJT_ENC /^Subject: ?=\?(us\-ascii|iso\-8859\-1|windows\-1251)\?/i
> describe T_SBJT_ENC Subject uses encoding (us-ascii, ISO or windows)
> score T_SBJT_ENC .01
>
> full T_SBJT_ENC /^Subject: ?=\?(us\-ascii|iso\-8859\-1|windows\-1251)\?/i
> describe T_SBJT_ENC Subject uses encoding (us-ascii, ISO or windows)
> score T_SBJT_ENC .01
>
> header T_SBJT_ENC Subject =~ /=\?(us\-ascii|iso\-8859\-1|windows\-1251)\?/i
> describe T_SBJT_ENC Subject uses encoding (us-ascii, ISO or windows)
> score T_SBJT_ENC .01
<...>
> What I'm looking for are subject headers as shown below:
>
> Subject: =?us-ascii?B?MCBNZW4sIGl0IHJlYWxseSB3b3JrcyEgZnA=?= iwsgfb
> Subject: =?iso-8859-1?b?SSBhbSBub3cgdG90YWxseSBkZWJ0IGZyZWU=?=
> Subject: =?windows-1251?B?QmExayBmaTF0ZXJzPyAtIGZvcmdldA==?=

Assuming the Subject:raw modifier works (I just saw it in another
reply), you will need to clarify a couple of issues for yourself.

Are you looking for any RFC2047 encoding, or specifically the base64
type? The Subjects you list could have been encoded (validly) like
this just as well:

  Subject: =?us-ascii?Q?0_Men,_it_really_works!_fp?= iwsgfb
  Subject: =?iso-8859-1?Q?I_am_now_totally_debt_free?=
  Subject: =?windows-1251?Q?Ba1k_fi1ters=3F_-_forget?=

(or, seeing as they are in fact all just pure 7bit us-ascii, of course

  Subject: 0 Men, it really works! fp iwsgfb
  Subject: I am now totally debt free
  Subject: Ba1k fi1ters? - forget

so it's a pretty safe assumption that the encoding was used purely for
obfuscation, or out of incompetence. But I digress ...)

Are you looking for ISO-8859-1 in particular, or could this be
extended to cover other cases? In particular, 8859-15 is virtually
identical to -1 except for the Euro sign and some other minor tweaks,
and is getting more and more widespread. All of the other 8859 sets
are identical to US-ASCII in the lower 128 code points IIRC, and so
could be used to encode a message which is in fact in US-ASCII.

As a minor terminological nit, there are many ISO character sets other
than 8859, so your rule descriptions are not entirely accurate. In
particular, ISO-646 is plain ole 7-bit ASCII and ISO-10646 is Unicode.
There is also ISO-2022, which is used e.g. in Japan.
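
To check the claim that the three sample Subjects are really just
7-bit US-ASCII under the base64, here is a minimal Python sketch. It
is only an illustration and has nothing to do with how SpamAssassin
itself decodes headers; decode_subject is a helper name made up for
this example, and it leans on the standard library's
email.header.decode_header.

```python
from email.header import decode_header

# Raw Subject values copied from the quoted message.
SAMPLES = [
    "=?us-ascii?B?MCBNZW4sIGl0IHJlYWxseSB3b3JrcyEgZnA=?= iwsgfb",
    "=?iso-8859-1?b?SSBhbSBub3cgdG90YWxseSBkZWJ0IGZyZWU=?=",
    "=?windows-1251?B?QmExayBmaTF0ZXJzPyAtIGZvcmdldA==?=",
]


def decode_subject(raw):
    """Decode RFC 2047 encoded-words in a raw Subject value to text.

    Hypothetical helper for this illustration only.
    """
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Unencoded runs come back as bytes with charset None.
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    for raw in SAMPLES:
        text = decode_subject(raw)
        # isascii() is True when no character needed the declared charset.
        print(f"{text!r} ascii-only={text.isascii()}")
```

Every decoded Subject should come out as one of the plain strings
listed above, with ascii-only reported as True, which is what makes
the declared charset look gratuitous.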
(And then of course there's a whole lot of ISO standards which do not
standardize character encodings at all, but something else entirely :-)

Anyway, assuming base64 encoding is the target here and that all kinds
of ISO-8859 should trigger the rules, here's an attempt at synthesis:

header T_SBJT_ENC Subject:raw =~ /=\?(us\-ascii|iso\-8859\-[1-9][0-9]?|windows\-1251)\?b\?/i
describe T_SBJT_ENC Subject uses RFC2047 base64 encoding
score T_SBJT_ENC .01

I guess some variants of ISO-8859 would legitimately use base64 most
of the time, but unless you're using one of those encodings regularly,
this shouldn't matter much in practice.

/* era */

-- 
The email address era at iki dot fi is heavily spam filtered. If you
want to reach me, see the contact information link on my home page at
<http://www.iki.fi/era/> instead.
Just for kicks, imagine what it's like to get 500 pieces of spam for
each wanted message.
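
As a rough sanity check of the synthesized pattern, the sketch below
runs the same regular expression over the raw sample Subjects in
Python. This rests on some assumptions: Python's re module is not the
Perl engine SpamAssassin uses (this particular pattern uses nothing
where the two differ, as far as I can tell), the /i flag is translated
to re.IGNORECASE, rule_hits is a made-up helper name, and the
iso-8859-15 and utf-8 samples are invented for the test rather than
taken from real mail.

```python
import re

# Pattern copied from the header rule; /i becomes re.IGNORECASE.
RULE = re.compile(
    r"=\?(us\-ascii|iso\-8859\-[1-9][0-9]?|windows\-1251)\?b\?",
    re.IGNORECASE,
)

# Raw (undecoded) Subject values the rule is meant to catch.
SHOULD_HIT = [
    "=?us-ascii?B?MCBNZW4sIGl0IHJlYWxseSB3b3JrcyEgZnA=?= iwsgfb",
    "=?iso-8859-1?b?SSBhbSBub3cgdG90YWxseSBkZWJ0IGZyZWU=?=",
    "=?windows-1251?B?QmExayBmaTF0ZXJzPyAtIGZvcmdldA==?=",
    "=?ISO-8859-15?B?SGVsbG8=?=",  # invented sample
]

# Values the rule should leave alone.
SHOULD_MISS = [
    "=?iso-8859-1?Q?I_am_now_totally_debt_free?=",  # Q encoding, not base64
    "I am now totally debt free",                   # no encoding at all
    "=?utf-8?B?SGVsbG8=?=",                         # invented; charset not listed
]


def rule_hits(raw_subject):
    """Return True if the pattern matches anywhere in the raw Subject."""
    return RULE.search(raw_subject) is not None


if __name__ == "__main__":
    for subject in SHOULD_HIT + SHOULD_MISS:
        print(rule_hits(subject), subject)
```

Note that this only exercises the pattern. Whether SpamAssassin hands
the rule the raw or the decoded header is a separate question, and is
the reason for the Subject:raw modifier in the rule.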