I see that a lot in sextortion emails. So far, I’ve seen the word “bitcoin” encoded (obfuscated) the following ways:
bitc%D0%BEin
bit%D1%81oin
bit%D1%81%D0%BEin

And the word “wallet” as:

w%D0%B0ll%D0%B5t

These sextortion scammers are clever. So, instead of filtering on the word “bitcoin”, I now filter on a bitcoin address regex (see below) and some other words such as “pixel”, “virus”, etc., which are always a part of the sextortion message.

body __BITCOIN /\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b/

Steve

From: Mark London <m...@psfc.mit.edu>
Date: Thursday, June 28, 2018 at 2:26 PM
To: "users@spamassassin.apache.org" <users@spamassassin.apache.org>
Subject: Re: Using UTF-8 characters to avoid spam filter rules.

On 6/28/2018 1:46 PM, users-digest-h...@spamassassin.apache.org wrote:

Subject: Re: Using UTF-8 characters to avoid spam filter rules.
From: RW <rwmailli...@googlemail.com>
Date: 6/26/2018 12:12 PM
To: users@spamassassin.apache.org

On Tue, 26 Jun 2018 00:33:11 -0400 Mark London wrote:

Hi - Some of the words in the spam email below use UTF-8 characters to avoid spam detection. I.e., the characters in the phrase "bitcoin wallet address" are not the simple ASCII characters that they appear to be. View the source of my email to understand what I'm talking about. Is there any rule I can use to detect messages that are mostly plain ASCII characters, but contain enough UTF-8 characters that they have obviously been put in to avoid spam rules?

You can test for specific obfuscated words like this:

body FUZZY_BITCOIN /<B>(?!itcoin)<I><T><C><O><I><N>/i
replace_rules FUZZY_BITCOIN

For anything more general you'd have to match on lookalike characters from non-roman codepages embedded in ASCII (or roman) words. Finding accented characters or general multibyte UTF-8 is not particularly suspicious.

Thanks for the info. I had never come across this issue before, and was afraid that more spammers would start doing it.
In that case, I would think that a plain-text message containing a lot of "suspicious" multibyte UTF-8 characters embedded in roman-character words would be suspicious enough to flag. However, so far this spam message is the only one I've seen like that, so I won't worry about it for now.

- Mark
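
As a rough illustration of the more general check discussed in this thread (lookalike characters from a non-roman script embedded in otherwise roman words), here is a minimal Python sketch. It is not a SpamAssassin rule and is not taken from any poster's setup. The helper names script_of and mixed_script_words are hypothetical, and classifying a character by the prefix of its Unicode name is a simplifying assumption, not a full script lookup. The sample strings are the percent-encoded forms quoted above, decoded as UTF-8.

```python
import re
import unicodedata
from urllib.parse import unquote


def script_of(ch):
    """Coarse script label taken from the Unicode character name (assumption:
    the name prefix is good enough to tell Latin, Cyrillic and Greek apart)."""
    if not ch.isalpha():
        return None  # digits, underscores, etc. carry no script information
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "CYRILLIC", "GREEK"):
        if name.startswith(script):
            return script
    return "OTHER"


def mixed_script_words(text):
    """Return words that mix Latin letters with Cyrillic or Greek letters.

    Words written entirely in one script (including accented Latin or
    all-Cyrillic words) are not reported.
    """
    hits = []
    for word in re.findall(r"\w+", text):
        scripts = {script_of(ch) for ch in word} - {None}
        if "LATIN" in scripts and scripts & {"CYRILLIC", "GREEK"}:
            hits.append(word)
    return hits


# The obfuscated forms quoted in this thread, decoded from percent-encoding.
samples = ["bitc%D0%BEin", "bit%D1%81oin", "bit%D1%81%D0%BEin", "w%D0%B0ll%D0%B5t"]
decoded = [unquote(s) for s in samples]

for word in mixed_script_words(" ".join(decoded)):
    # Show each flagged word with the names of its non-ASCII characters.
    odd = [unicodedata.name(ch) for ch in word if ord(ch) > 127]
    print(word, odd)
```

Because only words that mix scripts are reported, ordinary accented text and messages written entirely in Cyrillic or Greek are left alone, which matches the point above that general multibyte UTF-8 is not suspicious on its own.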