On Wed, 31 Dec 2014, Martin Gregorie wrote:

> Last night I received a phishing message with a new (to me
> anyway) form of obfuscation which can only be used inside HTML body text
> using us-ascii encoding. The obfuscation was apparently aimed at SA and
> similar scanners because it's not obvious to anybody reading the message:
> every 'o' (0x6f) in the text is replaced by ο (Greek small letter
> omicron, U+03BF), written as an HTML character reference.
>
> My Perl-fu isn't good enough to encode this in a regex - can anybody
> help?

Take a look at 25_replace.cf (esp. tags C and E), and the various FUZZY_* rules. Covering every possible substitution isn't feasible, but rules can focus on specific commonly-obfuscated words and short phrases, and hits on those could help Bayes learn to treat such messages as spammy more quickly.

I've been extending 25_replace.cf as I see more different types of obfuscation like this, but it's a bit hard to keep up. Given a list of Unicode code points that look like specific Latin letters, it should not be hard to automatically generate the tag subrules for obfuscation for all the encodings.
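
A minimal sketch of that generation step, in Python, might look like the following. Everything specific in it is an assumption made for illustration: the two-entry lookalike table is hand-picked and nowhere near complete, the tag name OU is hypothetical, and only two forms are emitted per code point (UTF-8 bytes and HTML numeric character references). Whether the character-reference alternatives are needed at all depends on whether the rule sees raw or rendered text, and the hex form assumes the rule is matched case-insensitively.

```python
# Hypothetical, hand-picked sample; a real table would be far longer.
LOOKALIKES = {
    "O": [0x03BF, 0x043E],  # Greek small omicron, Cyrillic small o
}


def utf8_escape(cp):
    """Return the UTF-8 bytes of a code point as \\xNN regex escapes."""
    return "".join("\\x%02x" % b for b in chr(cp).encode("utf-8"))


def html_refs(cp):
    """Return regex fragments for decimal and hex character references,
    allowing leading zeros (e.g. &#959; or &#x3bf;)."""
    return ["&#0*%d;" % cp, "&#x0*%x;" % cp]


def tag_expression(letter, code_points):
    """Build one alternation: the plain letter plus every lookalike form."""
    alts = [letter.lower()]
    for cp in code_points:
        alts.append(utf8_escape(cp))
        alts.extend(html_refs(cp))
    return "(?:" + "|".join(alts) + ")"


def replace_tag_line(tag, letter, code_points):
    """Format a replace_tag line; the tag name is chosen by the caller."""
    return "replace_tag %s %s" % (tag, tag_expression(letter, code_points))


for letter, cps in LOOKALIKES.items():
    # "U" suffix is an arbitrary, made-up naming scheme for these tags.
    print(replace_tag_line(letter + "U", letter, cps))
```

The generated line would still need to be referenced from a rule body as <OU> and that rule listed in replace_rules, the same way the existing tags in 25_replace.cf are used.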

Is there such a list anywhere already that could be leveraged? I know we were discussing Unicode normalization of body text at one point. Is there anything there we could use?

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  It is not the business of government to make men virtuous or
  religious, or to preserve the fool from the consequences of his own
  folly.                                              -- Henry George
-----------------------------------------------------------------------
 944 days since the first successful private support mission to ISS (SpaceX)
