You can start with http://homoglyphs.net/?unicodepos=1; searching for the term "homoglyphs" may turn up even more extensive lists.
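
For what it's worth, here is a minimal sketch of how such a list could be turned into fuzzy regexes automatically. It is written in Python rather than as SpamAssassin rule syntax, and everything in it is an assumption for illustration: the tiny `HOMOGLYPHS` table (a real one would be generated from a homoglyph list like the one above), and the helper names `letter_pattern`, `fuzzy_regex` and `is_obfuscated`. Each Latin letter is expanded to an alternation of the letter itself, its look-alike characters, and the decimal and hex HTML entities for those look-alikes.

```python
import re

# Hypothetical, deliberately tiny table: Latin letter -> look-alike code points.
# A real table would be generated from a full homoglyph list.
HOMOGLYPHS = {
    "a": [0x0430],          # CYRILLIC SMALL LETTER A
    "c": [0x0441],          # CYRILLIC SMALL LETTER ES
    "e": [0x0435],          # CYRILLIC SMALL LETTER IE
    "o": [0x03BF, 0x043E],  # GREEK SMALL LETTER OMICRON, CYRILLIC SMALL LETTER O
    "p": [0x0440],          # CYRILLIC SMALL LETTER ER
}


def letter_pattern(ch):
    """Regex fragment matching ch, its look-alikes, or their HTML numeric entities."""
    alts = [re.escape(ch)]
    for cp in HOMOGLYPHS.get(ch.lower(), []):
        alts.append(re.escape(chr(cp)))    # the raw look-alike character
        alts.append("&#0*%d;" % cp)        # decimal entity, e.g. &#959;
        alts.append("&#[xX]0*%x;" % cp)    # hex entity, e.g. &#x3bf;
    return "(?:" + "|".join(alts) + ")"


def fuzzy_regex(word):
    """Compile a case-insensitive regex matching word or any look-alike spelling of it."""
    return re.compile("".join(letter_pattern(c) for c in word), re.IGNORECASE)


def is_obfuscated(word, text):
    """True if text contains a look-alike spelling of word that is not the plain spelling."""
    for m in fuzzy_regex(word).finditer(text):
        if m.group(0).lower() != word.lower():
            return True
    return False


if __name__ == "__main__":
    print(is_obfuscated("account", "Your acc&#959;unt is locked"))
    print(is_obfuscated("account", "Your account is locked"))
```

The per-letter fragments returned by `letter_pattern` are ordinary regex alternations, so they could probably be adapted into per-letter tag definitions of the kind 25_replace.cf uses, but I have not tried that.
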

On 1 January 2015 at 03:54, John Hardin <jhar...@impsec.org> wrote:

> On Wed, 31 Dec 2014, Martin Gregorie wrote:
>
>> During last night I received a phishing message with a new (to me
>> anyway) form of obfuscation which can only be used inside HTML body text
>> using us-ascii encoding. The obfuscation was apparently aimed at SA and
>> similar scanners because its not obvious to anybody reading the message:
>> every 'o' (0x6f) in the text is replaced by ο
>>
>> My Perl-fu isn't good enough to encode this in a regex - can anybody
>> help?
>
> Take a look at 25_replace.cf (esp. tags C and E), and the various FUZZY_*
> rules. It's not feasible to do broadly, but specific commonly-obfuscated
> words and short phrases can be focused on and that potentially would help
> Bayes recognize such as spammy more quickly.
>
> I've been extending 25_replace.cf as I see more different types of
> obfuscation like this, but it's a bit hard to keep up. Given a list of
> Unicode code points that look like specific Latin letters, it should not be
> hard to automatically generate the tag subrules for obfuscation for all the
> encodings.
>
> Is there such a list anywhere already that could be leveraged? I know we
> were discussing unicode normalization of body text at one point, is there
> anything there we could use?
>
> --
> John Hardin KA7OHZ http://www.impsec.org/~jhardin/
> jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
> key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
> -----------------------------------------------------------------------
> It is not the business of government to make men virtuous or
> religious, or to preserve the fool from the consequences of his own
> folly. -- Henry George
> -----------------------------------------------------------------------
> 944 days since the first successful private support mission to ISS
> (SpaceX)