On Tue, 30 Oct 2018, Cedric Knight wrote:

I thought of submitting a patch via Bugzilla, but then decided to first
ask and check that I understood the general principles of body checks,
and SpamAssassin's current approach to Unicode. Apologies for the length
of this message. I hope the main points make sense.

A fair number of webcam bitcoin 'sextortion' scams have evaded detection
and worried recipients because they include genuine credentials.
(Incidentally, I assume the credentials and addresses are mostly from
the 2012 LinkedIn breach, but someone on the RIPE abuse list reports
Mailman passwords were also used). BITCOIN_SPAM_05 is catching some of
this spam, but on writing body regexes to catch the wave around 16
October, I noticed that my rules weren't matching because the source was
liberally injected with invisible characters:
Content preview:  I a<U+200C>m a<U+200C>wa<U+200C>re blabla is one of
your pa<U+200C>ss. L<U+200C>ets   g<U+200C>et strai<U+200C>ght
to<U+200C> po<U+200C>i<U+200C>nt. No<U+200C>t o<U+200C>n<U+200C>e

Would you send me a zipped copy? I would like to update the ZW text obfuscation rule for that, and possibly others.

As minor points, 'Format' excludes a couple of separator characters in
the same range that instead match [:space:]
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:subhead=Format%20character:]
Then there is the C1 [:cntrl:] set, which some MUAs may render
silently, I think including the 0x9D matched by the recent
__UNICODE_OBFU_ZW (what's the significance of UNICODE in the rule name?):
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Control:]
Finally, there may be a case for also treating narrow blanks such as
U+200A (&hairsp;), U+202F and maybe U+205F as 'almost' invisible. The Perl
Unicode database may not be completely up to date here: Perl 5.18 doesn't
recognise U+061C, U+2066 and U+1BCA1 ranges as \p{Format}, although 5.24
does.
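
One way to check which of these code points a given Perl build treats as
\p{Format} is to probe them directly. The following is a minimal sketch and
not part of SpamAssassin; the helper name format_membership and the list of
candidates are illustrative assumptions, and the answers depend on the
Unicode version bundled with the running perl.

```perl
use strict;
use warnings;

# Code points mentioned in this thread. Whether each one is \p{Format}
# depends on the Unicode tables shipped with the running perl.
my @candidates = (0x061C, 0x180E, 0x200A, 0x200C, 0x2028, 0x2029,
                  0x202F, 0x205F, 0x2066, 0x1BCA1);

# Hypothetical helper: return a hash ref mapping each code point
# to 1 if it matches \p{Format} on this perl, otherwise 0.
sub format_membership {
    my @cps = @_;
    my %result;
    for my $cp (@cps) {
        $result{$cp} = chr($cp) =~ /\p{Format}/ ? 1 : 0;
    }
    return \%result;
}

my $membership = format_membership(@candidates);
printf "perl v%vd\n", $^V;
for my $cp (@candidates) {
    printf "U+%04X  %s\n", $cp,
        $membership->{$cp} ? "Format" : "not Format";
}
```

Running this under each Perl version of interest shows which code points
would still need to be listed explicitly in a character class.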

"UNICODE" because the invisible crap ain't ANSI. :)

So my patch was going to be something to eliminate Format characters
from get_rendered_body_text_array() like:
--- lib/Mail/SpamAssassin/Message.pm    (revision 1844922)
+++ lib/Mail/SpamAssassin/Message.pm    (working copy)
@@ -1167,6 +1167,8 @@
  $text =~ s/\n+\s*\n+/\x00/gs;         # double newlines => null
# $text =~ tr/ \t\n\r\x0b\xa0/ /s;      # whitespace (incl. VT, NBSP) => space
# $text =~ tr/ \t\n\r\x0b/ /s;          # whitespace (incl. VT) => single space
+  # do not render zero-width Unicode characters used as obfuscation:
+  $text =~ s/[\p{Format}\N{U+200C}\N{U+2028}\N{U+2029}\N{U+061C}\N{U+180E}\N{U+2065}-\N{U+2069}]//gs;
  $text =~ s/\s+/ /gs;                  # Unicode whitespace => single space
  $text =~ tr/\x00/\n/;                 # null => newline

The problem with this approach is that the *presence* of such characters is a pretty strong spam sign, which stripping them from the rendered body would hide from body rules.

Potentially those tests could be moved to RAWBODY rules, though - I'll investigate that for the ZW rule.
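
To illustrate the trade-off, here is a minimal standalone sketch of keeping
the signal while still normalising the text: count the invisible characters
as they are removed. The helper strip_and_count_invisible is hypothetical
and is not SpamAssassin code; the explicit code points after \p{Format} are
an assumption intended to cover older Perls that do not yet class them as
Format.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of SpamAssassin: remove invisible
# characters but also report how many were removed, so the
# "obfuscation was present" signal is not lost.
sub strip_and_count_invisible {
    my ($text) = @_;
    # s///g returns the number of substitutions made, or an empty
    # string when nothing matched, so "|| 0" normalises to a number.
    my $count =
        ($text =~ s/[\p{Format}\x{061C}\x{180E}\x{2066}-\x{2069}]//g) || 0;
    return ($text, $count);
}

# Sample modelled on the content preview quoted earlier in the thread.
my $sample = "I a\x{200C}m a\x{200C}wa\x{200C}re";
my ($clean, $removed) = strip_and_count_invisible($sample);
print "$clean ($removed removed)\n";
```

A caller could then match ordinary body patterns against the cleaned text
and treat a non-zero count as a separate indicator.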

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  ...the Fates notice those who buy chainsaws...
                                              -- www.darwinawards.com
-----------------------------------------------------------------------
 Tomorrow: Halloween
