On Tue, 30 Oct 2018, Cedric Knight wrote:
I thought of submitting a patch via Bugzilla, but then decided to first
ask and check that I understood the general principles of body checks,
and SpamAssassin's current approach to Unicode. Apologies for the length
of this message. I hope the main points make sense.
A fair number of webcam bitcoin 'sextortion' scams have evaded detection
and worried recipients because of including relevant credentials.
(Incidentally, I assume the credentials and addresses are mostly from
the 2012 LinkedIn breach, but someone on the RIPE abuse list reports
Mailman passwords were also used). BITCOIN_SPAM_05 is catching some of
this spam, but on writing body regexes to catch the wave around 16
October, I noticed that my rules weren't matching because the source was
liberally injected with invisible characters:
Content preview: I a<U+200C>m a<U+200C>wa<U+200C>re blabla is one of
your pa<U+200C>ss. L<U+200C>ets g<U+200C>et strai<U+200C>ght
to<U+200C> po<U+200C>i<U+200C>nt. No<U+200C>t o<U+200C>n<U+200C>e
Would you send me a zipped copy? I would like to update the ZW text
obfuscation rule for that, and possibly others.
As minor points, 'Format' excludes a couple of separator characters in
the same range that instead match [:space:]
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:subhead=Format%20character:]
Then there is the C1 [:cntrl:] set, which some MUA's may render
silently, I think including the 0x9D matched by the recent
__UNICODE_OBFU_ZW (what's the significance of UNICODE in the rule name?):
https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Control:]
Finally, there may be a case for including as 'almost' invisible narrow
blanks like U+200A   U+202F and maybe U+205F. The Perl Unicode
database may not be completely up-to-date here, and Perl 5.18 doesn't
recognise U+61c, U+2066 and U+1BCA1 ranges as p\{Format}, although 5.24
does.
"UNICODE" because the invisible crap ain't ANSI. :)
So my patch was going to be something to eliminate Format characters
from get_rendered_body_text_array() like:
--- lib/Mail/SpamAssassin/Message.pm (revision 1844922)
+++ lib/Mail/SpamAssassin/Message.pm (working copy)
@@ -1167,6 +1167,8 @@
$text =~ s/\n+\s*\n+/\x00/gs; # double newlines => null
# $text =~ tr/ \t\n\r\x0b\xa0/ /s; # whitespace (incl. VT, NBSP) => space
# $text =~ tr/ \t\n\r\x0b/ /s; # whitespace (incl. VT) => single space
+ # do not render zero-width Unicode characters used as obfuscation:
+ $text =~
s/[\p{Format}\N{U+200C}\N{U+2028}\N{U+2029}\N{U+061C}\N{U+180E}\N{U+2065}-\N{U+2069}]//gs;
$text =~ s/\s+/ /gs; # Unicode whitespace => single space
$text =~ tr/\x00/\n/; # null => newline
The problem with this approach is the *presence* of such characters is a
pretty strong spam sign.
Potentially those tests could be moved to RAWBODY rules, though - I'll
investigate that for the ZW rule.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
...the Fates notice those who buy chainsaws...
-- www.darwinawards.com
-----------------------------------------------------------------------
Tomorrow: Halloween