On Wed, 5 Dec 2018, Grant Taylor wrote:
> On 12/05/2018 02:45 PM, John Hardin wrote:
>> I've added a "too many [ascii][unicode][ascii]" rule based on that, but I
>> suspect it will be FP-prone and will get pretty large if we want to avoid
>> whack-a-mole syndrome. For this, normalize + Bayes is probably the best bet.
> Is it possible to detect when a Unicode code point is being used in place
> of an ASCII / ANSI character specifically to avoid pattern detection? That
> is, multiple Unicode code points that represent, or are otherwise a
> stand-in for, an ASCII / ANSI "a"?
Take a look at replace_rules in the repo (both standard and sandboxes).
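
For reference, here's a minimal sketch in that style (the tag names, the
look-alike character sets and the target word are purely illustrative, not
the shipped rules, and it assumes normalize_charset is on so the body is
decoded before the regexes run):

  loadplugin Mail::SpamAssassin::Plugin::ReplaceTags

  replace_start <
  replace_end   >

  # Each tag maps to the ASCII letter plus a few look-alike code points.
  # O: Latin o, digit zero, Cyrillic o (U+043E), Greek omicron (U+03BF)
  # I: Latin i/l, digit one, Cyrillic i (U+0456), Greek iota (U+03B9)
  replace_tag O [oO0\x{043E}\x{03BF}]
  replace_tag I [iIl1\x{0456}\x{03B9}]

  body          BITCOIN_OBFU  /\bb<I>tc<O><I>n\b/i
  describe      BITCOIN_OBFU  Obfuscated "bitcoin" using look-alike characters
  replace_rules BITCOIN_OBFU

The stock rules carry much more complete per-letter character sets; the
point here is only the shape of the mechanism.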
> Or is keeping up with this list tantamount to whack-a-mole?
The Unicode replacements are fairly stable; it's looking for specific
obfuscated words (like "bitcoin") that's whack-a-mole.
> I would think that too high a percentage of Unicode, when bog-standard
> ASCII / ANSI would suffice, would be an indication in and of itself. I'm
> not seeing how legitimate (non-spam) email would trigger a false positive
> if the threshold was tuned correctly.
The problem there is that it's really strongly biased towards English text.
Spanish and French text, for example, would be mostly ASCII, but would also
have a fairly high proportion of accented characters.
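
As a quick back-of-the-envelope illustration of that bias (plain Python,
nothing to do with SA internals; the sample strings are made up):

  # Fraction of characters outside the ASCII range. A legitimate French
  # sentence easily outscores a message containing a single swapped-in
  # Cyrillic homoglyph, so any percentage threshold tight enough to catch
  # the spam also flags the French.
  def non_ascii_ratio(text: str) -> float:
      if not text:
          return 0.0
      return sum(1 for ch in text if ord(ch) > 127) / len(text)

  samples = {
      "english":    "Send bitcoin to this address today",
      "obfuscated": "Send bitc\u043ein to this address today",  # Cyrillic 'o'
      "french":     "Veuillez vérifier la pièce jointe dès réception",
  }

  for name, text in samples.items():
      print(f"{name:12s} {non_ascii_ratio(text):.1%}")
  # prints roughly 0.0%, 2.9% and 8.5% respectively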
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
The problem is when people look at Yahoo, slashdot, or groklaw and
jump from obvious and correct observations like "Oh my God, this
place is teeming with utter morons" to incorrect conclusions like
"there's nothing of value here". -- Al Petrofsky, in Y! SCOX
-----------------------------------------------------------------------
2 days until The 77th anniversary of Pearl Harbor