On Nov 9, 2018, at 8:49 AM, John Hardin <jhar...@impsec.org> wrote:
> 
>> rawbody   HTML_ENC_ASCII   /(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}/i
> 
> I'll add that too so that we can compare the results.

Per my reply a few minutes ago, I think this will be too restrictive.  While 
the current batch may rely on pure ASCII encoding, it's only a matter of time 
until they start to throw Unicode lookalikes in there.  I don't think there's 
any legitimate reason for a long string of encoded chars, so using either of 
the two rules I proposed yesterday would catch ALL HTML-encoded characters 
(across the full Unicode range, not just ASCII).
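
To illustrate the concern, here is a minimal Python sketch (not part of any 
rule set) that ports the quoted HTML_ENC_ASCII regex to Python's re module and 
runs it against two made-up samples.  The helper name and the sample strings 
are my own assumptions, and I am assuming Perl's /i maps to re.IGNORECASE.  
A single encoded lookalike in the middle of the run is enough to break the 
ten-in-a-row match:

```python
import re

# The quoted HTML_ENC_ASCII pattern, ported to Python's re syntax.
# Assumption: Perl's /i flag maps to re.IGNORECASE; the body is unchanged.
HTML_ENC_ASCII = re.compile(
    r"(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}",
    re.IGNORECASE,
)


def encode_entities(text):
    """Hypothetical helper: encode every character as a decimal HTML entity."""
    return "".join("&#%d;" % ord(ch) for ch in text)


# Eleven characters, all below code point 128.
ascii_run = encode_entities("Hello World")

# Same text with the first "o" swapped for Cyrillic U+043E, a lookalike.
lookalike_run = encode_entities("Hell\u043e World")

print(bool(HTML_ENC_ASCII.search(ascii_run)))      # True
print(bool(HTML_ENC_ASCII.search(lookalike_run)))  # False
```

The second sample still consists entirely of encoded characters, but the 
entity &#1086; falls outside the 0-127 range, so the longest matching run is 
only six entities long.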

> Early results (not all corpora are in yet) look *very* promising:
> 3% of spam, S/O .958 and almost all spam hits are <5 points.

Cool!  Though it looks like results are slightly down now, later in the day... 
only ~1% of spam and S/O 0.931.  Looks like it does hit a few hams, and on a 
few corpora, hits ONLY ham.

I'd be interested to know if there's a performance difference between my two 
proposed rules.  I suspect the second should run (slightly) faster.  I think 
they'll both catch exactly the same number of spams (barring case sensitivity, 
where the first rule needs to be corrected), and I don't foresee a significant 
FP danger on the second rule despite its relative generality.

> I think we have a winner. Thanks, Amir (and possibly RW)!

My pleasure. Please keep us posted on which version of the two rules performs 
best.

What's the recommendation on score?  Or meta rules?

What would be the timeline to distribute the rule via sa-update?

Cheers!

-- Amir
