On Nov 9, 2018, at 8:49 AM, John Hardin <jhar...@impsec.org> wrote:
>
>> rawbody HTML_ENC_ASCII
>> /(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}/i
>
> I'll add that too so that we can compare the results.

Per my reply a few minutes ago, I think this will be too restrictive. While the current batch may rely on pure ASCII encoding, it's only a matter of time until they start to throw Unicode lookalikes in there. I don't think there's any legitimate reason for a long string of encoded chars, so using either of the two rules I proposed yesterday would catch ALL HTML-encoded characters (in the full UTF-16 set).

> Early results (not all corpora are in yet) look *very* promising:
> 3% of spam, S/O .958 and almost all spam hits are <5 points.

Cool! Though it looks like results are slightly down now, later in the day: only ~1% of spam and S/O 0.931. Looks like it does hit a few hams, and on a few corpora, hits ONLY ham.

I'd be interested to know if there's a performance difference between my two proposed rules. I suspect the second should run (slightly) faster. I think they'll both catch exactly the same number of spams (barring case sensitivity, where the first rule needs to be corrected), and I don't foresee a significant FP danger on the second rule despite its relative generality.

> I think we have a winner. Thanks, Amir (and possibly RW)!

My pleasure. Please keep us posted on which version of the two rules performs best. What's the recommendation on score? Or meta rules? What would be the timeline to distribute the rule via sa-update?

Cheers!

--
Amir
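
To illustrate the "too restrictive" concern, here is a rough Python sketch that ports the quoted HTML_ENC_ASCII pattern and runs it against a string with a couple of Cyrillic lookalikes mixed in. The ANY_CODEPOINT pattern and the two helper functions are hypothetical, made up purely for this comparison; they are not the rules proposed earlier in the thread, and Python's regex engine is only an approximation of how SpamAssassin would evaluate a rawbody rule.

```python
import re

# Python port of the quoted HTML_ENC_ASCII pattern. The "\#" escape is only
# needed in SpamAssassin config files, where "#" starts a comment.
ASCII_ONLY = re.compile(
    r"(?:&#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}",
    re.IGNORECASE | re.ASCII,
)

# Hypothetical broader pattern (an assumption for illustration, NOT one of
# the rules from this thread): any decimal or hex numeric reference.
ANY_CODEPOINT = re.compile(
    r"(?:&#(?:\d{1,7}|x[0-9a-f]{1,6})\s*;\s*){10}",
    re.IGNORECASE | re.ASCII,
)


def encode_decimal(text):
    """Encode every character as a decimal numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def encode_hex(text):
    """Encode every character as an uppercase hex numeric character reference."""
    return "".join(f"&#x{ord(ch):X};" for ch in text)


plain = encode_decimal("Hello World")
# Same text with both Latin "o" characters swapped for Cyrillic U+043E.
lookalike = encode_decimal("Hell\u043e W\u043erld")

for label, sample in (("plain", plain), ("lookalike", lookalike)):
    print(
        label,
        "ascii_only:", bool(ASCII_ONLY.search(sample)),
        "any_codepoint:", bool(ANY_CODEPOINT.search(sample)),
    )
```

In this sketch the lookalike sample never contains ten consecutive ASCII-range references, so the ASCII-only pattern stops matching while the broader one still does. Whether a broader pattern is acceptable in practice depends on the ham hits seen in the corpora results.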