On Dec 1, 2018, at 10:31 AM, John Hardin <jhar...@impsec.org> wrote:
>
>> On Thu, 29 Nov 2018, Amir Caspi wrote:
>>
>>> A) Could you sandbox the proposed rule change (AC_HTML_ENTITY_BONANZA_NEW)
>>> and see how it performs, including possible FPs?
>
> Done.

Any preliminary results?

Looks like we have a couple of other HTML-related things that need to be added. See spample: https://pastebin.com/Few8fVfF

1) Looks like &nbsp; is now being used instead of regular spaces to join some highly spammy words. Are these turned into "regular" spaces by the HTML interpreter prior to body rules? Or do they get turned into non-breaking space characters, which are different from regular spaces? Like all the ZW stuff, this seems like it should get "normalized" so it is available in both raw and normalized form for Bayes to pick up...

2) This particular spample has its "Bayes poison" text within a div with line-height:0, but there does not appear to be a rule to capture this. That same div uses font-size:1px, so I would have thought this would trigger a "tiny fonts" rule, but apparently not. It would seem our tiny-font and/or other "trying to make this invisible" rules should be updated to capture these attempts. I also saw another spample which had opacity:0 set on its "Bayes poison" text, but the "low contrast" rule didn't pop.

Cheers.

--- Amir
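
P.S. To make points (1) and (2) concrete, here is a rough Python sketch of the kind of checks being asked about. It is not SpamAssassin code and says nothing about how the existing rules work: the function names, the pattern names, and the "below 3px" cutoff for a tiny font are all assumptions made up for illustration, and it only looks at inline style attributes.

```python
import re

# Hypothetical patterns for inline-style declarations that hide text.
# The cutoff for "tiny" (anything below 3px) is an assumption, not an existing rule.
HIDING_PATTERNS = {
    "line_height_zero": re.compile(
        r"line-height\s*:\s*0(?:\.0+)?(?:px|em|%)?\s*(?:;|$)", re.I),
    "tiny_font": re.compile(
        r"font-size\s*:\s*[0-2](?:\.\d+)?px\s*(?:;|$)", re.I),
    "opacity_zero": re.compile(
        r"opacity\s*:\s*0(?:\.0+)?\s*(?:;|$)", re.I),
}

# Matches style="..." or style='...' and captures the declaration text.
STYLE_ATTR = re.compile(r"""style\s*=\s*(["'])(.*?)\1""", re.I | re.S)


def hidden_text_signals(html):
    """Return the sorted names of hiding tricks found in inline styles."""
    found = set()
    for match in STYLE_ATTR.finditer(html):
        style = match.group(2)
        for name, pattern in HIDING_PATTERNS.items():
            if pattern.search(style):
                found.add(name)
    return sorted(found)


def normalize_nbsp(text):
    """Replace the &nbsp; entity and the U+00A0 character with a plain space."""
    return text.replace("&nbsp;", " ").replace("\u00a0", " ")


if __name__ == "__main__":
    sample = '<div style="line-height:0;font-size:1px">poison text</div>'
    print(hidden_text_signals(sample))
    print(normalize_nbsp("cheap&nbsp;meds"))
```

A real rule would presumably have to deal with stylesheets, other units, and nested elements as well; the point here is only that all three tricks (line-height:0, font-size:1px, opacity:0) are easy to spot in the raw markup, and that the non-breaking spaces can be folded to plain spaces before tokenizing.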