On Dec 1, 2018, at 10:31 AM, John Hardin <jhar...@impsec.org> wrote:
> 
>> On Thu, 29 Nov 2018, Amir Caspi wrote:
>> 
>>> A) Could you sandbox the proposed rule change (AC_HTML_ENTITY_BONANZA_NEW) 
>>> and see how it performs, including possible FPs?
> 
> Done.

Any preliminary results?

Looks like we have a couple other HTML-related things that need to be added.  
See spample:
https://pastebin.com/Few8fVfF

1) Looks like &nbsp; is now being used instead of regular spaces to join some 
highly spammy words.  Are these turned into "regular" spaces by the HTML 
interpreter prior to body rules?  Or do they get turned into non-breaking space 
characters, which are different from regular spaces?  Like all the ZW stuff, 
this seems like it should get "normalized" so it can be available both in raw 
and normal form for Bayes to pick up...

2) This particular spample has its "Bayes poison" text within a div with 
line-height:0, but there does not appear to be a rule to capture this.  That 
same div uses font-size:1px, so I would have thought this would trigger a "tiny 
fonts" rule, but apparently not.
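
To illustrate the concern in point 1, here is a rough Python sketch. It is not SpamAssassin code and says nothing about what SA's HTML renderer actually does. It only shows that a generic entity decoder turns &nbsp; into U+00A0, which does not compare equal to an ASCII space, and what a "normalized" form might look like. The name normalize_spaces is made up for this example.

```python
import html

def normalize_spaces(raw):
    """Decode HTML entities, then map non-breaking spaces to plain spaces.

    Hypothetical helper for illustration only.
    """
    decoded = html.unescape(raw)           # "&nbsp;" becomes U+00A0
    return decoded.replace("\u00a0", " ")  # U+00A0 becomes an ASCII space

sample = "cheap&nbsp;meds"
decoded = html.unescape(sample)            # raw form: still no ASCII space
normalized = normalize_spaces(sample)      # normal form: words split on " "
```

A rule or tokenizer that only looks for an ASCII space between the two words would miss the decoded form but match the normalized one, which is why having both available seems useful.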

It would seem our tiny font and/or other "trying to make this invisible" rules 
should be updated to capture these attempts.

I also saw another spample which had opacity:0 set on its "Bayes poison" text, 
but the "low contrast" rule didn't pop.
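
As a rough sketch of the kind of check I mean (Python, not an actual SpamAssassin rule; the function name, the labels, and the "1px/1pt or smaller" threshold are all my own assumptions), this inspects an inline style attribute for the three tricks mentioned above:

```python
import re

def hidden_text_tricks(style):
    """Return labels for 'make this invisible' tricks found in an inline style.

    Hypothetical helper; thresholds and labels are assumptions.
    """
    props = {}
    for decl in style.split(";"):
        name, sep, value = decl.partition(":")
        if sep:
            props[name.strip().lower()] = value.strip().lower()

    hits = []
    # line-height of exactly zero, with or without a unit
    if re.fullmatch(r"0(\.0+)?(px|pt|em|%)?", props.get("line-height", "")):
        hits.append("line-height:0")
    # font-size of 1px/1pt or smaller
    m = re.fullmatch(r"(\d*\.?\d+)(px|pt)", props.get("font-size", ""))
    if m and float(m.group(1)) <= 1:
        hits.append("tiny font")
    # fully transparent text
    if re.fullmatch(r"0(\.0+)?", props.get("opacity", "")):
        hits.append("opacity:0")
    return hits
```

This only covers inline style attributes; styles applied through a style block or class names would need real CSS handling.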

Cheers.

--- Amir
