On Nov 8, 2018, at 4:51 PM, RW <rwmailli...@googlemail.com> wrote:
>
> Unnecessary encoding is fairly common, but a long runs of ASCII
> characters encoded like this seems extreme.
Right, that was the question I asked in my email this morning: whether we have a rule to detect long sequences of HTML entities. It would seem not.

John, is that something we can test in a sandbox and see how it performs in masscheck? Proposed rule:

body      AC_HTML_ENTITY_BONANZA  /(?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}/
describe  AC_HTML_ENTITY_BONANZA  Long run of HTML-encoded characters
score     AC_HTML_ENTITY_BONANZA  0.001

This should catch decimal encoding, hex encoding, or named entities, and it allows the entities to be separated by variable-length whitespace (in case they use actual whitespace instead of encoded whitespace). It fires on a run of 20 or more consecutive entities.

If the regexp above is too complex, we could just match on the entity boundaries, restricting to allowable characters inside:

body      AC_HTML_ENTITY_BONANZA  /(?:&[A-Za-z0-9#]{2,};\s*){20}/

Either should work, I believe.

Cheers.

--- Amir
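For anyone who wants to poke at the two patterns before a masscheck run, here is a minimal Python sketch that applies both to a few made-up strings. It is only an illustration: the helper names (encode_decimal, encode_hex, classify) and the sample text are hypothetical, Python's re module stands in for Perl (the patterns use only constructs common to both), and it says nothing about what text SpamAssassin actually hands to a body rule.

```python
import re

# The two proposed patterns, copied verbatim from the rules above.
STRICT = re.compile(
    r"(?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}"
)
LOOSE = re.compile(r"(?:&[A-Za-z0-9#]{2,};\s*){20}")


def encode_decimal(text):
    """Encode every character as a decimal entity, e.g. 'P' -> '&#80;'."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def encode_hex(text, upper=True):
    """Encode every character as a hex entity, e.g. 'P' -> '&#x50;'."""
    fmt = "&#x{:02X};" if upper else "&#x{:02x};"
    return "".join(fmt.format(ord(ch)) for ch in text)


def classify(text):
    """Return (strict_hit, loose_hit) for one piece of text."""
    return (STRICT.search(text) is not None, LOOSE.search(text) is not None)


if __name__ == "__main__":
    phrase = "Please verify your account"  # 26 characters, made-up sample
    samples = {
        "decimal run": encode_decimal(phrase),
        "uppercase hex run": encode_hex(phrase),
        # Lowercase hex digits fall outside [0-9A-F] in the strict pattern.
        "lowercase hex run": encode_hex(phrase, upper=False),
        "entities split by spaces": " ".join(
            encode_decimal(ch) for ch in phrase
        ),
        "short run": encode_decimal("Hi there") + " plain text",
        "ordinary text": "Tom &amp; Jerry &copy; 2018",
    }
    for label, text in samples.items():
        strict_hit, loose_hit = classify(text)
        print(f"{label:26s} strict={strict_hit!s:5s} loose={loose_hit}")
```

One thing the sketch makes visible: with no /i modifier, the stricter pattern accepts only uppercase hex digits, so a run written as &#x6c; style entities is caught by the simpler pattern but not the stricter one. Whether that matters is something masscheck results would show better than a toy test.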