On Thu, 7 May 2020, RW wrote:

On Thu, 7 May 2020 11:39:07 -0700 (PDT)
John Hardin wrote:

100% 4-byte UTF8? That should be trivially easy to detect.

Comments solicited.

   body       __4BYTE_UTF8_WORD     /(?:\xf0\x9d[\x9a-\x9f][\x80-\xff]){3,10}/
   tflags     __4BYTE_UTF8_WORD     multiple, maxhits=10
   meta       SUSP_UTF8_WORD_MANY   __4BYTE_UTF8_WORD > 9
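
For anyone who wants to see how the hit counting and the "> 9" threshold interact, here is a rough Python sketch. It is only a simplified model: it assumes the whole body is scanned as one byte string and that "multiple" with maxhits=10 amounts to counting non-overlapping matches up to a cap of 10. The function names count_hits and susp_utf8_word_many are made up for illustration and are not part of SpamAssassin.

```python
import re

# Pattern copied from the body rule above; bytes, because it matches raw UTF-8.
WORD = re.compile(rb"(?:\xf0\x9d[\x9a-\x9f][\x80-\xff]){3,10}")
MAXHITS = 10  # mirrors maxhits=10 in the tflags line


def count_hits(body: bytes, maxhits: int = MAXHITS) -> int:
    """Count non-overlapping matches of WORD, stopping once maxhits is reached."""
    hits = 0
    for _ in WORD.finditer(body):
        hits += 1
        if hits >= maxhits:
            break
    return hits


def susp_utf8_word_many(body: bytes) -> bool:
    """Hypothetical stand-in for the meta rule: fires only when hits exceed 9."""
    return count_hits(body) > 9
```

With the cap at 10, "> 9" can only be true when the cap itself is reached, so the meta fires on the tenth matching word and not before.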

Potential FP for some languages because it's rather broad. It might
be possible to narrow it to just the 4-byte math glyphs that render
as readable English text.

Actually it's not broad enough to cover even the mathematical
letters.

This covers them all without any overlap:

 /(?:\xf0\x9d[\x90-\x9f][\x80-\xbf]){3,10}/

It does include digits and Greek letters (the mathematical versions).
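
The coverage claim can be checked mechanically. The Python sketch below encodes every code point that takes four bytes in UTF-8 (U+10000 to U+10FFFF) and records which ones a single-glyph version of each pattern matches. The {3,10} repeat is dropped so that one glyph is tested at a time, and the helper name matched is made up for illustration. The comparison target is the Unicode Mathematical Alphanumeric Symbols block, U+1D400 to U+1D7FF.

```python
import re

# Single-glyph versions of the two character classes discussed above.
REVISED_GLYPH = re.compile(rb"\xf0\x9d[\x90-\x9f][\x80-\xbf]")
ORIGINAL_GLYPH = re.compile(rb"\xf0\x9d[\x9a-\x9f][\x80-\xff]")

# Unicode block "Mathematical Alphanumeric Symbols".
BLOCK_FIRST, BLOCK_LAST = 0x1D400, 0x1D7FF


def matched(pattern, first=0x10000, last=0x10FFFF):
    """Return the code points whose UTF-8 encoding is fully matched by pattern."""
    return {
        cp
        for cp in range(first, last + 1)
        if pattern.fullmatch(chr(cp).encode("utf-8"))
    }


revised_hits = matched(REVISED_GLYPH)
original_hits = matched(ORIGINAL_GLYPH)
block = set(range(BLOCK_FIRST, BLOCK_LAST + 1))

# The revised class should match exactly the block; the original only its tail.
print(revised_hits == block)
print(f"original starts at U+{min(original_hits):X}")
```

The scan includes the handful of reserved code points inside the block, since the byte ranges cannot tell them apart from assigned ones.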

Updated, thanks.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Phobias should not be the basis for laws.
-----------------------------------------------------------------------
 Tomorrow: the 75th anniversary of VE day
