I received the MIME body part below, and I’m wondering whether it would make sense to filter on this as well.
Given that it’s text/plain with an implicit charset="us-ascii" and an implicit Content-Transfer-Encoding of 7bit, the sequence &#x[0-9A-F]{4} shouldn’t really parse into a 16-bit character, should it? It would take a broken MUA to make such a leap... Wouldn’t that normally render as the literal characters ‘&’, ‘#’, ‘x’, etc. rather than as the Unicode character with that hex value?

There might be times when someone has sent an attachment improperly encoded this way, with binary values embedded in it, but that’s kind of buggy anyway… If it holds arbitrary binary data, it should have been sent as base64 with application/octet-stream in the worst case.

I wouldn’t want a message where someone gives a couple of examples of encoding Ѐ, for instance, to be flagged as SPAM, but if 20% or more of the text consists of these sequences then I would say that’s a SPAM sign.

Anyway, here’s the body I saw:

--1388-8200-b67c-e579-9c27-df36-12fa-a2eb
Content-Type: text/plain;

Thе Rеаl RеаѕоnThе Ꮯоmіng Ꮯоllарѕе...Thе rеаl rеаѕоn ᎳHY HоmеlаndSеcurіtу rеcеntlу рurchаѕеd1.7 Bіllіоn Rоundѕ оf аmmunіtіоn...Ꮃhаt Yоu Muѕt Dо Tо Ꭼnѕurе YоurSаfеtуHоmеlаnd ѕеcurіtу іѕ thеrе tо ѕеcurеthе hоmеlаnd оnlу... Sо thеѕе Ьullеtѕаrе rеаlу mеаnt fоr thеThіѕ іѕ аn еmаіlаdvеrtіѕеmеnt thаt wаѕ ѕеnt tо уоu Ьу Ρаtrіоt Survіvаl Ρlаn. If уоuwіѕh tо nоlоngеr rеcеіvе mеѕѕаgеѕ thаt рrоmоtе ѕurvіvаl tірѕ, рlеаѕеclіck hеrе tо unѕuЬѕcrіЬе.4 Unstable as water, thou shalt not excel because thou wentest up to thy fathers bed then defiledst thou it he went up to my couch.34 And Pharaohnechoh made Eliakim the son of Josiah king in the room of Josiah his father, and turned his name to Jehoiakim, and took Jehoahaz away and he came to Egypt, and died there.37 And the thing was good in the eyes of Pharaoh, and in the eyes o! f all his servants.

--1388-8200-b67c-e579-9c27-df36-12fa-a2eb
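
For what it’s worth, the 20% idea above could be sketched roughly like this in Python. This is only an illustration, not an existing filter rule: the names entity_ratio and looks_like_entity_spam are made up, the ratio is assumed to be measured over the raw (undecoded) characters of the part, and the pattern is loosened to accept lowercase hex digits and an optional trailing semicolon.

```python
import re

# Matches numeric character references such as &#x0435 or &#x0435;
# Assumption: lowercase hex and an optional trailing ';' are accepted,
# which is slightly looser than the &#x[0-9A-F]{4} pattern in the post.
ENTITY_RE = re.compile(r"&#x[0-9A-Fa-f]{4};?")


def entity_ratio(body: str) -> float:
    """Return the fraction of raw characters in body that belong to &#xHHHH sequences."""
    if not body:
        return 0.0
    entity_chars = sum(len(m.group(0)) for m in ENTITY_RE.finditer(body))
    return entity_chars / len(body)


def looks_like_entity_spam(body: str, threshold: float = 0.20) -> bool:
    """Flag a text/plain body when at least `threshold` of it is entity sequences."""
    return entity_ratio(body) >= threshold
```

Measuring over raw characters means a message that merely mentions one or two such sequences in a longer explanation stays well under the threshold, while a body written almost entirely in them goes far over it. Whether 20% is the right cut-off would need checking against real mail.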