On Sun, 02 Sep 2018 00:02:45 -0400 Bill Cole wrote:

> On 1 Sep 2018, at 18:22 (-0400), David B Funk wrote:
>
> > On the other-hand, if you want to decode the subject line and then
> > pattern-match against all the possible UTF-8 emojies, you're going
> > to end up with a rather unwieldy rule.
>
> SA "header" rules match against decoded headers, not the Base64 or QP
> encoded text.
>
> In principle, modern Perl can match against named Unicode
> "properties" so in principle it should be possible to have a rule
> something like:
>
> header EMOJI_IN_SUBJ Subject =~ '/[\p{Miscellaneous Symbols and
> Pictographs}\p{Emoticons}\p{Ornamental Dingbats}]/'
>
> HOWEVER: this does not work in current SA. I have not dissected
> exactly why but I know there have been many problems in handling
> Unicode in the SA code.

My understanding is that for historic reasons there is heavy use of
'use bytes', so SpamAssassin sees text as a series of bytes in whatever
character set it's written in. normalize_charset allows text to be
converted to UTF-8, which makes it easier to match byte sequences, but
a single byte can't be an emoticon etc., so the \p{...} classes have
nothing to match.

> Universal Unicode support is a defining goal
> of SA 4.0.0, so maybe this would be possible in the svn 'trunk'
> codebase where a lot of that work is done.

That's going to be very disruptive. Does anyone know if anything is
being done to mitigate it? Controlling it per rule with tflags would be
useful.
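
To illustrate the byte-sequence point, here is a rough sketch in Python (not SA code, and not tested against SA itself). It assumes the subject has already been normalised to UTF-8 and is matched as raw bytes. The three blocks named in the quoted rule happen to be contiguous, U+1F300 to U+1F67F, and in UTF-8 that range is F0 9F 8C 80 through F0 9F 99 BF, so a single four-byte pattern covers it. The names EMOJI_BYTES and has_emoji are made up for the example.

```python
import re

# Assumption: the text is UTF-8 and is matched byte by byte.
# U+1F300..U+1F67F spans Miscellaneous Symbols and Pictographs,
# Emoticons and Ornamental Dingbats. Encoded as UTF-8 the range is
# F0 9F 8C 80 .. F0 9F 99 BF, which the pattern below covers exactly.
EMOJI_BYTES = re.compile(rb"\xF0\x9F[\x8C-\x99][\x80-\xBF]")


def has_emoji(subject: str) -> bool:
    """Return True if the UTF-8 bytes of subject contain a code point
    in U+1F300..U+1F67F (hypothetical helper for illustration)."""
    return EMOJI_BYTES.search(subject.encode("utf-8")) is not None


if __name__ == "__main__":
    print(has_emoji("Big sale \U0001F600"))
    print(has_emoji("Plain old subject"))
```

The same byte pattern could presumably be used as the regex of a header rule, but that is an assumption on my part. It is also limited to those three blocks; emoji elsewhere (U+2600 and up, U+1F680 and up, and so on) would each need their own byte ranges, which is where the rule starts to get unwieldy.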