On 17/05/2021 18:12, Henrik K wrote:
On Mon, May 17, 2021 at 03:02:57PM +0200, Marco wrote:
So I have to add the accented character literally.
I can't understand why. Are there any limitations in the HashBL plugin with UTF-8?
Maybe I have misunderstood something.
SA doesn't support UTF8 regex. It's just matching plain byte strings.
It also depends on the normalize_charset setting; for best compatibility you should
match both latin and utf-8 raw byte variants: ü -> (?:\xfc|\xc3\xbc)
https://cwiki.apache.org/confluence/display/SPAMASSASSIN/WritingRulesAdvanced
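
For readers following the thread, here is a minimal sketch of how the "both raw byte variants" alternation suggested above can be generated for any accented character. It is written in Python purely for illustration; `byte_variants` is a hypothetical helper, not part of SpamAssassin, and it assumes the Latin variant is ISO-8859-1.

```python
import re


def byte_variants(ch: str) -> str:
    """Build a regex alternation matching ch as Latin-1 or UTF-8 raw bytes."""

    def esc(raw: bytes) -> str:
        # Render every byte as a \xNN escape, e.g. b'\xc3\xbc' -> \xc3\xbc
        return "".join(f"\\x{byte:02x}" for byte in raw)

    latin1 = ch.encode("latin-1")  # single byte for characters in Latin-1
    utf8 = ch.encode("utf-8")      # two bytes for the same characters
    return f"(?:{esc(latin1)}|{esc(utf8)})"


# Reproduces the alternation quoted above for "ü".
pattern_u = byte_variants("ü")

# The same helper applied to "à"; the result matches either raw encoding.
pattern_a = re.compile(byte_variants("à").encode("ascii"))
matches_utf8 = pattern_a.fullmatch("à".encode("utf-8")) is not None
matches_latin1 = pattern_a.fullmatch("à".encode("latin-1")) is not None
```

The generated string can be pasted into a rule in place of the literal character, so the rule file itself stays pure ASCII.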
Hello Henrik,
thank you for the hints. I didn't realize that SA doesn't support
UTF-8 regexes. As you suggest, I would like to write encoding-independent
rules in order to avoid surprises. I tried, but it doesn't work.
I have normalize_charset 1.
My text body is "Ciao, è proprio eccoci là si fa\nciao"
With
([\d\S\x{00E0}\x{c3a0}\x{00E8}\x{c3a8}\x{00EC}\x{c3ac}\x{00F2}\x{c3b2}\x{00F9}\x{c3b9}\x{00C0}\x{c380}\x{00C8}\x{c388}\x{00CC}\x{c38c}\x{00D2}\x{c392}\x{00D9}\x{c399}]+)
I see:
dbg: HashBL: __HASHBL_III_SPAM3: matches found: 'ciao,', 'è', 'proprio',
'eccoci', 'l▒', 'si', 'fa', 'ciao'
'là' seems to have been badly encoded as 'l▒', so the hash doesn't match.
If I write the characters literally:
([\d\Sàèìòù]+)
I see:
dbg: HashBL: __HASHBL_III_SPAM3: matches found: 'ciao,', 'è', 'proprio',
'eccoci', 'là', 'si', 'fa', 'ciao'
Now 'là' is encoded correctly and the hash matches.
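
A possible explanation for the difference, sketched in Python and not confirmed anywhere in this thread: an escape such as \x{c3a0} names the single code point U+C3A0, not the two bytes 0xC3 0xA0 that make up "à" in UTF-8. The second half of the sketch is an assumption about the matching behaviour: if the body is seen as raw bytes while \S follows Unicode whitespace rules, the byte 0xA0 counts as a no-break space and cuts the word after 0xC3.

```python
import re
import unicodedata

# "à" encoded as UTF-8 is the two bytes 0xC3 0xA0.
utf8_bytes = "à".encode("utf-8")

# \x{c3a0} in a regex refers to the single code point U+C3A0 instead,
# which is an unrelated Hangul syllable rather than the byte pair above.
code_point_name = unicodedata.name(chr(0xC3A0))

# Assumption: the engine sees each raw UTF-8 byte as one character
# (modelled here by decoding the bytes as Latin-1) and applies
# Unicode whitespace rules to \S.
raw_view = "là si".encode("utf-8").decode("latin-1")

# 0xA0 is NO-BREAK SPACE, so \S+ stops right after the 0xC3 byte.
tokens = re.findall(r"\S+", raw_view)
```

Under these assumptions the first token is "l" followed by a lone 0xC3 byte, which would be consistent with the 'l▒' shown in the debug output, while "è" (0xC3 0xA8) contains no whitespace byte and survives intact.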
Thank you very much
Kind Regards
Marco