Re: [Assp-test] soft hyphen fooling Bayesian analysis

2022-09-06 Thread K Post
Eager to see what you come up with in terms of ignoring the soft hyphen. Your <<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>> regex is clear, and I understand using that for scoring purposes, but I'm looking for a way to potentially treat look-alike characters as the Latin character for Bayesian purp
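
As an illustration of the look-alike idea raised above, here is a minimal Python sketch that folds a few Cyrillic homoglyphs to Latin before tokenizing. It is not ASSP code: the `CONFUSABLES` table and the `fold_mixed_tokens` helper are hypothetical, and the table is a small hand-picked sample. Python's built-in `re` module does not support `\p{Cyrillic}`, so the script test is approximated with Unicode character names.

```python
import re
import unicodedata

# Hypothetical, hand-picked sample of Cyrillic letters that resemble Latin ones.
CONFUSABLES = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
})


def is_cyrillic(ch: str) -> bool:
    # Approximates \p{Cyrillic} using the Unicode character name.
    return "CYRILLIC" in unicodedata.name(ch, "")


def is_latin(ch: str) -> bool:
    return "LATIN" in unicodedata.name(ch, "")


def fold_mixed_tokens(text: str) -> str:
    """Fold look-alikes only in words that mix Cyrillic and Latin letters."""
    def fold(match: re.Match) -> str:
        token = match.group(0)
        mixed = any(is_cyrillic(c) for c in token) and any(is_latin(c) for c in token)
        return token.translate(CONFUSABLES) if mixed else token

    return re.sub(r"\w+", fold, text)
```

Folding only mixed-script words is a design choice in this sketch: genuine Russian text passes through unchanged, while a word such as "p\u0430yp\u0430l" becomes plain "paypal" for the classifier.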

Re: [Assp-test] soft hyphen fooling Bayesian analysis

2022-09-06 Thread Thomas Eckardt
>HTML::strip html parsing to get text parts has nothing to do with html de(en)coding >iso-8859-1 ASSP processes all content as UTF-8 >­ ASSP is aware of this - and replaces soft-hyphens with hard-hyphens - and multiple consecutive hard-hyphens with a single one However - the option to remo
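
The normalization described in this reply (soft hyphens replaced with hard hyphens, then runs of hard hyphens collapsed to one) can be sketched in Python as follows. This is an illustration of the described behavior, not ASSP's actual implementation; the function names are hypothetical. `strip_soft_hyphens` shows the alternative the thread asks about, ignoring the soft hyphen entirely.

```python
import re

SOFT_HYPHEN = "\u00ad"  # U+00AD, normally invisible when rendered


def normalize_hyphens(text: str) -> str:
    """Replace soft hyphens with '-' and collapse runs of '-' into one."""
    text = text.replace(SOFT_HYPHEN, "-")
    return re.sub(r"-{2,}", "-", text)


def strip_soft_hyphens(text: str) -> str:
    """Alternative: drop soft hyphens so the obfuscated word is rejoined."""
    return text.replace(SOFT_HYPHEN, "")
```

With the first function, "in\u00ad\u00advo\u00adice" becomes "in-vo-ice", which still differs from the token "invoice"; the second yields "invoice" directly.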

[Assp-test] soft hyphen fooling Bayesian analysis

2022-09-06 Thread K Post
Is there a way to improve the way that ASSP parses certain special, non-printing characters? I'm having trouble with spam emails that have their body heavily obfuscated with "soft hyphens" slipping through. They all seem to have multipart bodies, first with an iso-8859-1 text part with *=AD* int
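
To make the problem concrete, here is a small Python sketch of how a quoted-printable *=AD* in an ISO-8859-1 text part turns into a soft hyphen (U+00AD) and splits words for a word-based tokenizer. The sample body and the helper names are made up for illustration, and the `\w+` tokenizer is only a stand-in for whatever tokenization the Bayesian filter really uses.

```python
import quopri
import re


def decode_qp_latin1(raw: bytes) -> str:
    """Decode a quoted-printable body that is declared as ISO-8859-1."""
    return quopri.decodestring(raw).decode("iso-8859-1")


def tokens(text: str) -> list[str]:
    # Stand-in tokenizer: U+00AD is not a word character, so it splits words.
    return re.findall(r"\w+", text)


# Hypothetical obfuscated fragment; =AD is byte 0xAD, the Latin-1 soft hyphen.
sample = b"Cl=ADick he=ADre"
decoded = decode_qp_latin1(sample)

obfuscated_tokens = tokens(decoded)
clean_tokens = tokens(decoded.replace("\u00ad", ""))
```

Here `obfuscated_tokens` contains the fragments "Cl", "ick", "he", "re", while `clean_tokens` contains "Click" and "here", which is why the obfuscated words never match what the filter has learned.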