Eager to see what you come up with in terms of ignoring the soft hyphen.
Your <<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>> regex is clear, and I
understand using that for scoring purposes, but I'm looking for a way to
potentially treat look-alike characters as their Latin equivalents for Bayesian
purp
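
To make the look-alike idea concrete, here is a minimal Python sketch (not ASSP code, which is Perl). It folds a handful of Cyrillic letters onto the Latin letters they resemble and deletes the soft hyphen (U+00AD) before the text would be tokenized. The mapping table and the name fold_for_bayes are my own assumptions for illustration; a real table would be much larger.

```python
# Hypothetical, deliberately tiny homoglyph table (Cyrillic -> Latin).
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}

# str.maketrans turns the dict into an ordinal-keyed translation table.
TABLE = str.maketrans(CONFUSABLES)
TABLE[0x00AD] = None  # mapping to None deletes the soft hyphen


def fold_for_bayes(text: str) -> str:
    """Return text with look-alikes folded to Latin and soft hyphens removed."""
    return text.translate(TABLE)


print(fold_for_bayes("p\u0430y\u00adp\u0430l"))
```

Folding like this would also mangle genuine Cyrillic words, so it probably only makes sense on tokens that mix scripts, which is what the regex above already detects.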
>HTML::strip
Parsing HTML to extract the text parts has nothing to do with HTML decoding or encoding.
>iso-8859-1
ASSP processes all content as UTF-8
>
ASSP is aware of this: it replaces soft hyphens with hard hyphens, and
collapses multiple consecutive hard hyphens into a single one.
However - the option to remo
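
For readers following along, the replacement described above can be sketched in a few lines of Python. This only illustrates the described behavior; it is not taken from ASSP's source, and the function name normalize_hyphens is an assumption.

```python
import re


def normalize_hyphens(text: str) -> str:
    """Replace soft hyphens with '-' and collapse runs of '-' into one."""
    # Step 1: soft hyphen (U+00AD) becomes an ordinary hyphen-minus.
    text = text.replace("\u00ad", "-")
    # Step 2: two or more consecutive hyphens collapse to a single one.
    return re.sub(r"-{2,}", "-", text)


print(normalize_hyphens("vi\u00ad\u00adag\u00adra"))
```

Note that after this step the obfuscated word still contains hyphens, so it will not match the plain word in a Bayesian corpus.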
Is there a way to improve how ASSP parses certain special, non-printing
characters? I'm having trouble with spam emails slipping through whose
bodies are heavily obfuscated with "soft hyphens". They all seem to have
multipart bodies, first with an iso-8859-1 text part with
*=AD* int
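
For context on the *=AD* sequences: in a quoted-printable part, =AD is the encoded byte 0xAD, and in ISO-8859-1 that byte is the soft hyphen (U+00AD). A small Python sketch using the standard quopri module shows the decoding. The sample bytes and the name decode_qp_latin1 are made up for illustration, not taken from a real message.

```python
import quopri


def decode_qp_latin1(raw: bytes) -> str:
    """Decode a quoted-printable body declared as ISO-8859-1."""
    return quopri.decodestring(raw).decode("iso-8859-1")


# Made-up sample: each =AD decodes to one soft hyphen.
body = decode_qp_latin1(b"che=ADap me=ADds")
print(body.count("\u00ad"))
```

Once decoded, the soft hyphens are ordinary characters in the string and can be stripped or replaced before any scoring happens.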