On 30 Nov 2018, at 17:49, Amir Caspi wrote:
> On Nov 30, 2018, at 7:00 AM, Bill Cole
> <sausers-20150...@billmail.scconsult.com> wrote:
>>> Since HTML is already getting rendered to text, then perhaps the
>>> conversion code should strip (literally, just delete) any zero-width
>>> characters during this conversion? That should make normal body
>>> rules, and Bayes, function properly, no?
>> Not if they are *looking for* those characters.
> But AFAIK we're only looking for those characters with rawbody rules,

Not so.

> because it's really hard to search for them in regular body rules...
> no?

No.

See the relevant rule cluster (all with 'ZW' in their names) in KAM.cf
and __UNICODE_OBFU_ZW in the standard ruleset.
Also see my more generic (but still useful!) __SCC_SHORT_WORDS and
derivatives in KAM.cf: it is a body rule that takes advantage of the
fact that zero-width typographical control characters create logical
word breaks as far as Perl is concerned.
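
To illustrate the word-break effect, here is a minimal sketch in Python rather than Perl. It is not the actual __SCC_SHORT_WORDS rule, whose definition is not reproduced here: the pattern, the sample text, and the function name are all hypothetical, and it assumes Python's `re` treats U+200B as a non-word character the way the description above says Perl does.

```python
import re

# U+200B ZERO WIDTH SPACE: invisible when rendered, but not a word character.
ZWSP = "\u200b"

# Hypothetical pattern (an assumption, not the KAM.cf rule):
# a run of one to three word characters with a word boundary on each side.
SHORT_WORD = re.compile(r"\b\w{1,3}\b")


def short_words(text):
    """Return every 1-3 character 'word' the regex engine sees in text."""
    return SHORT_WORD.findall(text)


# Made-up sample strings, not taken from real mail.
clean = "viagra is cheap"
obfuscated = f"vi{ZWSP}ag{ZWSP}ra is che{ZWSP}ap"

print(short_words(clean))       # only "is" qualifies as a short word
print(short_words(obfuscated))  # every fragment between ZWSPs now qualifies
```

Both strings look the same to a human reader, but the obfuscated one breaks into many very short "words", which is the kind of signal a body rule can count. Stripping the zero-width characters before body rules run would erase that signal.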
--
Bill Cole