Re: Canonicalizing text parts to UTF-8 before applying body rules

David F. Skoll Wed, 30 May 2012 08:41:32 -0700

On Wed, 30 May 2012 08:26:44 -0700
jdow <j...@earthlink.net> wrote:

> I'm idly wondering what effect this would have on the time to scan a
> single email.


The actual conversion from the original encoding to UTF-8 is very fast.
Internally, Perl uses pretty fast C code to convert between character
encodings.
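
For illustration only, here is a minimal sketch of that conversion step
using the core Encode module. The helper name to_utf8 and the sample
input are hypothetical; this is not code from any particular scanner.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: take raw body bytes plus the charset declared in
# the MIME headers and return the same text as UTF-8 bytes.
sub to_utf8 {
    my ($bytes, $charset) = @_;
    # decode() turns bytes in $charset into a Perl character string.
    # With the default CHECK value, malformed input is replaced rather
    # than fatal; an unknown charset name makes decode() die.
    my $text = decode($charset, $bytes);
    # encode() serialises the character string as UTF-8 bytes.
    return encode('UTF-8', $text);
}

# "cafe" with an e-acute in ISO-8859-1: the accented letter is byte 0xE9.
my $utf8 = to_utf8("caf\xE9", 'ISO-8859-1');
```

In a real scanner the charset comes from untrusted headers, so the call
would likely be wrapped in eval with a fallback for unknown names.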

As for Unicode regexes, I think they're pretty efficient in Perl.  We
added UTF-8 support to our Bayes tokenizer and we use some pretty
hairy regexes to pick out tokens (handling CJK glyphs is interesting).
Performance seems decent enough.
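
As a toy illustration of the kind of Unicode-aware regex involved (far
simpler than a production tokenizer, and not the one described above),
the sketch below assumes the text has already been decoded to a Perl
character string. Treating each Han character as its own token is just
one simple assumption; the name tokenize is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: each Han ideograph becomes its own token,
# while runs of other letters or digits are kept together.
sub tokenize {
    my ($text) = @_;
    # In list context, /g with one capture group returns every match.
    return $text =~ /(\p{Han}|(?:(?!\p{Han})[\p{L}\p{N}])+)/g;
}

# \x{4E2D} and \x{6587} are two Han characters with no space between them.
my @tokens = tokenize("spam \x{4E2D}\x{6587} 123");
```

Here @tokens holds four entries: "spam", the two Han characters
separately, and "123".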

Regards,

David.
