On Wed, 30 May 2012 08:26:44 -0700
jdow <j...@earthlink.net> wrote:

> I'm idly wondering what effect this would have on the time to scan a
> single email.

The actual conversion from the original encoding to UTF-8 is very fast. Internally, Perl uses fast C code to convert between character encodings.

As for Unicode regexes, I think they're reasonably efficient in Perl. We added UTF-8 support to our Bayes tokenizer, and we use some pretty hairy regexes to pick out tokens (handling CJK glyphs is interesting). Performance seems decent enough.

Regards,

David.
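
For anyone who wants to measure this on their own hardware, here is a rough sketch of how the two steps could be timed with the core Encode and Time::HiRes modules. The sample body, the ISO-8859-1 charset, and the tokenize() regex are assumptions made up for illustration; this is not the Bayes tokenizer mentioned above, and the timings will vary by machine and Perl version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use Time::HiRes qw(time);

# Hypothetical sample: a short ISO-8859-1 line repeated to roughly the
# size of a mid-sized email body (raw bytes, not yet decoded).
my $latin1_body = "Caf\xE9 cr\xE8me: gagnez \xE0 tous les coups!\n" x 2000;

# Convert raw bytes in the given charset to Perl character strings.
sub to_characters {
    my ($bytes, $charset) = @_;
    return decode($charset, $bytes);
}

# Illustrative tokenizer (an assumption, not the real one): each Han
# ideograph is its own token; any other run of letters/digits is a token.
sub tokenize {
    my ($text) = @_;
    return $text =~ /(\p{Han}|(?:(?!\p{Han})[\p{L}\p{N}])+)/g;
}

# Time the charset conversion.
my $start       = time;
my $text        = to_characters($latin1_body, 'ISO-8859-1');
my $decode_secs = time - $start;

# Time the Unicode-aware regex pass.
$start = time;
my @tokens     = tokenize($text);
my $token_secs = time - $start;

printf "decoded %d chars in %.4fs, found %d tokens in %.4fs\n",
    length($text), $decode_secs, scalar(@tokens), $token_secs;
```

Swapping in a real message body and its declared charset should give a more realistic picture than this synthetic sample.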