Re: regex & utf8

Dr.Ruud Sat, 12 May 2007 02:09:27 -0700

Tom Allison wrote:

> Under perl version 5.8, does /(\w+)/ match UTF-8 characters without
> calling any special pragma?


Yes, but only if your data is proper. Keep in mind that every ASCII
character is also a UTF-8 character (U+0000 .. U+007F).


> So I'm trying to see if I can just use /(\w+)/ without worrying about
> all this character encoding?

Only if your data is proper. A file is just a string of bytes. If you
apply the right I/O layer (for example :encoding(UTF-8)) while reading
the file, then you'll end up with proper data (a string of characters,
not of bytes) to work with.
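
As a rough sketch of the above (not part of the original question): the
script below writes a small UTF-8 encoded sample file and reads it back
through an encoding layer before matching. The sample text and the
temporary file are assumptions made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 encoded sample file (hypothetical content).
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':raw');
print {$out} "caf\xC3\xA9 ole\n";    # "cafe" with e-acute, as raw UTF-8 bytes
close $out or die "close: $!";

# Read it back through an encoding layer, so that $line holds
# characters rather than bytes.
open(my $in, '<:encoding(UTF-8)', $path) or die "open $path: $!";
my $line = <$in>;
close $in;

# \w+ now sees the accented letter as one word character.
my ($word) = $line =~ /(\w+)/;
printf "matched a word of %d characters\n", length $word;
```

Without the layer the same file would come in as bytes, and the
two-byte sequence for the accented letter would not be treated as a
single word character.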

A UTF-8 encoded file can't tell you that it is UTF-8 encoded. For
example, a UTF-8 BOM at the start (as Windows Notepad writes) is not
proof. So you need to know beforehand.
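
A minimal sketch of what that means in practice, assuming you already
know the input is UTF-8: you decode it yourself, and a BOM, if present,
simply shows up as the character U+FEFF at the start, which you can
strip. The sample bytes are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes of a hypothetical file: UTF-8 BOM (EF BB BF), then "naive"
# spelled with i-diaeresis (C3 AF).
my $bytes = "\xEF\xBB\xBFna\xC3\xAFve";

# The encoding is our decision, made beforehand; the BOM is only a hint.
my $text = decode('UTF-8', $bytes);

# After decoding, the BOM is the single character U+FEFF; drop it.
$text =~ s/\A\x{FEFF}//;

my ($word) = $text =~ /(\w+)/;
printf "matched a word of %d characters\n", length $word;
```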

-- 
Affijn, Ruud

"Gewoon is een tijger."


