regex & utf8

Tom Allison Fri, 11 May 2007 15:54:38 -0700

OK, I'm reading through different Unicode-related perldocs and have a rather simple question.

Under perl version 5.8, does /(\w+)/ match UTF-8 characters without calling any special pragma? I'm having a hard time finding something that makes the statement that clearly.

I'm trying to parse out email content and it seems reasonable that I could get characters in just about any conceivable format, from ascii, latin, utf...

For simplicity I'm leaning in the direction of just converting everything "up" to UTF8 and doing all my string/regex manipulations on UTF.

So I'm trying to see if I can just use /(\w+)/ without worrying about all this character encoding?

Or do I have to first Encode everything into UTF8?

And if so, before I Encode it, do I have to figure out what it is first and then convert it from whatever encoding it is to UTF8?
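
To make the question concrete, here is a minimal sketch of the "decode first, then match" approach described above. It assumes the charset of the input is already known (for example from the message's Content-Type header) and happens to be UTF-8; the sample bytes and variable names are hypothetical. How \w treats undecoded bytes can also depend on locale settings and pragmas, so this shows only the default behavior.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: raw bytes as they might arrive in a mail body.
# \xC3\xA9 is the UTF-8 encoding of e-acute, so this is "cafe" with an accent.
my $bytes = "caf\xC3\xA9 au lait";

# Undecoded, Perl treats the string as bytes, and by default \w
# only matches ASCII word characters in such a string.
my ($raw_word) = $bytes =~ /(\w+)/;

# Decoding turns the bytes into a character string; \w then
# follows Unicode rules and matches the accented letter too.
my $text = decode('UTF-8', $bytes);
my ($word) = $text =~ /(\w+)/;

printf "bytes:   matched %d characters\n", length $raw_word;
printf "decoded: matched %d characters\n", length $word;
```

For input in another charset, the first argument to decode would be that charset's name (for example 'iso-8859-1'); the match itself stays the same.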

For simplicity, it isn't necessarily a requirement that I can parse the content into perfectly accurate words, but the results have to be completely repeatable and preferably fast.


help?

