Re: regex & utf8

Chas Owens Fri, 11 May 2007 16:06:58 -0700

On 5/11/07, Tom Allison <[EMAIL PROTECTED]> wrote:

OK, I'm reading through different Unicode-related perldocs and have a
rather simple question.


Under perl version 5.8, does /(\w+)/ match UTF-8 characters without
calling any special pragma?  I'm having a hard time finding something
that makes the statement that clearly.

I'm trying to parse out email content and it seems reasonable that I
could get characters in just about any conceivable format, from
ASCII, Latin, UTF...

For simplicity I'm leaning in the direction of just converting everything
"up" to UTF-8 and doing all my string/regex manipulations on UTF-8.

So I'm trying to see if I can just use /(\w+)/ without worrying about
all this character encoding?
Or do I have to first Encode everything into UTF8?
And if so, before I Encode it, do I have to figure out what it is
first and then convert it from whatever encoding it is to UTF8?

For simplicity, it isn't necessarily a requirement that I can parse
the content into perfectly accurate words, but the results have to be
completely repeatable and preferably fast.

help?


from perldoc perlunicode
snip
      Input and Output Layers
          Perl knows when a filehandle uses Perl's internal Unicode encodings
          (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened
          with the ":utf8" layer.  Other encodings can be converted to Perl's
          encoding on input or from Perl's encoding on output by use of the
          ":encoding(...)"  layer.  See open.

          To indicate that Perl source itself is using a particular encoding,
          see encoding.
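
A minimal sketch of the layer mechanism described above, not taken from
the perldoc itself. It assumes the input is known to be UTF-8 and uses
in-memory filehandles (a reference to a scalar in place of a filename) so
it runs without touching disk; the variable names are only illustrative.

```perl
use strict;
use warnings;

# UTF-8 octets for "caf" followed by e-acute and a newline (6 octets).
# Assumption: the data really is UTF-8; a real program must know this.
my $octets = "caf\xC3\xA9\n";

# Reading through :encoding(UTF-8) decodes octets into characters.
open(my $in, '<:encoding(UTF-8)', \$octets) or die "open for read: $!";
my $line = <$in>;
close $in;
chomp $line;

# $line now holds 4 characters, although the word took 5 octets on input.
print length($line), "\n";

# Writing through the same layer encodes characters back into octets.
my $out_bytes = '';
open(my $out, '>:encoding(UTF-8)', \$out_bytes) or die "open for write: $!";
print {$out} $line;
close $out;

# The encoded copy is 5 octets long again.
print length($out_bytes), "\n";
```

With a real file, the same layer string goes in the second argument of
open and the filename in the third.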

      Regular Expressions
          The regular expression compiler produces polymorphic opcodes.  That
          is, the pattern adapts to the data and automatically switches to
          the Unicode character scheme when presented with Unicode data--or
          instead uses a traditional byte scheme when presented with byte
          data.
snip
      ·   Character classes in regular expressions match characters instead
          of bytes and match against the character properties specified in
          the Unicode properties database.  "\w" can be used to match a
          Japanese ideograph, for instance.

          (However, and as a limitation of the current implementation, using
          "\w" or "\W" inside a "[...]" character class will still match with
          byte semantics.)
snip
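
To make the byte-versus-character distinction concrete, here is a small
sketch (mine, not from the perldoc). It assumes you already know the
incoming octets are UTF-8, for example from the message's charset header,
and it uses the core Encode module to decode them before matching. The
sample string and variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 octets for the two words "naive cafe" with i-diaeresis and e-acute,
# as they might arrive from a file or socket (12 octets, 10 characters).
my $octets = "na\xC3\xAFve caf\xC3\xA9";

# Assumption: the source encoding is UTF-8. decode() needs to be told the
# encoding; it does not guess it.
my $chars = decode('UTF-8', $octets);

# Matching against undecoded octets uses byte semantics. Under Perl's
# default (non-locale) rules the high octets are not \w, so the words are
# broken apart at the accented letters.
my @byte_words = $octets =~ /(\w+)/g;

# Matching against the decoded string uses character semantics, so the
# accented letters count as word characters.
my @char_words = $chars =~ /(\w+)/g;

print scalar(@byte_words), " pieces from octets\n";
print scalar(@char_words), " words from characters\n";
```

So /(\w+)/ itself needs no pragma, but the data has to be decoded into
Perl's character strings first, and for that the original encoding has to
be known.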

