OK, I sorted out what the deal is with charsets, Encode, utf8 and
other goodies.
Now I have something where I'm just not sure exactly how it is
supposed to operate.
I have a string:
=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=
That is a MIME encoded-word: a base64-encoded string of iso-2022-jp
characters. After I decode_base64 the payload and
decode('iso-2022-jp', $text) the result, I can print out something
that looks exactly like Japanese characters.
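
For reference, the decoding steps described above can be sketched roughly like this. It is a minimal sketch, not necessarily the exact code used: the variable names ($header, $charset, $payload, $octets, $text) and the regex that picks the encoded-word apart are assumptions made for illustration. Note that Encode's decode() takes the encoding name first and the octets second.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode);

# The encoded-word from the message above.
my $header = '=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=';

# Split the encoded-word into its charset and base64 payload.
# (Illustrative regex; it only handles a single B-encoded word.)
my ($charset, $payload) = $header =~ /^=\?([^?]+)\?B\?([^?]+)\?=$/i
    or die "not a B-encoded word\n";

# base64 -> raw iso-2022-jp octets -> Perl character string
my $octets = decode_base64($payload);
my $text   = decode($charset, $octets);

# Encode on the way out so the terminal receives UTF-8 bytes.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

Encode also ships a 'MIME-Header' encoding, so decode('MIME-Header', $header) is another way to handle the whole encoded-word in one call.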
But you can't usefully match /(\w+)/ on them. The Japanese part is
apparently one "word" without spaces in it.
Um... I don't know Japanese. But I guess this string of spaghetti
(to me) is actually a language where every character shown in a
Unicode terminal counts as a 'word' character according to the Perl
definition, so the whole run looks like one word...
In English, this would pick apart words in a sense that is simple for
me and many on this list to understand.
I guess my question is, for CJK languages, should I expect a regex
like \w+ to pick up entire strings of text instead of discrete words
as it does for Latin-based languages?
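
A small experiment along these lines shows what \w+ does with the decoded string. This is a sketch under assumptions: the variable names are made up, and the script-run split at the end is only a rough heuristic added for comparison, not real Japanese word segmentation. Kana and kanji are letters as far as Unicode is concerned, so \w matches them, and \w+ only stops at characters such as spaces or punctuation.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode);

# Decode the payload of the encoded-word from the message above.
my $text = decode('iso-2022-jp',
    decode_base64('Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC'));

# \w+ collects maximal runs of word characters.
my @runs = $text =~ /(\w+)/g;
printf "%d runs of word characters\n", scalar @runs;
printf "run lengths: %s\n", join(', ', map { length } @runs);

# Rough heuristic (an assumption, not linguistic segmentation):
# break the text wherever the script changes, e.g. kanji to hiragana.
my @chunks = $text =~ /(\p{Han}+|\p{Hiragana}+|\p{Katakana}+|[A-Za-z0-9_]+)/g;
printf "%d script runs\n", scalar @chunks;
```

On this string, \w+ should yield two runs: the ASCII "FW" and then the entire Japanese phrase as a single match, since the text has no spaces between words. Splitting it into actual words would need a dictionary-based segmenter rather than a character class.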