OK, I sorted out what the deal is with charsets, Encode, utf8 and other goodies.

Now I have something where I'm just not sure how it is supposed to operate.

I have a string:
=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=
That is a MIME encoded-word: a Base64-encoded string of iso-2022-jp characters.

After I decode_base64 the payload and then run decode('iso-2022-jp', $text) on the result, I can print out something that looks exactly like Japanese characters.
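
For reference, here is roughly what that decoding step looks like as a standalone script. It is only a sketch: the variable names and the regex that splits the encoded-word are made up for illustration, and it only handles a single B (Base64) encoded-word like the one above.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode);

my $header = '=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=';

# Split the encoded-word into charset, encoding flag and payload.
# (Illustrative pattern, not a full RFC 2047 parser.)
my ($charset, $enc, $payload) = $header =~ /^=\?([^?]+)\?([BbQq])\?([^?]*)\?=$/
    or die "not a MIME encoded-word\n";
die "only B (Base64) is handled in this sketch\n" unless uc $enc eq 'B';

# Base64 -> raw iso-2022-jp bytes -> Perl character string.
my $bytes = decode_base64($payload);
my $text  = decode($charset, $bytes);

# Encode back to UTF-8 on output so a Unicode terminal shows the characters.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

Note that Encode's decode() takes the encoding name first and the byte string second.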

But you can't pick individual words out of them with /(\w+)/. The Japanese part is apparently one "word" without spaces in it. Um... I don't know Japanese. But I guess this string of spaghetti (to me) is actually a language where what is shown as one run of characters in a Unicode terminal is actually one 'word' according to the Perl definition of a word...
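
To make the behaviour concrete, here is a small sketch that decodes the same subject and collects every /(\w+)/ match. The variable names are mine. As far as I can tell it should report two matches for this subject: the ASCII "FW" in front, and then the whole Japanese run as a single match, since every character in it counts as \w and there is no whitespace or punctuation in between.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode);

# Base64 payload taken from the encoded-word above.
my $payload = 'Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC';
my $text    = decode('iso-2022-jp', decode_base64($payload));

# Collect every run of "word" characters in the decoded character string.
my @words = $text =~ /(\w+)/g;

binmode STDOUT, ':encoding(UTF-8)';
printf "%d match(es)\n", scalar @words;
printf "  [%s] (%d characters)\n", $_, length $_ for @words;
```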

In English, this regex would pick apart words in a way that is simple for me and many on this list to understand.

I guess my question is: for CJK languages, should I expect a regex like \w+ to pick up entire strings of text instead of discrete words, as it does in Latin-based languages?


