still working with utf8

Tom Allison Thu, 21 Jun 2007 19:43:06 -0700

OK, I sorted out what the deal is with charsets, Encode, utf8 and other goodies.

Now I have something I'm just not sure exactly how it is supposed to operate.


I have a string:
=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=
That is a MIME::Base64 encoded string of iso-2022-jp characters.

After I decode_base64 them and decode('iso-2022-jp', $text) them, I can print out something that looks exactly like Japanese characters.
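The two decoding steps described above can be sketched roughly as follows. This is only a minimal illustration, not necessarily the exact code used: the regex that splits the encoded-word into charset, encoding and payload, and the variable names, are assumptions made for the example. Note that Encode's decode() takes the encoding name first and the octets second.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode encode);

# The encoded-word from the message, in the form =?charset?encoding?payload?=
my $header = '=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=';

# Hypothetical helper regex: pull the three parts out of the encoded-word.
my ($charset, $enc, $payload) = $header =~ /^=\?([^?]+)\?([BbQq])\?([^?]*)\?=$/
    or die "not a MIME encoded-word\n";

# Step 1: Base64 -> raw iso-2022-jp octets (still bytes, not characters).
my $octets = decode_base64($payload);

# Step 2: octets -> Perl character string.
my $text = decode($charset, $octets);

# Encode explicitly on output so the terminal receives UTF-8 bytes.
print encode('UTF-8', $text), "\n";
```

Encode also ships a 'MIME-Header' encoding, so decode('MIME-Header', $header) should handle the whole encoded-word in one call; the two-step version is shown only because it mirrors the steps described here.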

But matching /(\w+)/ on them doesn't split anything. It's apparently one "word" without spaces in it.

Um... I don't know Japanese. But I guess this string of spaghetti (to me) is actually a language where each character, as displayed in a Unicode terminal, counts as a word character according to the Perl definition of a word, so the whole run matches as one 'word'...

In English, this would pick apart words in a sense that is simple for me and many on this list to understand.

I guess my question is, for CJK languages, should I expect the notion of using a regex like \w+ to pick up entire strings of text instead of discrete words, as it does for Latin-based languages?
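A small sketch that reproduces the behaviour in question, assuming the string has been decoded into Perl characters as described above. Kanji and kana are Unicode letters, so \w matches each of them, and with no spaces in between the whole Japanese run comes back as a single match. The second regex is only an illustrative assumption: it breaks the text where the script changes, which is a crude approximation and not real word segmentation.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode encode);

# Payload of the encoded-word from the message, decoded to characters.
my $text = decode('iso-2022-jp',
    decode_base64('Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC'));

# \w+ stops only at non-word characters such as ':' and ' '.
my @words = $text =~ /(\w+)/g;
printf "%d matches for \\w+\n", scalar @words;
print encode('UTF-8', "[$_]"), "\n" for @words;

# Crude alternative (an assumption, not true tokenization):
# split into runs of a single script.
my @runs = $text =~ /(\p{Han}+|\p{Hiragana}+|\p{Katakana}+|[A-Za-z0-9]+)/g;
printf "%d script runs\n", scalar @runs;
print encode('UTF-8', "[$_]"), "\n" for @runs;
```

Splitting on script boundaries will still cut in the wrong places for many phrases; finding actual Japanese words generally needs a dictionary-based tokenizer, which is outside what a regex can do.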



