On Thu, Jul 12, 2012 at 3:25 PM, Manfred Lotz <manfred.l...@arcor.de> wrote:
> This is really nice. I fumbled with unpack before but have to admit
> that I didn't know about 'use bytes' which is the key.

A couple of interesting links. unpack in painful detail:
http://www.perlmonks.org/?node_id=224666

and utf-8 and "use bytes" info:
http://perldoc.perl.org/perluniintro.html
(search for 'use bytes' and look around)

Because "use bytes;" stays in effect until the end of the enclosing block
(or the end of the file, if it is not inside a block), you want to finish
with "no bytes;".

How Do I Know Whether My String Is In Unicode?

You shouldn't have to care. But you may if your Perl is before 5.14.0 or
you haven't specified use feature 'unicode_strings' or use 5.012 (or
higher), because otherwise the semantics of the code points in the range
128 to 255 are different depending on whether the string they are
contained within is in Unicode or not. (See "When Unicode Does Not
Happen" in perlunicode.)

To determine if a string is in Unicode, use:

    print utf8::is_utf8($string) ? 1 : 0, "\n";

But note that this doesn't mean that any of the characters in the string
are necessarily UTF-8 encoded, or that any of the characters have code
points greater than 0xFF (255) or even 0x80 (128), or that the string has
any characters at all. All is_utf8() does is return the value of the
internal "utf8ness" flag attached to $string. If the flag is off, the
bytes in the scalar are interpreted as a single-byte encoding. If the
flag is on, the bytes in the scalar are interpreted as the
(variable-length, potentially multi-byte) UTF-8 encoded code points of
the characters.

Bytes added to a UTF-8 encoded string are automatically upgraded to
UTF-8. If mixed non-UTF-8 and UTF-8 scalars are merged (double-quoted
interpolation, explicit concatenation, or printf/sprintf parameter
substitution), the result will be UTF-8 encoded as if copies of the byte
strings were upgraded to UTF-8. For example:

    $a = "ab\x80c";
    $b = "\x{100}";
    print "$a = $b\n";

The output string will be the UTF-8 encoded ab\x80c = \x{100}\n, but $a
will stay byte-encoded.

Sometimes you might really need to know the byte length of a string
instead of the character length.
For that use either the Encode::encode_utf8() function or the bytes
pragma and the length() function:

    my $unicode = chr(0x100);
    print length($unicode), "\n";    # will print 1
    require Encode;
    print length(Encode::encode_utf8($unicode)), "\n";    # will print 2
    use bytes;
    print length($unicode), "\n";    # will also print 2
                                     # (the 0xC4 0x80 of the UTF-8)
    no bytes;

-- 

a

Andy Bach,
afb...@gmail.com
608 658-1890 cell
608 261-5738 wk