Re: Unicode question

Andy Bach Thu, 12 Jul 2012 13:41:28 -0700

On Thu, Jul 12, 2012 at 3:25 PM, Manfred Lotz <manfred.l...@arcor.de> wrote:
> This is really nice. I fumbled with unpack before but have to admit
> that I didn't know about 'use bytes' which is the key.


A couple of interesting links. unpack in painful detail:
http://www.perlmonks.org/?node_id=224666

and utf-8 and "use bytes" info:
http://perldoc.perl.org/perluniintro.html  (search for 'use bytes' and
look around)

because
use bytes;

is lexically scoped (it stays in effect to the end of the enclosing
block or file), you want to finish w/
no bytes;

From perluniintro:

How Do I Know Whether My String Is In Unicode?

You shouldn't have to care. But you may if your Perl is before 5.14.0
or you haven't specified use feature 'unicode_strings' or use 5.012
(or higher) because otherwise the semantics of the code points in the
range 128 to 255 are different depending on whether the string they
are contained within is in Unicode or not. (See When Unicode Does Not
Happen in perlunicode.)
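
A rough sketch of what "different semantics" means here (my own
illustration, not part of perluniintro; the variable names are made up).
It assumes a plain script with unicode_strings switched off and no
locale in effect, and compares uc() on the same character stored both
ways:

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # assumption: force the older default semantics

# Hypothetical demo data: one character, code point 0xE9 (e-acute in Latin-1).
my $native   = "\xE9";          # stored as a single byte, UTF-8 flag off
my $upgraded = "\xE9";
utf8::upgrade($upgraded);       # same character, stored as UTF-8, flag on

# Character for character the two strings are equal...
my $same = ($native eq $upgraded) ? 1 : 0;

# ...but only the upgraded copy is case-mapped with Unicode rules.
my $uc_native   = uc $native;   # stays 0xE9
my $uc_upgraded = uc $upgraded; # becomes 0xC9

printf "equal: %d\n", $same;
printf "uc(native)   = 0x%X\n", ord $uc_native;
printf "uc(upgraded) = 0x%X\n", ord $uc_upgraded;
```

With use feature 'unicode_strings' (or use 5.012 and up) in scope, both
calls should give 0xC9, which is why you then no longer have to care.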

To determine if a string is in Unicode, use:

    print utf8::is_utf8($string) ? 1 : 0, "\n";

But note that this doesn't mean that any of the characters in the
string are necessarily UTF-8 encoded, or that any of the characters have
code points greater than 0xFF (255) or even 0x80 (128), or that the
string has any characters at all. All is_utf8() does is return
the value of the internal "utf8ness" flag attached to the $string. If
the flag is off, the bytes in the scalar are interpreted as a single
byte encoding. If the flag is on, the bytes in the scalar are
interpreted as the (variable-length, potentially multi-byte) UTF-8
encoded code points of the characters. Bytes added to a UTF-8 encoded
string are automatically upgraded to UTF-8. If mixed non-UTF-8 and
UTF-8 scalars are merged (double-quoted interpolation, explicit
concatenation, or printf/sprintf parameter substitution), the result
will be UTF-8 encoded as if copies of the byte strings were upgraded
to UTF-8: for example,

    $a = "ab\x80c";
    $b = "\x{100}";
    print "$a = $b\n";

the output string will be the UTF-8-encoded ab\x80c = \x{100}\n, but $a
will stay byte-encoded.
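
To watch the flag behave as described, here is a small sketch (again my
own, with hypothetical variable names rather than the $a and $b above).
It prints the flag and the character length for a byte string, a wide
string, their interpolation, and a pure-ASCII string that was upgraded
by hand:

```perl
use strict;
use warnings;

my $bytes = "ab\x80c";          # all code points <= 0xFF, flag off
my $wide  = "\x{100}";          # code point > 0xFF, flag on
my $mixed = "$bytes = $wide";   # interpolation upgrades a copy of $bytes

my $ascii = "abc";
utf8::upgrade($ascii);          # flag on, yet every character is plain ASCII

for my $pair ([bytes => $bytes], [wide => $wide],
              [mixed => $mixed], [ascii => $ascii]) {
    my ($name, $str) = @$pair;
    printf "%-5s flag=%d chars=%d\n",
        $name, utf8::is_utf8($str) ? 1 : 0, length $str;
}
```

The "ascii" row is the point of the warning above: the flag is on even
though no character is above 0x7F, and $bytes keeps its flag off after
being interpolated.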

Sometimes you might really need to know the byte length of a string
instead of the character length. For that use either the
Encode::encode_utf8() function or the bytes pragma and the length()
function:

    my $unicode = chr(0x100);
    print length($unicode), "\n"; # will print 1
    require Encode;
    print length(Encode::encode_utf8($unicode)),"\n"; # will print 2
    use bytes;
    print length($unicode), "\n"; # will also print 2
    # (the 0xC4 0x80 of the UTF-8)
    no bytes;
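
Since the bytes pragma is lexically scoped, another way to keep it
contained is to put it in a bare block instead of pairing it with no
bytes. A minimal sketch of that variant (the variable names other than
$unicode are mine):

```perl
use strict;
use warnings;
use Encode ();

my $unicode = chr(0x100);

my $chars  = length $unicode;                       # 1 character
my $octets = length Encode::encode_utf8($unicode);  # 2 bytes once encoded

my $internal;
{
    use bytes;                  # in effect only inside this block
    $internal = length $unicode;                    # 2 (0xC4 0x80)
}
my $after = length $unicode;    # back to character semantics: 1

print "chars=$chars octets=$octets internal=$internal after=$after\n";
```

Of the two, Encode::encode_utf8() is probably the safer habit, because
it measures the encoding you name rather than whatever perl happens to
hold internally.
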
-- 

a

Andy Bach,
afb...@gmail.com
608 658-1890 cell
608 261-5738 wk

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/
