Hi,

> length() returns the length in characters, which
> for ASCII is also the number of bytes. To get
> the bits, just multiply by 8.

> If you are using a Unicode character set
> instead, I'm not too sure what will be returned,
> or how you can convert it to bits.

Unicode can get pretty hairy, and the number of
bytes per character does vary depending on your
encoding. UTF-8, the de facto standard nowadays,
is a variable-length encoding -- a character takes
between 1 and 4 bytes (ASCII characters take 1;
the original design allowed sequences of up to 6).
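
Rather than trusting memory on the byte counts,
here is a minimal sketch that measures them. It
assumes the core Encode module is available; the
helper name utf8_bytes and the sample code points
are just mine, picked for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Number of bytes a character string occupies once encoded as UTF-8.
# (utf8_bytes is a hypothetical helper name, not part of any module.)
sub utf8_bytes {
    my ($text) = @_;
    return length( encode( 'UTF-8', $text ) );
}

# Sample characters given as code points, so this file stays pure ASCII.
for my $cp ( 0x41, 0xE9, 0x20AC, 0x1D11E ) {
    my $bytes = utf8_bytes( chr $cp );
    printf "U+%04X: %d byte(s), %d bits\n", $cp, $bytes, $bytes * 8;
}
```

For those four code points it reports 1, 2, 3 and
4 bytes respectively, so the "multiply by 8" trick
only works once you have the byte count, not the
character count.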

I was curious about trying this out, so I modified
a crufty little script I had hanging around. The
bottom line was that length returns characters
here too, just Unicode characters rather than
bytes. Combining characters count as separate
characters.
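
To make the combining-character point concrete,
here is a small sketch. The string is my own
example ("e" plus COMBINING ACUTE ACCENT), and
grapheme_count is a hypothetical helper that uses
the regex \X escape to count what a reader would
call one character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count user-perceived characters (grapheme clusters) via \X.
# (grapheme_count is a made-up name for this sketch.)
sub grapheme_count {
    my ($text) = @_;
    my $count = () = $text =~ /\X/g;
    return $count;
}

# "e" followed by U+0301 COMBINING ACUTE ACCENT: one glyph on screen,
# but two code points as far as length() is concerned.
my $text = "e\x{301}";

print "length:    ", length($text),         "\n";    # prints 2
print "graphemes: ", grapheme_count($text), "\n";    # prints 1
```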

Anyway, I run the script (pasted further down)
like this:

    $ perl describechars in.utf8 > out.utf8

Then you can view out.utf8 with an editor that
can grok whatever language you happen to be
dealing with.

Sure enough, Perl counts characters, not bytes,
with Unicode text (as long as the input has
actually been decoded as UTF-8).
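
A short sketch of the difference, assuming the
core Encode module. The byte string is my own
example: the three-byte UTF-8 encoding of U+20AC
(EURO SIGN).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The raw UTF-8 bytes for U+20AC, as they would arrive from a file
# that was read without any decoding layer.
my $bytes = "\xE2\x82\xAC";

# After decoding, Perl treats the data as one character.
my $chars = decode( 'UTF-8', $bytes );

print "before decode: ", length($bytes), "\n";    # prints 3
print "after decode:  ", length($chars), "\n";    # prints 1
```

So if length seems to be reporting bytes, the
likely culprit is input that was never decoded.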

More scintillating details at:

   http://www.perldoc.com/perl5.8.0/pod/perlunicode.html

If you have the module Unicode::CharName (it's on
CPAN; as far as I know it isn't core), you can
try out my goofy script:


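#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::CharName qw(uname);

# Write output as UTF-8 so the characters survive the trip to out.utf8.
binmode STDOUT, ':encoding(UTF-8)';

while ( my $line = <> ) {
    chomp $line;

    # Decode the raw bytes so length() and split see characters, not bytes.
    $line = decode( 'UTF-8', $line );
    $line =~ s/^\s+//;
    $line =~ s/\s+$//;

    my @chars = split //, $line;

    print "~-" x 15, "\n";
    print "$line\n";                      # the line
    print join( ' + ', @chars ), "\n";    # the individual chars
    print "length is: ", length($line), "\n";

    for my $char (@chars) {
        my $name = uname( ord $char );    # uname gives the Unicode name
        $name = '<no name>' unless defined $name;    # guard: name may be missing

        # Show the code point in hex. (hex() goes the other way, from a
        # hex string to a number, so sprintf-style %X is what we want.)
        printf "[ %s ]\t%s\tU+%04X\n", $char, $name, ord $char;
    }
}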
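
If Unicode::CharName isn't installed, a rough
alternative is the charnames pragma that ships
with Perl, whose charnames::viacode function maps
a code point to its name. This is only a sketch of
the lookup part, not a drop-in replacement for the
script; char_name is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ':full';

# Look up the Unicode name of a single character using core charnames.
# (char_name is a made-up helper for this sketch.)
sub char_name {
    my ($char) = @_;
    my $name = charnames::viacode( ord $char );
    return defined $name ? $name : '<no name>';
}

print char_name("A"), "\n";             # prints LATIN CAPITAL LETTER A
print char_name( chr 0x20AC ), "\n";    # prints EURO SIGN
```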
