On 1/3/07, Octavian Rasnita <[EMAIL PROTECTED]> wrote:
From: "Jay Savage" <[EMAIL PROTECTED]>
> Try to unpack the data--or a chunk of data you feel is large enough to
> be representative--with the pattern U0U*. If the unpack succeeds with
> no warnings, you have valid utf8. You could try the same thing with
> Encode's 'decode_utf8' routine. See perluniintro for details. In both
> cases, though, you need to make sure that you've grabbed well-formed
> utf8 from the source file in the first place. If the data cuts off in
> the middle of a multi-byte character, you'll get an error.

I have tried verifying the entire string, using the following:

my $result = unpack("U0U*", $content);
print $result;

Well, it gave no errors whether or not the string was UTF-8, but an
interesting thing is that the result printed was always 65279 if the string
was UTF-8 and 112 or 116 if the string was not UTF-8.

Do you know what these numbers represent? I am curious why it sometimes
prints 112 and sometimes 116 for some ANSI strings.
I hope the result is consistent and that I can rely on it in my
program for checking whether a string is UTF-8.

Thank you.

Octavian

Unpack returns a list, so $result gets the value of the first item of
the list. Offhand, I'd say the first character of your utf-8 string
was the three-byte character U+FEFF (zero-width no-break space),
bytes EF BB BF. That code point also serves as the byte order mark
(BOM): at the start of a utf-16 stream it appears as the two bytes
FE FF for big-endian or FF FE for little-endian. If all of your data
behaves so nicely, you can just look for the BOM. Note, though, that
the standard does not require a BOM in utf-8, so plenty of valid utf-8
will not begin with one.
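
To make the list-versus-scalar point concrete, here is a minimal sketch. It uses the "C*" template rather than "U0U*" so that the result does not depend on how a given perl release handles U0. The sample string and the helper name has_utf8_bom are made up for illustration; they are not from the original program.

```perl
use strict;
use warnings;

my $content = "pt";    # stand-in for ASCII data

# Scalar assignment: only the first unpacked value is kept.
my $first = unpack("C*", $content);

# List assignment: every unpacked value is kept.
my @all = unpack("C*", $content);

print "$first\n";      # 112
print "@all\n";        # 112 116

# Hypothetical helper: true if the byte string starts with EF BB BF,
# the UTF-8 encoding of U+FEFF.
sub has_utf8_bom {
    my ($bytes) = @_;
    return $bytes =~ /\A\xEF\xBB\xBF/ ? 1 : 0;
}
```

The BOM check only tells you that the data starts with those three bytes. It says nothing about whether the rest of the data is well-formed.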

As for 112 and 116, I'd say all your ascii data began with "p" or "t"
(or something else that perl interpreted as those code points). Keep
in mind that most ascii data is perfectly well-formed utf-8. If what
you want to do is separate ascii from utf-8, test for asciiness and
treat the rest as utf-8.
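
One way to sketch that test, assuming $bytes holds raw undecoded bytes read from the file. The name classify_bytes and its three return values are invented for this example. The validity check uses Encode's decode with FB_CROAK, along the lines of what perluniintro shows, instead of unpack.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 'ascii', 'utf8', or 'other'.
sub classify_bytes {
    my ($bytes) = @_;

    # Pure ASCII: no byte above 0x7F anywhere in the string.
    return 'ascii' if $bytes !~ /[^\x00-\x7F]/;

    # decode() with a CHECK flag may modify its argument, so use a copy.
    my $copy = $bytes;

    # FB_CROAK makes decode() die on malformed input; eval catches that.
    my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };

    return $ok ? 'utf8' : 'other';
}

print classify_bytes("plain text"), "\n";
print classify_bytes("\xEF\xBB\xBFhello"), "\n";
print classify_bytes("caf\xE9 au lait"), "\n";
```

Data that comes back as 'other' is probably in some single-byte encoding such as Latin-1, but this check alone cannot tell you which one.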

HTH,

-- jay
--------------------------------------------------
This email and attachment(s): [  ] blogable; [ x ] ask first; [  ]
private and confidential

daggerquill [at] gmail [dot] com
http://www.tuaw.com  http://www.downloadsquad.com  http://www.engatiki.org

values of β will give rise to dom!
