On 1/3/07, Octavian Rasnita <[EMAIL PROTECTED]> wrote:
Hi,
I want to check if a certain string is UTF-8 or not.
I have tried using is_utf8 from the Encode module, and utf8::is_utf8() but
the string is detected incorrectly.
For example, if I have a UTF-8 encoded file and an ANSI encoded file, if I
open them both without "<:utf8", is_utf8 shows that they are not UTF-8
strings, and if I open the files using "<:utf8", then is_utf8 shows that
they both are UTF-8 strings.
I want to detect which file is UTF-8 encoded and which is not.
Actually, I want to get a text from a database and check if it is UTF-8
encoded, and if it is not, to encode it as UTF-8, because I don't want to
encode a text as UTF-8 twice.
Can you tell me how this can be done?
Thank you.
Octavian
Try to unpack the data--or a chunk of data you feel is large enough to
be representative--with the pattern U0U*. If the unpack succeeds with
no warnings, you have valid UTF-8. You could try the same thing with
Encode's 'decode_utf8' routine. See perluniintro for details. In both
cases, though, you need to make sure that you've grabbed well-formed
UTF-8 from the source file in the first place. If the data cuts off in
the middle of a multi-byte character, you'll get an error.
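Here is a minimal sketch of the decode-based check, assuming the data
is held as raw bytes (read without a ":utf8" layer). The helper name
looks_like_utf8 is made up for illustration. Note that Encode only dies
on malformed input when you pass a check value such as
Encode::FB_CROAK; with no check value it substitutes replacement
characters instead. Pure ASCII data also counts as valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if $bytes is well-formed UTF-8, else 0.
sub looks_like_utf8 {
    my ($bytes) = @_;
    # Work on a copy: decode() may modify its source when a check is set.
    my $copy = $bytes;
    # FB_CROAK makes decode() die on the first malformed sequence.
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# "caf\xC3\xA9" is the UTF-8 byte sequence for "cafe" with an acute e.
print looks_like_utf8("caf\xC3\xA9") ? "valid\n" : "invalid\n";
# "caf\xE9" is the same word in Latin-1; \xE9 alone is not valid UTF-8.
print looks_like_utf8("caf\xE9") ? "valid\n" : "invalid\n";
```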
This may sound like a kludge, and it is. Perl has no way of knowing
whether your data is UTF-8 data; it's just a stream of bytes, or maybe
just bits. You have to tell Perl whether to interpret those bytes as a
particular character encoding, or just let it guess. Why? Because file
formats aren't mutually exclusive. There is nothing to prevent Unicode
or ASCII characters from appearing in other file types. You could have
a JPEG image composed entirely of bytes that correspond to Unicode
characters. Encodings like uuencode are designed to ASCII-armor binary
files. Only the programmer knows what the input is supposed to be, and
what sort of conversion should take place.
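To make the ambiguity concrete: the same two bytes are valid UTF-8 and
valid ISO-8859-1, and they decode to different text. The sketch below
shows that, and then one possible way to avoid encoding twice. It
assumes the non-UTF-8 data is ISO-8859-1 (the question only says
"ANSI", so that is a guess; it may well be cp1252 on Windows). The
helper name ensure_utf8 is hypothetical, and the check is only a
heuristic: Latin-1 text that happens to form valid UTF-8 will be left
untouched.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Same bytes, two readings: Perl cannot tell which one was intended.
my $bytes     = "\xC3\xA9";
my $as_utf8   = decode('UTF-8',      $bytes);   # one character, U+00E9
my $as_latin1 = decode('ISO-8859-1', $bytes);   # two characters
printf "%d vs %d characters\n", length($as_utf8), length($as_latin1);

# Hypothetical helper: return $bytes unchanged if it is already
# well-formed UTF-8; otherwise treat it as $fallback and encode it.
sub ensure_utf8 {
    my ($bytes, $fallback) = @_;
    $fallback ||= 'ISO-8859-1';    # assumption, adjust to your data
    my $copy = $bytes;             # decode() may modify its source
    return $bytes
        if eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return encode('UTF-8', decode($fallback, $bytes));
}
```

Calling ensure_utf8 on its own output returns the bytes unchanged,
which is the "don't encode twice" property the question asks for.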
HTH,
-- jay