RE: Opening & writing to UTF-8 files; copyright symbol again -- solution

PHILLIPS M.E. Mon, 16 Nov 2015 01:50:10 -0800

The copyright symbol is not one of the characters for which there are two 
representations.
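
As a rough illustration, and assuming the "two representations" in question are the precomposed and decomposed forms that Unicode normalization deals with (that is my reading, not something stated in the thread), the sketch below compares an accented letter with the copyright symbol using Python's standard unicodedata module. The variable names are just for illustration.

```python
import unicodedata

# U+00E9 (e with acute accent) has a canonical decomposition, so it can be
# written either as one code point or as "e" plus a combining accent.
e_acute = "\u00e9"
e_acute_decomposed = unicodedata.normalize("NFD", e_acute)

# U+00A9 (copyright sign) has no decomposition, so normalizing changes nothing.
copyright_sign = "\u00a9"
copyright_decomposed = unicodedata.normalize("NFD", copyright_sign)

print(len(e_acute), len(e_acute_decomposed))
print(len(copyright_sign), len(copyright_decomposed))
```

If that reading is right, only the accented letter changes length under normalization; the copyright symbol stays a single code point either way.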


One thing that can confuse people about Unicode is the distinction between the 
“code point”[1] and the representation of the code point in the various Unicode 
transformation formats such as UTF-8, UTF-16, UTF-32 and so on.

The copyright symbol has code point A9 (represented in hexadecimal) in both 
ISO-Latin-1 and Unicode, more commonly written with some leading zeros, e.g. 
U+00A9. But when A9 is represented in UTF-8 the actual sequence of bytes in 
memory or in a file is C2 followed by A9.  In UTF-16 and UTF-32 you will see an 
A9 and enough zero bytes to pad to 2 or 4 bytes respectively, but there you 
will have the complication that the bytes may be in big-endian or little-endian 
order, i.e. in UTF-32, A9 00 00 00 for little-endian, or 00 00 00 A9 for 
big-endian (in UTF-16, A9 00 or 00 A9).
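
The byte sequences described above can be checked with a few lines of code. This is only a small sketch in Python (not the code from the original problem); the helper name hex_bytes is made up for the example, and the encoding names are the ones Python's standard codecs use.

```python
# One code point, U+00A9, encoded in several ways.
copyright_sign = "\u00a9"

encodings = ["latin-1", "utf-8", "utf-16-le", "utf-16-be",
             "utf-32-le", "utf-32-be"]


def hex_bytes(text, encoding):
    """Return the encoded bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


for name in encodings:
    print(f"{name:10} {hex_bytes(copyright_sign, name)}")
```

The code point is the same in every case; only the bytes that end up in memory or in the file differ from one encoding to the next.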

I always find the www.fileformat.info pages useful 
for reference [2].

Matthew


[1] https://en.wikipedia.org/wiki/Code_point
[2] http://www.fileformat.info/info/unicode/char/a9/index.htm

From: Shelley Doljack [mailto:sdolj...@stanford.edu]
Sent: 13 November 2015 22:30
To: Highsmith, Anne L; perl4lib@perl.org
Subject: RE: Opening & writing to UTF-8 files; copyright symbol again -- 
solution

Hey, that’s my post! Anyways, I haven’t really looked into what your problem 
is, but when you said that the copyright character is getting transformed to A9 
even though it is supposedly stored as C2 A9 in the database, it made me think 
of how there can be two UTF-8 representations for the same character in some 
sections of the Unicode set. I wonder if that is somehow happening for you.

Shelley
