The copyright symbol is not one of the characters for which there are two representations.
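A quick way to check this is to compare the character's normalized forms. The sketch below assumes that "two representations" refers to precomposed versus decomposed (combining) sequences, which the thread does not state outright. It uses Python's standard `unicodedata` module purely for illustration, and the helper name `canonical_forms` is made up for this example.

```python
import unicodedata


def canonical_forms(ch):
    """Return the NFC and NFD forms of ch, each as a list of U+XXXX strings."""
    def codepoints(s):
        return [f"U+{ord(c):04X}" for c in s]

    nfc = unicodedata.normalize("NFC", ch)  # composed form
    nfd = unicodedata.normalize("NFD", ch)  # decomposed form
    return codepoints(nfc), codepoints(nfd)


# The copyright symbol has no canonical decomposition: both forms are identical.
copyright_forms = canonical_forms("\u00a9")
print(copyright_forms)  # (['U+00A9'], ['U+00A9'])

# For contrast, e-acute can also be written as "e" plus a combining acute accent.
e_acute_forms = canonical_forms("\u00e9")
print(e_acute_forms)  # (['U+00E9'], ['U+0065', 'U+0301'])
```

If the two lists differ, the character has more than one canonically equivalent spelling, and each spelling has its own UTF-8 byte sequence.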
One thing that can confuse people about Unicode is the distinction between the “code point” [1] and the representation of that code point in the various Unicode transformation formats such as UTF-8, UTF-16, UTF-32 and so on. The copyright symbol has code point A9 (in hexadecimal) in both ISO Latin-1 and Unicode, more commonly written with leading zeros as U+00A9. But when U+00A9 is encoded in UTF-8, the actual sequence of bytes in memory or in a file is C2 followed by A9. In UTF-16 and UTF-32 you will see an A9 and enough zero bytes to pad to 2 or 4 bytes respectively, with the added complication that the bytes may be in big-endian or little-endian order, e.g. in UTF-32, A9 00 00 00 for little-endian or 00 00 00 A9 for big-endian.

I always find the www.fileformat.info pages useful for reference [2].

Matthew

[1] https://en.wikipedia.org/wiki/Code_point
[2] http://www.fileformat.info/info/unicode/char/a9/index.htm

From: Shelley Doljack [mailto:sdolj...@stanford.edu]
Sent: 13 November 2015 22:30
To: Highsmith, Anne L; perl4lib@perl.org
Subject: RE: Opening & writing to UTF-8 files; copyright symbol again -- solution

Hey, that’s my post!

Anyways, I haven’t really looked into what your problem is, but when you said that the copyright character is getting transformed to A9 even though it is supposedly stored as C2 A9 in the database, it made me think of how there can be two UTF-8 representations for the same character in some sections of the Unicode set. I wonder if that is somehow happening for you.

Shelley
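The byte sequences described above for U+00A9 can be checked with a short script. This is only a sketch in Python, chosen for illustration since the thread itself contains no code. The helper name `hex_bytes` is made up, and the encoding names are Python's standard codec names.

```python
CODE_POINT = 0xA9          # the copyright symbol, U+00A9
char = chr(CODE_POINT)


def hex_bytes(text, encoding):
    """Encode text and return its bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


# The explicit -le / -be codecs are used so that no byte order mark is added.
encodings = {
    name: hex_bytes(char, name)
    for name in ("latin-1", "utf-8", "utf-16-le", "utf-16-be",
                 "utf-32-le", "utf-32-be")
}

for name, value in encodings.items():
    print(f"{name:10} {value}")
# latin-1    A9
# utf-8      C2 A9
# utf-16-le  A9 00
# utf-16-be  00 A9
# utf-32-le  A9 00 00 00
# utf-32-be  00 00 00 A9
```

The code point is the same in every row. Only the bytes used to store it change, which is why a lone A9 byte points to Latin-1 data and C2 A9 points to UTF-8.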