This message is mostly for edification and mailing-list / searching posterity. Hopefully others will find it useful.
There has been support for Unicode in constant @"" NSStrings since Xcode 3.0, but it wasn't a widely known or well documented feature (at least, that's my impression). Prior to Xcode 3.0 you were limited to 7-bit ASCII characters only, and creating Unicode strings required a bit of effort: the usual way was to create a UTF-8 encoded C string and then create an NSString from that, for example [NSString stringWithUTF8String:"\342\202\254 \303\237"].

I filed a bug (#5799172) to have the documentation updated to reflect the new functionality, and I just got a note that the bug was being closed because it has been addressed in the latest round of documentation updates. The feature is now 'officially documented', which is great news for anyone who has to deal with Unicode strings in their source files:

http://developer.apple.com/documentation/Cocoa/Conceptual/ObjectiveC/Articles/chapter_950_section_5.html#//apple_ref/doc/uid/TP30001163-CH3-TPXREF104

Specifically, the section regarding @"string", which I'll copy and paste here:

---
Defines a constant NSString object in the current module and initializes the object with the specified string. On Mac OS X v10.4 and earlier, the string must be 7-bit ASCII-encoded. On Mac OS X v10.5 and later (with Xcode 3.0 and later), you can also use UTF-16 encoded strings. (The runtime from Mac OS X v10.2 and later supports UTF-16 encoded strings, so if you use Mac OS X v10.5 to compile an application for Mac OS X v10.2 and later, you can use UTF-16 encoded strings.)
---

First, a warning: the rest of this message should not be taken as an 'authoritative reference' on the topic, as many of the statements below have been gleaned from a combination of reading the C99 standard, the GCC sources, educated guesses, and occasionally outright speculation. You have been warned.

The first sentence from http://developer.apple.com/documentation/DeveloperTools/gcc-4.2.1/cpp/Character-sets.html#Character-sets sums it up nicely: "Source code character set processing in C and related languages is rather complicated." This is an understatement.

GCC uses UTF-8 as its default 'source character set'. As a general rule of thumb, things will probably work out the way you expect them to if the source code that GCC is given to compile is encoded as UTF-8 (UTF-8 is a superset of 7-bit ASCII, i.e. all 7-bit ASCII is valid UTF-8).

C99 also defines something called the 'execution character set'. This is the character set that the executing program will use, and the character set that string literals are converted to for their binary representations. For example, a source file in EBCDIC compiled with UTF-8 as the execution character set will perform the following conversion:

EBCDIC bytes for "HIJK": 0xC8 0xC9 0xD1 0xD2
UTF-8 bytes for "HIJK":  0x48 0x49 0x4A 0x4B  // Bytes that end up in the object file.

It gets more complicated from here, but since Macs are already Unicode savvy and UTF-8 (or ASCII) is the default source character set, I'm just going to skip over those details. I mention this because, for me at least, I have a mental model that expects a string literal in source to always convert to the same sequence of bytes no matter what. That's actually not the case under C99, and it might catch you off guard. Thankfully, the GCC default of UTF-8 is likely to produce the results you're expecting. One gotcha is that Mac applications, such as Xcode, will use MacRoman as the default character set.
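Before getting into the Xcode settings, here's a minimal before-and-after sketch (assuming a UTF-8 encoded source file and Xcode 3.0 / gcc 4+; the variable names are just for illustration) showing the old stringWithUTF8String: workaround next to the new direct literal:

#import <Foundation/Foundation.h>

int main(int argc, char *argv[]) {
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    // The old, pre-Xcode 3.0 workaround: an octal escaped UTF-8 C string, wrapped at run time.
    NSString *oldStyle = [NSString stringWithUTF8String:"\342\202\254 \303\237"];   // "€ ß"

    // The new way: paste the Unicode characters directly into the @"" literal.
    NSString *newStyle = @"€ ß";

    NSLog(@"oldStyle: '%@'  newStyle: '%@'  equal: %d",
          oldStyle, newStyle, [oldStyle isEqualToString:newStyle]);   // equal: 1

    [pool drain];
    return 0;
}

Both should log the same two characters and compare equal; the practical difference is that the constant @"" version does its conversion at compile time instead of at run time.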
If you're using Xcode, you may have to change the source character set encoding to UTF-8 to get things to work seamlessly. You can change this in Xcode by doing a 'Get Info' on a source file and then choosing the 'General' tab; roughly in the middle there should be a field for the file encoding. If you change the file's encoding, I'm pretty sure Xcode will ask if you want to convert the contents of the file to the new encoding, but it's been a while since I've had to do that. You can set the default file encoding in the Xcode Preferences > Text Editing section.

Now, back to Cocoa and Unicode in @"" strings. The documentation, at least as I read it, is slightly misleading. It is possible to have constant Unicode @"" NSStrings, but they do not necessarily have to be encoded as UTF-16 in your source code for you to take advantage of them. This part is speculation, but it's fairly reasonable. What I suspect happens under the hood is that the GCC code for @"" strings goes something like this:

o Escape sequences, such as \n, \u, and \U, are converted to their respective byte sequences in the source character set.
o The GCC ObjC @"" 'function' examines the string literal:
  - If the string literal contains only 7-bit ASCII (or possibly MacRoman) characters, then the old / normal constant string object creation process is executed using the simple 8-bit representation of the string. The bytes for the string are stored in the __cstring section of the object file.
  - If the string literal contains characters beyond 7-bit ASCII (i.e. Unicode), then the string is converted to UTF-16 using the target architecture's endianness. The UTF-16 bytes are stored in the __ustring section of the object file.

Using UTF-8 as the source file encoding (again, this is the default for GCC, but might not be for your source files in Xcode, which I believe used to default to MacRoman), the following code 'just works':

NSString *unicodeString0 = @"0: € ß";
// Stored in the source code as the UTF-8 sequence 30 3a 20 e2 82 ac 20 c3 9f. NSString at execution time: '0: € ß'.

NSString *unicodeString1 = @"1: \u20ac \u00df";
// C99 \u style escapes. Requires -std=gnu99 (or equivalent) or the compiler will issue a warning. NSString at execution time: '1: € ß'.

NSString *unicodeString2 = @"2: \342\202\254 \303\237";
// Octal escaped UTF-8 sequence. NSString at execution time: '2: € ß'.

NSString *unicodeString3 = @"3: $ ss";
// 7-bit ASCII only, not converted to UTF-16, remains 8 bit. NSString at execution time: '3: $ ss'.

For the curious, the above also works when the source code is converted to UTF-16 and gcc is called with '-finput-charset=UTF-16 -std=gnu99'. Even the octal escaped UTF-8 sequence is correctly 'interpreted' and produces the 'correct' results. A small hitch I encountered when I gave this a try was that gcc decided that all source files were encoded as UTF-16, not just the source file in question. This is obviously a problem for #include / #import'ed header files, which are almost certainly in ASCII / UTF-8. Maybe there's a way to correct that, but it was just a quick test to see what would happen.

-----

The bottom line is that if you're using UTF-8 as your source code character set, you can now just copy and paste Unicode text straight into your constant @"" strings. The compiler will automagically pick the best encoding for the characters in the string. This of course assumes that you're using Xcode 3.0+ / gcc 4+ on 10.5+ to compile said source code.
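If you're curious whether a particular constant string ended up with the 8-bit __cstring treatment or the UTF-16 __ustring treatment, here's a rough sketch for poking at it. The exact values returned are implementation details and not guaranteed; it just uses -fastestEncoding and CFStringGetCStringPtr() as hints:

#import <Foundation/Foundation.h>

int main(int argc, char *argv[]) {
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    NSString *asciiOnly  = @"3: $ ss";   // Expected to stay a simple 8-bit constant string.
    NSString *hasUnicode = @"0: € ß";    // Expected to be stored as UTF-16.

    // -fastestEncoding hints at the internal storage; CFStringGetCStringPtr() returns a
    // non-NULL pointer only when an 8-bit representation is directly available.
    NSLog(@"asciiOnly:  fastestEncoding = %lu, 8-bit ptr = %p",
          (unsigned long)[asciiOnly fastestEncoding],
          (const void *)CFStringGetCStringPtr((CFStringRef)asciiOnly, kCFStringEncodingMacRoman));
    NSLog(@"hasUnicode: fastestEncoding = %lu, 8-bit ptr = %p",
          (unsigned long)[hasUnicode fastestEncoding],
          (const void *)CFStringGetCStringPtr((CFStringRef)hasUnicode, kCFStringEncodingMacRoman));

    [pool drain];
    return 0;
}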
The resulting object files / executables that use the new Unicode string functionality will work all the way back to 10.2, so this isn't a case of the binaries only running on 10.5+. And there are no UTF-16 endian issues to worry about, since the compiler builds the UTF-16 strings separately for each architecture targeted.

If you have Safari 3.1+ and Javascript enabled, you can take a look at some related documentation I recently wrote:

http://regexkit.sourceforge.net/RegexKitLite/index.html#RegexKitLiteCookbook

Again, if you're running a supported version of Safari (3.1+) and have Javascript enabled, then after the introductory paragraphs there should be a section titled "Enhanced Copy To Clipboard Functionality". If the browser you're using isn't supported, the "Enhanced Copy To Clipboard Functionality" is disabled and that section remains hidden. It covers some of these same points, and it also includes functionality to create NSStrings containing Unicode that are copied to the clipboard so you can paste them into your source code. It also deals with escaping '\' backslashes and other problematic C string literal characters, since its primary purpose is to simplify the process of correctly escaping a regular expression (which makes heavy use of the '\' character) for use in an NSString / RegexKitLite.
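As a quick, hedged illustration of that escaping issue (plain Foundation only, no RegexKitLite calls; the regex and variable name are made up for the example): every backslash the regex engine should see has to be doubled in the C / Objective-C literal, because the compiler consumes one level of escaping first.

// The regex \d{4}-\d{2} written as an NSString literal: each '\' is doubled.
NSString *yearMonthRegex = @"\\d{4}-\\d{2}";
NSLog(@"%@", yearMonthRegex);   // Prints: \d{4}-\d{2}

The cookbook page linked above automates exactly this kind of escaping for you.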