This message is mostly for edification and mailing-list / searching posterity. Hopefully others will find it useful.
There has been support for Unicode in constant @"" NSStrings since Xcode 3.0, but it wasn't a widely known or well documented feature (at least, that's my impression). Prior to Xcode 3.0 you were limited to 7-bit ASCII characters only, and creating Unicode strings required a bit of effort: the usual way was to create a UTF-8 encoded C string and then create an NSString from that, for example [NSString stringWithUTF8String:"\342\202\254 \303\237"].

I filed a bug (#5799172) to have the documentation updated to reflect the new functionality, and I just got a note that the bug was being closed because it has been addressed in the latest round of documentation updates. The feature is now 'officially documented', which is great news for anyone who has to deal with Unicode strings in their source files:

http://developer.apple.com/documentation/Cocoa/Conceptual/ObjectiveC/Articles/chapter_950_section_5.html#//apple_ref/doc/uid/TP30001163-CH3-TPXREF104

Specifically, the section regarding @"string", which I'll copy and paste here:

---
Defines a constant NSString object in the current module and initializes the object with the specified string. On Mac OS X v10.4 and earlier, the string must be 7-bit ASCII-encoded. On Mac OS X v10.5 and later (with Xcode 3.0 and later), you can also use UTF-16 encoded strings. (The runtime from Mac OS X v10.2 and later supports UTF-16 encoded strings, so if you use Mac OS X v10.5 to compile an application for Mac OS X v10.2 and later, you can use UTF-16 encoded strings.)
---

First, a warning: the rest of this message should not be taken as an 'authoritative reference' on the topic, as many of the statements below have been gleaned from a combination of reading the C99 standard, the GCC sources, educated guesses, and occasionally outright speculation. You have been warned.

The first sentence from http://developer.apple.com/documentation/DeveloperTools/gcc-4.2.1/cpp/Character-sets.html#Character-sets sums it up nicely: "Source code character set processing in C and related languages is rather complicated." This is an understatement.

GCC uses UTF-8 as its default 'source character set'. As a general rule of thumb, things will probably work out the way you expect them to if the source code that GCC is given to compile is encoded as UTF-8 (UTF-8 is a superset of 7-bit ASCII, i.e. all 7-bit ASCII is valid UTF-8).

C99 also defines something called the 'execution character set'. This is the character set that the executing program will use, and the character set that string literals are converted to for their binary representations. For example, a source file in EBCDIC compiled with UTF-8 as the execution character set will perform the following conversion:

EBCDIC bytes for "HIJK": 0xC8 0xC9 0xD1 0xD2
UTF-8 bytes for "HIJK":  0x48 0x49 0x4A 0x4B  // Bytes that end up in the object file.

It gets more complicated from here, but since Macs are already Unicode savvy and UTF-8 (or ASCII) is the default source character set, I'm just going to skip over those details. I mention this because, for me at least, I have a mental model that expects a string literal in source to always convert to the same sequence of bytes no matter what. That's actually not the case under C99, and it might catch you off guard. Thankfully, the GCC default of UTF-8 is likely to produce the results you're expecting. One gotcha is that Mac applications, such as Xcode, will use MacRoman as the default character set.
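Before getting into the Xcode settings, here's a minimal before-and-after sketch (assuming a UTF-8 encoded source file and Xcode 3.0 / gcc 4+; the variable names are just for illustration) showing the old stringWithUTF8String: workaround next to the new direct literal:

#import <Foundation/Foundation.h>

int main(int argc, char *argv[]) {
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    // The old, pre-Xcode 3.0 workaround: an octal escaped UTF-8 C string, wrapped at run time.
    NSString *oldStyle = [NSString stringWithUTF8String:"\342\202\254 \303\237"];   // "€ ß"

    // The new way: paste the Unicode characters directly into the @"" literal.
    NSString *newStyle = @"€ ß";

    NSLog(@"oldStyle: '%@'  newStyle: '%@'  equal: %d",
          oldStyle, newStyle, [oldStyle isEqualToString:newStyle]);   // equal: 1

    [pool drain];
    return 0;
}

Both should log the same two characters and compare equal; the practical difference is that the constant @"" version does its conversion at compile time instead of at run time.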
If you're using Xcode, you may have to change the source character set encoding to UTF-8 to get things to work seamlessly. You can change this in Xcode by doing a 'Get Info' on a source file and then choosing the 'General' tab; roughly in the middle there should be a field for the file encoding. If you change the file's encoding, I'm pretty sure Xcode will ask if you want to convert the contents of the file to the new encoding, but it's been a while since I've had to do that. You can set the default file encoding in the Xcode Preferences > Text Editing section.

Now, back to Cocoa and Unicode in @"" strings. The documentation, at least as I read it, is slightly misleading. It is possible to have constant Unicode @"" NSStrings, but they do not necessarily have to be encoded as UTF-16 in your source code for you to take advantage of them. This part is speculation, but it's fairly reasonable. What I suspect happens under the hood is that the GCC code for @"" strings goes something like this:

o Escape sequences, such as \n, \u, and \U, are converted to their respective byte sequences in the source character set.
o The GCC ObjC @"" 'function' examines the string literal:
  - If the string literal contains only 7-bit ASCII (or possibly MacRoman) characters, then the old / normal constant string object creation process is executed using the simple 8-bit representation of the string. The bytes for the string are stored in the __cstring section of the object file.
  - If the string literal contains characters beyond 7-bit ASCII (i.e. Unicode), then the string is converted to UTF-16 using the target architecture's endianness. The UTF-16 bytes are stored in the __ustring section of the object file.

Using UTF-8 as the source file encoding (again, this is the default for GCC, but might not be for your source files in Xcode, which I believe used to default to MacRoman), the following code 'just works':

NSString *unicodeString0 = @"0: € ß";
// Stored in the source code as the UTF-8 sequence 30 3a 20 e2 82 ac 20 c3 9f. NSString at execution time: '0: € ß'.

NSString *unicodeString1 = @"1: \u20ac \u00df";
// C99 \u style escapes. Requires -std=gnu99 (or equivalent) or the compiler will issue a warning. NSString at execution time: '1: € ß'.

NSString *unicodeString2 = @"2: \342\202\254 \303\237";
// Octal escaped UTF-8 sequence. NSString at execution time: '2: € ß'.

NSString *unicodeString3 = @"3: $ ss";
// 7-bit ASCII only, not converted to UTF-16, remains 8 bit. NSString at execution time: '3: $ ss'.

For the curious, the above also works when the source code is converted to UTF-16 and gcc is called with '-finput-charset=UTF-16 -std=gnu99'. Even the octal escaped UTF-8 sequence is correctly 'interpreted' and produces the 'correct' results. A small hitch I encountered when I gave this a try was that gcc decided that all source files were encoded as UTF-16, not just the source file in question. This is obviously a problem for #include / #import'ed header files, which are almost certainly in ASCII / UTF-8. Maybe there's a way to correct that, but it was just a quick test to see what would happen.

-----

The bottom line is that if you're using UTF-8 as your source code character set, you can now just copy and paste Unicode text straight into your constant @"" strings. The compiler will automagically pick the best encoding for the characters in the string. This of course assumes that you're using Xcode 3.0+ / gcc 4+ on 10.5+ to compile said source code.
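If you're curious whether a particular constant string ended up with the 8-bit __cstring treatment or the UTF-16 __ustring treatment, here's a rough sketch for poking at it. The exact values returned are implementation details and not guaranteed; it just uses -fastestEncoding and CFStringGetCStringPtr() as hints:

#import <Foundation/Foundation.h>

int main(int argc, char *argv[]) {
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    NSString *asciiOnly  = @"3: $ ss";   // Expected to stay a simple 8-bit constant string.
    NSString *hasUnicode = @"0: € ß";    // Expected to be stored as UTF-16.

    // -fastestEncoding hints at the internal storage; CFStringGetCStringPtr() returns a
    // non-NULL pointer only when an 8-bit representation is directly available.
    NSLog(@"asciiOnly:  fastestEncoding = %lu, 8-bit ptr = %p",
          (unsigned long)[asciiOnly fastestEncoding],
          (const void *)CFStringGetCStringPtr((CFStringRef)asciiOnly, kCFStringEncodingMacRoman));
    NSLog(@"hasUnicode: fastestEncoding = %lu, 8-bit ptr = %p",
          (unsigned long)[hasUnicode fastestEncoding],
          (const void *)CFStringGetCStringPtr((CFStringRef)hasUnicode, kCFStringEncodingMacRoman));

    [pool drain];
    return 0;
}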
The resulting object files / executables that use the new Unicode string functionality will work all the way back to 10.2, so this isn't a case of the binaries only running on 10.5+. And there are no UTF-16 endian issues to worry about, since the compiler builds the UTF-16 strings separately for each architecture targeted.

If you have Safari 3.1+ and Javascript enabled, you can take a look at some related documentation I recently wrote:

http://regexkit.sourceforge.net/RegexKitLite/index.html#RegexKitLiteCookbook

Again, if you're running a supported version of Safari (3.1+) and have Javascript enabled, then after the introductory paragraphs there should be a section titled "Enhanced Copy To Clipboard Functionality". If the browser you're using isn't supported, the "Enhanced Copy To Clipboard Functionality" is disabled and that section remains hidden. It covers some of these same points, and it also includes functionality to create NSStrings containing Unicode that are copied to the clipboard so you can paste them into your source code. It also deals with escaping '\' backslashes and other problematic C string literal characters, since its primary purpose is to simplify the process of correctly escaping a regular expression (which makes heavy use of the '\' character) for use in an NSString / RegexKitLite.
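As a quick, hedged illustration of that escaping issue (plain Foundation only, no RegexKitLite calls; the regex and variable name are made up for the example): every backslash the regex engine should see has to be doubled in the C / Objective-C literal, because the compiler consumes one level of escaping first.

// The regex \d{4}-\d{2} written as an NSString literal: each '\' is doubled.
NSString *yearMonthRegex = @"\\d{4}-\\d{2}";
NSLog(@"%@", yearMonthRegex);   // Prints: \d{4}-\d{2}

The cookbook page linked above automates exactly this kind of escaping for you.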