On Oct 3, 2009, at 8:11 AM, Thomas Wetmore wrote:

While scanning the file I look for the attribute that specifies what the file should be, but I also do other checks. For example, I check whether any of the bytes with the high bit set are illegal in ANSEL, and I check for UTF-8 multi-byte encodings. At the end I know whether the file is valid ASCII, not ASCII but valid ANSEL, or not ASCII or ANSEL but valid UTF-8; if it's not valid as any of those three, I assume it's UTF-16.

Can you check for a Unicode BOM for UTF-16, too? Regardless, instead of reading the file as an NSString, I'd recommend reading it into NSData, particularly if you want to look at raw bytes. NSString is not a generic byte container, and you can run into problems if you specify an incorrect encoding.
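As a rough illustration of the BOM check on raw bytes, here is a minimal C sketch. The names bom_kind and detect_bom are made up for this example and are not part of any Apple API; the byte values are the standard Unicode byte-order marks. It assumes the bytes have already been loaded, for example from NSData's bytes and length.

```c
#include <stddef.h>

/* Encodings that a byte-order mark can identify (hypothetical enum). */
typedef enum {
    BOM_NONE,
    BOM_UTF8,
    BOM_UTF16_BE,
    BOM_UTF16_LE
} bom_kind;

/* Inspect the first bytes of a buffer for a Unicode byte-order mark.
 * UTF-32 is not handled: FF FE 00 00 would be reported as UTF-16 LE. */
bom_kind detect_bom(const unsigned char *bytes, size_t length)
{
    if (length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        return BOM_UTF8;
    if (length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        return BOM_UTF16_BE;
    if (length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```

A file without a BOM can still be UTF-16, so BOM_NONE only means the other checks are still needed.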

If the file is UTF-8 or UTF-16 I can just reread it with the correct encoding. However, if it is ANSEL I must do some delicate fiddling to convert it to Unicode.

I am relatively new to Cocoa and NSStrings, so this has led to a few questions.

1. Apparently reading a file to an NSString using the NSASCIIStringEncoding returns each of the bytes of the file exactly as they were, that is, the 8-bit bytes seem to be read exactly as they were. So is it true that reading with NSASCIIStringEncoding doesn't mess around with any of the 8-bit bytes in the file?

I don't know if you can rely on this; NSData is safer, as I mentioned above.
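With the raw bytes in hand, the ASCII and UTF-8 checks described in this thread can run directly on the buffer. The following is only a sketch in plain C; is_ascii and is_valid_utf8 are hypothetical helper names, and the validator rejects overlong forms, surrogates, and values above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if no byte has its high bit set. */
bool is_ascii(const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (p[i] & 0x80)
            return false;
    return true;
}

/* True if the buffer is well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *p, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = p[i];
        size_t extra;          /* continuation bytes expected */
        unsigned long cp, min; /* decoded value, smallest legal value */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return false;     /* stray continuation or invalid lead byte */

        if (n - i <= extra)
            return false;      /* sequence truncated at end of buffer */
        for (size_t k = 1; k <= extra; k++) {
            if ((p[i + k] & 0xC0) != 0x80)
                return false;  /* not a continuation byte */
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;      /* overlong, out of range, or surrogate */
        i += extra + 1;
    }
    return true;
}
```

Every ASCII buffer also passes is_valid_utf8, so the ASCII test has to come first if the two cases are to be told apart.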

2. Given I have an NSString that I read in as NSASCIIStringEncoding but I later determine it should have been read as UTF-8 or UTF-16, can I transform that NSString in place, or must I reread the file with the proper encoding? I don't mind doing the latter, but if there is a conversion solution it would have better performance.

No, you'd need to reread it. However, if you read it as NSData, you can create the string using initWithData:encoding:.

3. I'm imagining two ways to do the ANSEL to Unicode transformation to get the NSString. a. Create a C array of 16-bit shorts and convert the ANSEL to pure Unicode. Is there an API to convert such a C array of 16-bit shorts to an NSString?

NSString's initWithCharacters:length: will read a C array of unichar values (UTF-16 code units).
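To make the C-array approach concrete, here is a hedged sketch of filling a 16-bit buffer from ANSEL bytes. Everything named here (ansel_to_unicode, ansel_is_combining, ansel_to_utf16) is hypothetical, and the mapping table is deliberately tiny and illustrative; a real converter needs the full table from the ANSEL (ANSI/NISO Z39.47) specification, and the few entries shown should be verified against it. The sketch also assumes that ANSEL stores combining diacritics (bytes 0xE0 to 0xFE) before the base character, while Unicode expects them after it, so pending marks are held back and written after the next base character.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative partial mapping only; verify every entry against the
 * ANSEL tables before relying on it. Unknown bytes become U+FFFD. */
uint16_t ansel_to_unicode(unsigned char b)
{
    if (b < 0x80)
        return b;                 /* ASCII range maps to itself */
    switch (b) {
    case 0xA2: return 0x00D8;     /* assumed: O with stroke */
    case 0xE1: return 0x0300;     /* assumed: combining grave */
    case 0xE2: return 0x0301;     /* assumed: combining acute */
    default:   return 0xFFFD;     /* replacement character */
    }
}

/* Assumption: 0xE0..0xFE are ANSEL's combining diacritics. */
int ansel_is_combining(unsigned char b)
{
    return b >= 0xE0 && b <= 0xFE;
}

/* Convert n ANSEL bytes into UTF-16 code units. `out` must have room
 * for at least n units. Returns the number of units written. */
size_t ansel_to_utf16(const unsigned char *in, size_t n, uint16_t *out)
{
    size_t written = 0;
    size_t mark_start = 0;        /* index of first pending diacritic */
    size_t marks = 0;             /* number of pending diacritics */

    for (size_t i = 0; i < n; i++) {
        if (ansel_is_combining(in[i])) {
            if (marks == 0)
                mark_start = i;
            marks++;
            continue;
        }
        /* Base character first, then the diacritics that preceded it. */
        out[written++] = ansel_to_unicode(in[i]);
        for (size_t k = 0; k < marks; k++)
            out[written++] = ansel_to_unicode(in[mark_start + k]);
        marks = 0;
    }
    /* Diacritics with no following base character are emitted as-is. */
    for (size_t k = 0; k < marks; k++)
        out[written++] = ansel_to_unicode(in[mark_start + k]);
    return written;
}
```

Since unichar is a 16-bit unsigned type, the filled buffer and the returned count should be usable as the arguments to initWithCharacters:length:.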

b. Create a new NSString directly by building it up character by character. Would performance suffer greatly over the former approach?

I'd avoid that, unless you're dealing with small strings. If your conversion operates at the character level, stick with C arrays or NSMutableData; if you need to combine a unichar buffer with the convenience of NSMutableString, you can use CFStringCreateMutableWithExternalCharactersNoCopy, but that can be tricky.

  c. Is there an easier approach I am not seeing?

I noticed that CFStringEncoding lists kCFStringEncodingANSEL as an external encoding, but CFStringIsEncodingAvailable returns false, unfortunately. You could probably write a plugin for the Text Encoding Converter, but I've never tried that myself.

http://developer.apple.com/mac/library/documentation/Carbon/Conceptual/ProgWithTECM/tecmgr_about/tecmgr_about.html

