On Aug 18, 2008, at 3:40 PM, mm w wrote:

to avoid the splitting problem

(c < 128) ? "%c" : "\\u%04x", c);

I'm not sure what this solves.

Per Michael's e-mail below, this is indeed a difficult problem. UTF-8 is just a particular scheme to store Unicode strings. Operating on individual bytes in such streams will most likely not make any sense.
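
To illustrate why byte offsets are risky, here is a minimal sketch in C. It assumes the buffer is already valid UTF-8, and the helper names (utf8_is_continuation, utf8_safe_split) are made up for this example. It only moves a proposed split point back to a code point boundary; it knows nothing about combining sequences, so it is not a complete answer to the splitting problem.

```c
#include <stddef.h>

/* Returns nonzero if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: given a proposed split offset pos into s (len bytes,
   assumed to be valid UTF-8), back up until pos sits on the first byte of a
   sequence, so the split never lands inside a multi-byte character. */
static size_t utf8_safe_split(const char *s, size_t len, size_t pos)
{
    if (pos >= len)
        return len;
    while (pos > 0 && utf8_is_continuation((unsigned char)s[pos]))
        pos--;
    return pos;
}
```

For the six bytes of "caf\xC3\xA9!", a split requested at offset 4 (the second byte of the e-acute) is moved back to offset 3.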

What I would do is pick some normalized form and operate on that data. For a recent feature at my day job, we normalized all input CSV files to UTF-16BE, and so far we have been able to handle all of our customer data. The final solution still isn't 100% Unicode-savvy (e.g. it fails on surrogate pairs), but we have unit tests to expose and document such limitations. Customer data also doesn't yet contain such characters.
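
For reference, the surrogate-pair case looks roughly like the sketch below. This is not the code from my day job, just a generic illustration of decoding one code point from big-endian UTF-16; the function name utf16be_decode is an assumption for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from big-endian UTF-16 bytes at p (len bytes available).
   Returns the number of bytes consumed (2 or 4), or 0 if the input is
   truncated or contains an unpaired surrogate. */
static size_t utf16be_decode(const unsigned char *p, size_t len, uint32_t *out)
{
    if (len < 2)
        return 0;
    uint32_t hi = ((uint32_t)p[0] << 8) | p[1];
    if (hi < 0xD800 || hi > 0xDFFF) {   /* ordinary BMP code unit */
        *out = hi;
        return 2;
    }
    if (hi > 0xDBFF)                    /* low surrogate with no high before it */
        return 0;
    if (len < 4)                        /* high surrogate cut off at end of input */
        return 0;
    uint32_t lo = ((uint32_t)p[2] << 8) | p[3];
    if (lo < 0xDC00 || lo > 0xDFFF)     /* high surrogate not followed by a low one */
        return 0;
    *out = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    return 4;
}
```

Code that treats every 16-bit unit as a whole character takes the first branch only, which is exactly where the limitation mentioned above comes from.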


On Sat, Aug 16, 2008 at 7:43 AM, Michael Ash <[EMAIL PROTECTED]> wrote:
- It's very difficult to split UTF-8 strings correctly. If you
encounter a run of non-ASCII characters, ensure that you follow that
run through the end, until you get back to ASCII. Don't have a regex
that stops in the middle of it and then expects your code to be able
to do something useful with it.


___________________________________________________________
Ricky A. Sharp         mailto:[EMAIL PROTECTED]
Instant Interactive(tm)   http://www.instantinteractive.com


