On Aug 18, 2008, at 3:40 PM, mm w wrote:
to avoid the splitting problem
(c < 128) ? "%c" : "\\u%04x", c);
I'm not sure what this solves.
Per Michael's e-mail below, this is indeed a difficult problem. UTF-8
is just one encoding scheme for storing Unicode strings. Operating on
individual bytes in such a stream rarely makes sense, because a single
character can span several bytes.
What I would do is pick a single canonical encoding, convert all input
to it, and operate on that data. For a recent feature at my day job, we
normalized all input CSV files to UTF-16BE. So far, that has handled all
of our customer data. The final solution still isn't 100% Unicode-savvy
(e.g. it fails on surrogate pairs), but we have unit tests to expose and
document such limitations, and our customer data doesn't yet contain
such things.
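
To illustrate the surrogate-pair case mentioned above, here is a minimal C
sketch. It assumes the UTF-16BE bytes have already been converted to
host-order 16-bit code units, and the name utf16_next is hypothetical (it
is not a Cocoa or CoreFoundation call). It is not the code from the
feature described above, only one way the decoding could look.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* High (lead) surrogates are U+D800..U+DBFF, low (trail) are U+DC00..U+DFFF. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/*
 * Hypothetical helper: decode one code point starting at units[*i] and
 * advance *i past it. Assumes host-order code units and *i < len.
 * An unpaired surrogate is reported as U+FFFD (replacement character).
 */
uint32_t utf16_next(const uint16_t *units, size_t len, size_t *i)
{
    assert(units != NULL && i != NULL && *i < len);

    uint16_t u = units[(*i)++];

    /* A high surrogate followed by a low surrogate forms one code point. */
    if (is_high_surrogate(u) && *i < len && is_low_surrogate(units[*i])) {
        uint16_t lo = units[(*i)++];
        return 0x10000u + (((uint32_t)(u - 0xD800) << 10)
                           | (uint32_t)(lo - 0xDC00));
    }

    /* Any surrogate that reaches this point has no valid partner. */
    if (is_high_surrogate(u) || is_low_surrogate(u))
        return 0xFFFD;

    return u;
}
```

Code that walks the buffer with a helper like this, instead of indexing
16-bit units directly, never sees half of a pair.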
On Sat, Aug 16, 2008 at 7:43 AM, Michael Ash <[EMAIL PROTECTED]>
wrote:
- It's very difficult to split UTF-8 strings correctly. If you
encounter a run of non-ASCII characters, ensure that you follow that
run through the end, until you get back to ASCII. Don't have a regex
that stops in the middle of it and then expects your code to be able
to do something useful with it.
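
As a rough sketch of the splitting concern, the C function below backs a
proposed split offset up to the start of a UTF-8 sequence. Note that this
is a narrower technique than the advice above: it only guarantees that no
single encoded character is cut in half, it does not follow a whole
non-ASCII run back to ASCII, and it does not keep combining sequences
together. The name utf8_safe_split is hypothetical, and the input is
assumed to be well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

/*
 * Hypothetical helper: return the largest offset <= want at which buf
 * can be split without cutting a multi-byte sequence in two.
 * Assumes buf holds well-formed UTF-8 of length len.
 */
size_t utf8_safe_split(const unsigned char *buf, size_t len, size_t want)
{
    assert(buf != NULL);

    /* Splitting at or past the end is always safe. */
    if (want >= len)
        return len;

    /* Back up while the byte at the split point continues a sequence. */
    while (want > 0 && is_continuation(buf[want]))
        want--;

    return want;
}
```

For the bytes 61 C3 A9 62 ("a", then U+00E9 encoded as two bytes, then
"b"), a requested split at offset 2 would land inside the two-byte
sequence, so the function returns 1 instead.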
___________________________________________________________
Ricky A. Sharp mailto:[EMAIL PROTECTED]
Instant Interactive(tm) http://www.instantinteractive.com