to avoid the splitting problem (c < 128) ? "%c" : "\\u%04x", c);
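The fragment above looks like the argument list of a printf-style call that writes ASCII characters unchanged and escapes everything else as \uXXXX, so that Flex only ever sees ASCII and never has to split a multi-byte sequence. A minimal sketch of that idea in plain C follows. The function name escape_non_ascii, the use of uint16_t in place of unichar, and the buffer handling are assumptions made for illustration, not part of the original message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the original post): writes UTF-16 code
   units as pure ASCII, escaping every unit >= 128 as \uXXXX.
   Returns the number of bytes written, excluding the terminating NUL,
   or -1 if the output buffer is too small. */
static int escape_non_ascii(const uint16_t *units, size_t count,
                            char *out, size_t out_size)
{
    assert(out != NULL);
    if (out_size == 0)
        return -1;
    out[0] = '\0';

    size_t used = 0;
    for (size_t i = 0; i < count; i++) {
        unsigned c = units[i];
        /* Same conditional format string as in the fragment above. */
        int n = snprintf(out + used, out_size - used,
                         (c < 128) ? "%c" : "\\u%04x", c);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;          /* output would have been truncated */
        used += (size_t)n;
    }
    return (int)used;
}
```

With the code units for 'a', U+00E9 and 'b' as input, the buffer ends up holding the eight ASCII bytes a\u00e9b. Two caveats on this sketch: a literal backslash in the input is not escaped, so the mapping is not reversible as written, and a zero code unit would be emitted as a raw NUL byte. The actions in the lexer would have to decode the escapes again before building an NSString.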
On Sat, Aug 16, 2008 at 7:43 AM, Michael Ash <[EMAIL PROTECTED]> wrote:
> On Fri, Aug 15, 2008 at 10:53 PM, John Joyce
> <[EMAIL PROTECTED]> wrote:
>> Right now, I'm toying with using Flex/Lex in a Cocoa project.
>> Unfortunately, I don't see a reliable or easy way to handle NSStrings
>> correctly all the time with Flex.
>> Does anybody have any suggestions for such text handling and reliable
>> unicode aware regexes?
>> I'm seriously not interested in implementing such details in C with Flex.
>> Flex is fast and cool for that, but if it's going to be stupidly difficult
>> to use reliably with other languages on a mac, it's not a good idea for me.
>
> Depending on exactly what you need, unicode awareness can be fairly
> straightforward.
>
> Commonly, unicode in regexes is only needed to pass through
> undifferentiated blobs of text, with ASCII delimiters. For example,
> imagine parsing a CSV file which potentially has unicode text inside
> the quotes. For this case, you can convert the file to UTF-8, and then
> constructs like . will accept them. All non-ASCII characters in UTF-8
> are represented as bytes 128-255, so if you just pass those through
> then you'll be fine. But be aware of some potential problem areas:
>
> - Each non-ASCII character will be more than one byte, and flex will
> think of it as more than one character. Write your regexes
> accordingly. In particular, avoid length limits on runs of arbitrary
> characters, and avoid using non-ASCII characters directly in your
> regex.
>
> - It's very difficult to split UTF-8 strings correctly. If you
> encounter a run of non-ASCII characters, ensure that you follow that
> run through the end, until you get back to ASCII. Don't have a regex
> that stops in the middle of it and then expects your code to be able
> to do something useful with it.
>
> - If you need to do something with non-ASCII characters besides read
> them in one side and write them out the other, for example doing
> something special with all accented characters, then Flex is probably
> not the right answer.
>
> Besides this it ought to be pretty straightforward. Since Flex just
> passes your code straight through to the compiler, you can write
> Objective-C in the actions (as long as you compile the result as
> Objective-C, of course!), convert the text from UTF-8 back to an
> NSString, and take things from there.
>
> Mike
> _______________________________________________
>
> Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
>
> Please do not post admin requests or moderator comments to the list.
> Contact the moderators at cocoa-dev-admins(at)lists.apple.com
>
> Help/Unsubscribe/Update your Subscription:
> http://lists.apple.com/mailman/options/cocoa-dev/openspecies%40gmail.com
>
> This email sent to [EMAIL PROTECTED]
>

--
-mmw