Dar Scott wrote:

Yeah, there is no need to use binfile, but it is OK to.  If you do, you can 
process the line endings before or after converting to Unicode.

You were not being too cautious, given what you didn't know.  It is normal and 
right to be aware of potential problems and make code robust against them, but 
now you know.

Assuming a valid UTF-8 file...

Only the ASCII characters in UTF-8 have the high bit zero.  They are 
represented as single bytes.  (ASCII files are UTF-8 files.)  All other 
characters are represented with multiple bytes that have the high bit set, not 
just the first but even the following.  (The first byte in binary is 11xxxxxx 
and the continuing bytes are 10xxxxxx.)
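
The bit patterns can be checked directly. What follows is only a minimal sketch, written in Python rather than LiveCode purely for illustration; the helper name utf8_bits and the sample characters are made up for this example.

```python
# Show the UTF-8 bytes of a character as 8-bit binary strings.
# utf8_bits is a hypothetical helper name used only in this sketch.
def utf8_bits(ch):
    return [format(b, "08b") for b in ch.encode("utf-8")]

# One ASCII character and three non-ASCII characters of increasing length.
samples = ["A", "é", "€", "😀"]

for ch in samples:
    print(ch, utf8_bits(ch))

# Every byte of a non-ASCII character has the high bit set (value >= 0x80),
# so none of them can be mistaken for an ASCII byte.
non_ascii_ok = all(
    b >= 0x80 for ch in samples[1:] for b in ch.encode("utf-8")
)
```

For "A" this gives a single byte starting with 0; for the others the first byte starts with 11 and each continuing byte starts with 10.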

This means there are no CR, LF, tab, or comma hidden in the non-ASCII 
characters.  ASCII never has the high bit set.  You can use line and item 
chunks with UTF-8.  You can use offset (with care) and replace.
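
To illustrate why line and item chunks are safe on UTF-8 data, here is a small sketch, again in Python and only for illustration (the function names and sample text are assumptions of this example, not anything from LiveCode). It splits the raw bytes on LF and comma before decoding, then compares that with decoding first and splitting afterwards.

```python
# Sample text with non-ASCII characters, encoded as valid UTF-8.
text = "naïve,café\nΩ,日本語\n"
raw = text.encode("utf-8")

def split_bytes_then_decode(data):
    # Split on the LF and comma bytes first, then decode each piece.
    rows = []
    for line in data.split(b"\n"):
        if line:
            rows.append([item.decode("utf-8") for item in line.split(b",")])
    return rows

def decode_then_split(data):
    # Decode the whole buffer first, then split on the characters.
    decoded = data.decode("utf-8")
    return [line.split(",") for line in decoded.split("\n") if line]

rows_a = split_bytes_then_decode(raw)
rows_b = decode_then_split(raw)
print(rows_a == rows_b)
```

Both orders give the same rows, because no byte inside a multi-byte character can equal the LF or comma byte.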

Thanks for that background, Dar. I had suspected there may have been something that makes such distinctions identifiable, but didn't know the details. Now I can use "file" with confidence (and less work handling line endings).

Really nice to have you back on this list.

--
 Richard Gaskin
 Fourth World
 LiveCode training and consulting: http://www.fourthworld.com
 Webzine for LiveCode developers: http://www.LiveCodeJournal.com
 Follow me on Twitter:  http://twitter.com/FourthWorldSys

_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
