On 13 Jan 2011, at 01:55, Jeff Massung wrote:

> - Next, determine text vs. binary. This is usually done by just grabbing the
> first N (where N is ~1000) bytes and look for any that are < 10 or > 127. If
> you find any, it's binary - or unicode.

This only holds if the text is 7-bit encoded, which is very rare these days.
(Even then it is not entirely true: byte values 0 to 9 are valid ASCII
control characters, and 9, the tab, is common in text files.) The default
text encodings on Mac Classic (MacRoman) and Windows (Codepage 1252 in the US
and Western Europe) are both 8-bit, so the above test would only work if the
text contained no accented or other non-ASCII characters.
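
To make the failure concrete, here is a minimal sketch of the quoted test in
Python (used purely for illustration). The function name and sample strings
are assumptions made up for this example, not part of anyone's actual code.

```python
def looks_binary_naive(data: bytes, sample_size: int = 1000) -> bool:
    """Hypothetical helper: the quoted test, applied to the first N bytes.

    Reports 'binary' as soon as any byte is < 10 or > 127.
    """
    return any(b < 10 or b > 127 for b in data[:sample_size])


# Plain 7-bit text with no control characters passes the test.
plain = "cafe".encode("ascii")

# A tab is byte 9, so ordinary tab-separated text is flagged as binary.
tabbed = "name\tvalue".encode("ascii")

# One accented character in an 8-bit encoding is enough to trip the test.
accented_cp1252 = "café".encode("cp1252")
accented_macroman = "café".encode("mac_roman")

for label, sample in [
    ("plain", plain),
    ("tabbed", tabbed),
    ("cp1252", accented_cp1252),
    ("macroman", accented_macroman),
]:
    print(label, looks_binary_naive(sample))
```

Only the first sample is reported as text; the other three are ordinary text
files that the test misclassifies.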

> Remember that while UTF8 is not ASCII, it's designed to be indistinguishable
> from ASCII most of the time. I don't have any advice to give you here on how
> to determine if the file is unicode text or not... as I understand it this
> is really a difficult problem to solve. I'm sure Google can help, though.
> ;-)

UTF-8 is designed to be indistinguishable from 7-bit ASCII for characters
0 - 127, which are encoded identically in both. However, byte values in the
range 128 - 255 are used very differently: in the Windows codepages and
MacRoman each such byte is a single character, whereas in UTF-8 they occur
only as parts of multi-byte sequences.
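
The difference in the 128 - 255 range can be seen by encoding one accented
character three ways. This is a rough Python sketch; `is_valid_utf8` is a
hypothetical helper written for this example, and a strict decode is only one
possible way to check for UTF-8.

```python
text = "é"

utf8_bytes = text.encode("utf-8")          # two bytes: 0xC3 0xA9
cp1252_bytes = text.encode("cp1252")       # one byte:  0xE9
macroman_bytes = text.encode("mac_roman")  # one byte:  0x8E


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_bytes.hex(), is_valid_utf8(utf8_bytes))
print(cp1252_bytes.hex(), is_valid_utf8(cp1252_bytes))
print(macroman_bytes.hex(), is_valid_utf8(macroman_bytes))
```

The single-byte encodings of this character are not valid UTF-8 on their own,
which is why a strict decode is a useful hint. It is only a hint: some 8-bit
byte sequences happen to be valid UTF-8 as well, so a successful decode does
not prove the file was written as UTF-8.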

Regards

Peter 


_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
