Thanks for the handy tip re validating UTF8, which I didn't know.

Can I use this opportunity to make the plea once again to support the basic encodings on any platform, rather than relying on the hated "native", i.e.
https://quality.livecode.com/show_bug.cgi?id=12205

and bearing in mind the comments of one M*rk W*dd*ingh*m in 2014:
In any case, I can't argue with the suggestion that the language parameter 
should enable at least the most common charsets to be leveraged and converted 
to/from Unicode.
...
At least for release we are aiming to have the list above working on all 
platforms. (UTF-8, UTF-16 (BE and LE), UTF-32 (BE and LE), MacRoman, ISO8859-1, 
Windows-1252, ASCII)
(comments on https://quality.livecode.com/show_bug.cgi?id=3674)

Thanks for listening!

Ben

On 29/10/2024 16:23, Mark Waddingham via use-livecode wrote:
On 2024-10-29 08:53, jbv via use-livecode wrote:
Hi list,

How to determine if a text file is UTF8 or just plain ASCII ?
In other words, how to know if one should use
  open file myfile.txt for UTF8 read
or
  open file myfile.txt for read

If it is really plain ASCII then it doesn't matter - UTF8 is a strict superset of ASCII.

All ASCII chars are encoded identically in both - they are codes 0-127 so only use 7 bits (as ASCII does).

Any non-ASCII UTF-8 char will start with a byte which has the high bit set, so will be in the range 128-255 - non-ASCII UTF-8 encoded chars are always at least two bytes, and all of those bytes have the top bit set.
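These byte-level properties can be checked directly. Here is a quick sketch in Python rather than LiveCode, since it makes the raw byte values easy to inspect:

```python
# ASCII text produces identical bytes whether encoded as ASCII or UTF-8.
ascii_text = "plain ASCII"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# A non-ASCII char (U+00E9, 'e' with acute accent) encodes to at
# least two bytes, every one of which has the top bit set (>= 128).
utf8_bytes = "\u00e9".encode("utf-8")
assert len(utf8_bytes) >= 2
assert all(b >= 128 for b in utf8_bytes)
```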

If by 'ASCII' you mean 'some native encoding' like MacRoman or Latin-1 (Windows 1252), then things are a bit more tricky. Unless the text file has a byte-order-mark (BOM) at the front (which these days are becoming much less common) you can only really tell by guessing.

The simplest guess is to see if it 'roundtrips' as UTF-8: if it does, then it is almost certainly UTF-8. If it does not, then it is either another Unicode encoding (e.g. UTF-16, which is often found on Windows) or some other encoding - typically MacRoman on Mac and Latin-1 on Windows. But that is only typical: there are hundreds of region-specific encodings, so in general it depends on where the file came from / the locale of the computer it was created on. (With Unicode this is not really an issue for new files; it's more a legacy problem.)

So if you are faced with text files that may be either the 'platform native' encoding (as LiveCode sees it) or UTF-8 without a BOM:

   local tBinText, tText
   put url ("binfile:myfile.txt") into tBinText
   put textDecode(tBinText, "utf-8") into tText
   if textEncode(tText, "utf-8") is not tBinText then
      -- If tText does not encode back to utf-8 identically, then there are
      -- invalid utf-8 byte sequences in it, which means it is either a
      -- corrupted utf-8 file (unlikely) or not utf-8
      put textDecode(tBinText, "native") into tText
   else
      -- If the first char is the unicode 'zero width no-break space' then that
      -- was a BOM which we don't want (the logic here is that that char makes
      -- no sense at the start of a file, so is reserved in that specific case
      -- to be used as a marker for unicode encoding)
      if codeunit 1 of tText is numToCodepoint(0xFEFF) then
         delete codeunit 1 of tText
      end if
   end if

   -- Perform the general EOL conversion the engine would do reading text
   replace crlf with return in tText
   replace numToCodepoint(13) with return in tText
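For readers outside LiveCode, the same two-step EOL conversion can be sketched in Python. The order matters: converting lone CRs first would turn each CRLF into two line breaks.

```python
def normalize_eols(text):
    # CRLF first, then any remaining lone CR, mirroring the two
    # replace lines above.
    text = text.replace("\r\n", "\n")
    return text.replace("\r", "\n")

# Mixed Windows, classic-Mac and Unix line endings all normalize to LF.
assert normalize_eols("a\r\nb\rc\n") == "a\nb\nc\n"
```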

I'd estimate this is probably 99% reliable - for a native-encoded file to *also* be valid UTF-8 is quite unlikely, as you'd need some very strange sequences of non-ASCII characters (which tend to always be surrounded by ASCII - e.g. accented chars, math symbols, indices, quote variants).
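The whole guess can be sketched compactly in Python, using Latin-1 as a stand-in for the platform-native encoding (an assumption for illustration; LiveCode's "native" varies by platform):

```python
def sniff_text(raw: bytes) -> str:
    """Decode file bytes as UTF-8 if they roundtrip, else fall back
    to Latin-1 (standing in for the 'native' encoding)."""
    try:
        text = raw.decode("utf-8")      # strict decode = the roundtrip test
    except UnicodeDecodeError:
        return raw.decode("latin-1")    # not valid UTF-8: assume native
    if text.startswith("\ufeff"):       # strip a leading BOM if present
        text = text[1:]
    return text

# A lone Latin-1 0xE9 byte is an invalid UTF-8 sequence, so the
# fallback path is taken:
assert sniff_text(b"caf\xe9") == "caf\u00e9"
# Valid UTF-8 with a BOM decodes, and the BOM is removed:
assert sniff_text("\ufeffcaf\u00e9".encode("utf-8")) == "caf\u00e9"
```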

Warmest Regards,

Mark.


_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
