On 2024-10-29 08:53, jbv via use-livecode wrote:
Hi list,

How to determine if a text file is UTF8 or just plain ASCII ?
In other words, how to know if one should use
  open file myfile.txt for UTF8 read
or
  open file myfile.txt for read

If it is really plain ASCII then it doesn't matter - UTF8 is a strict superset of ASCII.

All ASCII characters encode identically in both: they are codes 0-127, so they use only 7 bits (as ASCII does), and UTF-8 encodes each of them as the same single byte.

Any non-ASCII character in UTF-8 is encoded as a sequence of at least two bytes, every one of which has the high bit set (i.e. is in the range 128-255): a lead byte followed by one or more continuation bytes.
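To see this concretely, here is a quick illustration in Python (not LiveCode - just to show the actual byte values involved):

```python
# ASCII-only text: every byte is < 128, and the UTF-8 bytes
# are identical to the plain ASCII bytes.
ascii_text = "Hello, world!"
ascii_bytes = ascii_text.encode("utf-8")
assert all(b < 128 for b in ascii_bytes)
assert ascii_bytes == ascii_text.encode("ascii")

# A non-ASCII character ('é', U+00E9) becomes two bytes in UTF-8,
# both with the high bit set.
accented = "é".encode("utf-8")
print([hex(b) for b in accented])  # ['0xc3', '0xa9']
assert len(accented) == 2
assert all(b >= 128 for b in accented)
```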

If by 'ASCII' you mean 'some native encoding' like MacRoman or Windows-1252 (a superset of Latin-1), then things are a bit more tricky. Unless the text file has a byte-order mark (BOM) at the front (which is becoming much less common these days), you can only really tell by guessing.
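When a BOM is present, detection is trivial, because the leading byte sequences are fixed by the Unicode standard (EF BB BF for UTF-8, FF FE for UTF-16 little-endian, FE FF for UTF-16 big-endian). A minimal sniffer, sketched in Python:

```python
# Minimal BOM sniffer: returns the encoding implied by a leading
# byte-order mark, or None if there is no BOM. (UTF-32 BOMs are
# omitted for simplicity; the UTF-32-LE BOM also begins FF FE.)
def sniff_bom(data: bytes):
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    return None

print(sniff_bom("hi".encode("utf-8-sig")))  # utf-8-sig
print(sniff_bom(b"plain ascii"))            # None
```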

The simplest guess is to see if it 'round-trips' as UTF-8: if it does, then it is almost certainly UTF-8. If it does not, then it is either another Unicode encoding (e.g. UTF-16, which is often found on Windows) or some other 8-bit encoding - typically MacRoman on Mac and Windows-1252 on Windows. But that is only typical: there are hundreds of region-specific encodings, so in general it depends on where the file came from and the locale of the computer it was created on. (With Unicode this is not really an issue for new files; it's more a legacy problem.)

So if you are faced with text files that may be either in the 'platform native' encoding (as LiveCode sees it) or UTF-8 without a BOM:

  local tBinText, tText
  put url ("binfile:myfile.txt") into tBinText
  put textDecode(tBinText, "utf-8") into tText
  if textEncode(tText, "utf-8") is not tBinText then
    -- If tText does not encode back to utf-8 identically, then there are
    -- invalid utf-8 byte sequences in it, which means it is either a
    -- corrupted utf-8 file (unlikely) or not utf-8
    put textDecode(tBinText, "native") into tText
  else
    -- If the first char is the unicode 'zero width no-break space' then that
    -- was a BOM which we don't want (the logic here is that that char makes
    -- no sense at the start of a file, so is reserved in that specific case
    -- to be used as a marker for unicode encoding)
    if codeunit 1 of tText is numToCodepoint(0xFEFF) then
      delete codeunit 1 of tText
    end if
  end if

  -- Perform the general EOL conversion the engine would do reading text
  replace crlf with return in tText
  replace numToCodepoint(13) with return in tText

I'd estimate this is probably 99% reliable: for a native-encoded file to *also* be valid UTF-8 is quite unlikely, since that would require some very strange sequences of non-ASCII characters (which in real text tend to be surrounded by ASCII - e.g. accented letters, maths symbols, indices, quote variants).

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Build Amazing Things

_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com