I started this post on the DEV-LIST. Mark Waddingham kindly responded and smartly suggested I move it to the USE-LIST, so that is what I am doing. I have also pasted Mark's reply below my original post.

---------------------- ORIGINAL POST ----------------------------------------

I have a LiveCode Script (LCS) routine that attempts to follow common industry algorithms for guessing the encoding of a text file.

Its performance can be slower than I would like.

This has led me to wonder if a LiveCode Builder (LCB) library may be the route to go. Does anyone know the OSX and/or Windows APIs for guessing a text file's encoding?

I have done a number of Google searches, but I am not a C programmer (not in many decades), and wading through the huge doc sets at MSDN or Apple is daunting.

I found a reference to a Windows API:

BOOL IsTextUnicode( const VOID *lpv, int iSize, LPINT lpiResult );

This suggests to me that such APIs may exist. Does anyone who is better at finding OS APIs know where to find them? Can you point me to the right online documentation?

I also found this: https://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding

Of course, it would be wonderful if the mothership delivered this. At one point, back around LC7 or so, Frasier said he would.

https://quality.livecode.com/show_bug.cgi?id=14474

It seems an LCB library that uses OS APIs to return a best guess for a file's encoding, one that matches up with the encodings accepted by the textEncode/textDecode functions, would be a great addition to LC:

 * "ASCII"
 * "UTF-16"
 * "UTF-16BE"
 * "UTF-16LE"
 * "UTF-32"
 * "UTF-32BE"
 * "UTF-32LE"
 * "UTF-8"
 * "CP1252"
 * "ISO-8859-1"
 * "MacRoman"

and, I suppose, "Binary" as the default if none of the above can be detected.
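As a rough illustration of the cheapest first step in this kind of guessing, here is a minimal Python sketch (not LCS or LCB, and not the routine mentioned above) that checks for a byte order mark and returns one of the encoding names from the list. The function name `guess_from_bom` is made up for this example. A file without a BOM returns None, meaning the slower content-based checks would still be needed.

```python
import codecs

# BOM signatures, longest first: the UTF-32LE BOM (FF FE 00 00) begins
# with the UTF-16LE BOM (FF FE), so UTF-32 must be tested before UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def guess_from_bom(data: bytes):
    """Return an encoding name if data starts with a known BOM, else None.

    Hypothetical helper for illustration only.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

One known weakness of this ordering: UTF-16LE text whose first character after the BOM is U+0000 would be reported as UTF-32LE. That should be rare in real text files.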

----------------- MARK'S REPLY ----------------------------------------
On 2019-09-13 16:44, Paul Dupuis wrote:
> I have a LiveCode Script (LCS) routine that attempts to follow
> industry common algorithms for guessing the encoding of a text file.
>
> It's performance can be slower than I would like.

If you share your code perhaps we can help speed it up...

> This has led me to wonder in a LiveCode Builder (LCB) library may be
> the route to go. Does anyone know the OSX and/or Windows APIs for
> guessing a text file's encoding?
>
> I have done a number of google searches, but I am not a C programmer
> (not in many decades) and wading through the huge doc sets at MSDN or
> Apple is daunting.
>
> I found reference to a windows API:
>
> BOOL IsTextUnicode(
>   const VOID *lpv,
>   int        iSize,
>   LPINT      lpiResult
> );
>
>  Which suggests to me that such APIs may exists. Does anyone who is
> better at finding OS APIs know where to find such APIs? Can you point
> me to the right online documentation?

Libraries certainly exist: Mozilla has a 'universal charset detector library' for example, which appears to use various statistical heuristics to tell between all kinds of encodings.

The 'IsTextUnicode' API seems to just tell you whether a sequence of bytes is likely to be UTF-16 or not, so it probably won't be all that helpful if you need to distinguish between more than that.
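To give a feel for the kind of statistical test involved, here is a small Python sketch of one possible UTF-16 heuristic. It is not the algorithm IsTextUnicode uses; it only assumes that mostly-Latin text encoded as UTF-16 has a zero in the high byte of most code units. The function name and the 0.4 threshold are invented for illustration.

```python
def guess_utf16_byte_order(data: bytes, threshold: float = 0.4):
    """Guess 'UTF-16LE' or 'UTF-16BE' from where the zero bytes fall.

    Hypothetical heuristic: it only works for text dominated by
    characters below U+0100 (for example English or Western European).
    """
    if len(data) < 2 or len(data) % 2:
        return None  # UTF-16 data always has an even byte count
    units = len(data) // 2
    # Fraction of code units whose high byte is zero under each byte order.
    high_zero_le = data[1::2].count(0) / units  # odd offsets are high bytes in LE
    high_zero_be = data[0::2].count(0) / units  # even offsets are high bytes in BE
    if high_zero_le >= threshold and high_zero_be < threshold:
        return "UTF-16LE"
    if high_zero_be >= threshold and high_zero_le < threshold:
        return "UTF-16BE"
    return None
```

Text in scripts such as Chinese or Japanese has few zero high bytes, so this test returns None for it. That is part of why UTF-16 without a BOM is harder to detect reliably.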

Do you have a list of encodings you need to guess between? That will generally influence how fast (and accurate) you can make such a function (it's almost trivial to detect UTF-8 with a high degree of confidence, UTF-32 I think as well, UTF-16 is somewhat harder, and distinguishing between single-byte and legacy multi-byte charsets is, relatively speaking, very hard).
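The "almost trivial" cases can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anyone's actual implementation: UTF-8 is accepted when the bytes pass strict validation, and UTF-32 when every 4-byte group is a valid Unicode scalar value. The function names are made up for the example. Anything that fails both checks returns None and would need the harder heuristics.

```python
def _looks_like_utf32(data: bytes, byteorder: str) -> bool:
    """True if every 4-byte group is a valid code point in this byte order."""
    if not data or len(data) % 4:
        return False
    for i in range(0, len(data), 4):
        cp = int.from_bytes(data[i:i + 4], byteorder)
        # Reject values beyond U+10FFFF and the surrogate range.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return False
    return True


def guess_easy_encoding(data: bytes):
    """Return 'UTF-32LE', 'UTF-32BE', 'ASCII', 'UTF-8', or None.

    Hypothetical helper. UTF-32 is tested first because UTF-32 encoded
    ASCII text consists only of bytes below 0x80 (mostly NULs) and would
    otherwise be reported as ASCII.
    """
    if _looks_like_utf32(data, "little"):
        return "UTF-32LE"
    if _looks_like_utf32(data, "big"):
        return "UTF-32BE"
    if all(b < 0x80 for b in data):
        return "ASCII"
    try:
        data.decode("utf-8")  # strict mode raises on any malformed sequence
    except UnicodeDecodeError:
        return None
    return "UTF-8"
```

Random single-byte text (CP1252, ISO-8859-1, MacRoman) very rarely forms valid multi-byte UTF-8 sequences, which is why a successful strict decode gives high confidence. Telling those single-byte encodings apart from each other is the hard part and is not attempted here.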

Warmest Regards,

Mark.

P.S. This might be a better discussion to have on the use-list, unless there is a reason not to; it might be of interest to others in that wider group.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps

_______________________________________________
livecode-dev mailing list
livecode-...@lists.runrev.com
http://lists.runrev.com/mailman/listinfo/livecode-dev
