Re: Guessing the encoding of a test file...

Paul Dupuis via use-livecode Fri, 20 Mar 2020 08:37:16 -0700

To Sean and Bob,

Thank you for your replies. I may not have been clear enough in my original post:

We make and sell an app for macOS and Windows. It is used around the world by researchers (not a lot of them, as it is a niche product) on their computers. The research application allows input of data from text files. Those text files come from the various sources those researchers have. It would negatively impact our competitiveness in our market if we forced users to convert all their data to some specific text encoding, so we need to try to "guess" the encoding of those text files.

There are many published algorithms for doing this, and we had a past contractor of ours take a "best practice" algorithm and create an LCS "guessEncoding" function. This replaced a previous guessEncoding function we had from Richard Gaskin, which, while quite good, did not cover as many test cases as the newer, more robust one.
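
For readers unfamiliar with this kind of heuristic, here is a minimal sketch in Python (not LCS, and not the handler described above). It assumes a common ordering: check for a byte-order mark, then try strict UTF-8, then fall back to single-byte encodings. The function name guess_encoding and the choice of MacRoman and CP1252 as the only fallbacks are assumptions made for illustration. The fallback relies on the fact that five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined in CP1252 but are ordinary letters in MacRoman.

```python
import codecs

# Byte-order marks, longest first, so the UTF-32 LE mark (FF FE 00 00)
# is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes) -> str:
    """Return a best-guess Python codec name for raw file bytes (toy heuristic)."""
    # 1. A byte-order mark is the strongest signal available.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # 2. Text that decodes as strict UTF-8 is very likely UTF-8 (or plain ASCII).
    try:
        data.decode("utf-8")
        return "ascii" if all(b < 0x80 for b in data) else "utf-8"
    except UnicodeDecodeError:
        pass

    # 3. Single-byte fallback: CP1252 leaves a few bytes undefined, so a strict
    #    decode failure rules it out and MacRoman is assumed instead.
    try:
        data.decode("cp1252")
        return "cp1252"
    except UnicodeDecodeError:
        return "mac_roman"
```

Step 3 is deliberately crude: a MacRoman file that happens to avoid those five bytes will be reported as CP1252, which is exactly the sort of edge case a more robust algorithm has to score statistically.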

My main question to the list was: Has anyone out there ALSO written a guessEncoding function they might like to share or license?

Why did I ask this? Because I am interested in comparing the accuracy of our current handler to any other that may be available. Users being users, we recently had a user reveal a bug (a misnamed variable) in our current function that meant it was missing certain edge cases (and this user has hundreds of text files that need this edge case to be properly recognized as MacRoman encoding). That bug has been fixed, but I am still interested in comparing any other guessEncoding routines to our current one to see if we can do better than we currently are.


To Mark,

As always, thanks for reading and responding, Mark. We're actually doing what you suggest. We had a set of QA test cases (text files in many different line endings and encodings), some intended to fail (such as Windows code pages we don't support). We're expanding these and doing a review on macOS and Windows with our app. For the ones that fail that we think shouldn't, we will step through the code to see why they fail and whether our algorithm can be further enhanced. I can't foresee any algorithm tweaks that we couldn't code ourselves and would need LC or use-list assistance for.
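
As an illustration of that kind of QA pass, here is a small Python sketch of a harness that runs any guesser over labelled byte samples and reports the mismatches. Everything in it is hypothetical: the names run_cases, utf8_or_reject and CASES, the sample data, and the convention that an expected value of None marks a file that is supposed to be rejected. The stand-in guesser only knows UTF-8, so the MacRoman case is expected to show up as a failure.

```python
from typing import Callable, List, Optional, Tuple

# (label, raw bytes, expected encoding). None means "should be rejected".
Case = Tuple[str, bytes, Optional[str]]


def run_cases(guesser: Callable[[bytes], str], cases: List[Case]):
    """Return (label, expected, got) for every case the guesser gets wrong."""
    failures = []
    for label, data, expected in cases:
        try:
            got: Optional[str] = guesser(data)
        except ValueError:
            got = None  # the guesser declined to name an encoding
        if got != expected:
            failures.append((label, expected, got))
    return failures


def utf8_or_reject(data: bytes) -> str:
    """Stand-in guesser for the demo: accepts strict UTF-8, rejects all else."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        raise ValueError("unsupported encoding")
    return "utf-8"


# Sample files with different line endings and encodings, built in memory.
CASES: List[Case] = [
    ("utf8-lf", "naïve\n".encode("utf-8"), "utf-8"),
    ("utf8-crlf", "naïve\r\n".encode("utf-8"), "utf-8"),
    ("macroman-cr", "naïve\r".encode("mac_roman"), "mac_roman"),
    ("cp1251-unsupported", "привет".encode("cp1251"), None),
]

if __name__ == "__main__":
    for label, expected, got in run_cases(utf8_or_reject, CASES):
        print(f"{label}: expected {expected}, got {got}")
```

Running two different guessers over the same CASES list and comparing the failure lists is one simple way to compare their accuracy.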

Back around LiveCode 7, Fraser said, in response to some correspondence I had with him, that he would consider creating a "guessEncoding" to go along with the Unicode Everywhere work and the new textEncode/textDecode functions. I do understand the reluctance, as a business, to do so, as inevitably there will be some instances where it guesses wrong. Other than LC adding a guessEncoding function using some open source library, I would say the area where LC could be the most help would be with this enhancement: https://quality.livecode.com/show_bug.cgi?id=22391

I am under the, perhaps false, impression that isoToMac and macToIso are sort of viewed as functions that may become deprecated and no longer updated in the future. However, they are still essential for us until I can textDecode(someData,"MacRoman") on a Windows system and vice versa.



