RE: Determining Text File Encoding

Paul H. Tarver Wed, 01 Aug 2018 12:55:55 -0700

Currently I check the first two bytes of the file. If they are 255 and 254
(0xFF 0xFE, the UTF-16 little-endian byte-order mark), I run
STRCONV(textdata,6) on the file contents and save the result to a temp file.
That seems to do the trick: I can then import the tab-delimited data from the
temp file with no problem.
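
The byte check described above can be generalized to the other common
byte-order marks. What follows is only a minimal sketch in Python (chosen
because it is easy to run outside VFP), not the code actually in use here.
The names sniff_bom and decode_text are hypothetical, and the cp1252 fallback
for files without a BOM is an assumption that would need to be agreed with
whoever sends the files.

```python
import codecs

# Checked in this order on purpose: the UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE), so it has to be tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_text(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode file contents, using the BOM when there is one.

    The fallback code page is an assumption, not something the file
    itself can tell us.
    """
    encoding, skip = sniff_bom(raw)
    if encoding is None:
        return raw.decode(fallback, errors="replace")
    return raw[skip:].decode(encoding)
```

A file without a BOM still falls through to the guessed code page, so this
only helps when the sender's software writes the mark.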


I've only run into this a few times before and my method has worked pretty well 
so far, but I thought I would run it by the group. 

Thanks!

Paul 

-----Original Message-----
From: ProfoxTech [mailto:[email protected]] On Behalf Of Fernando D. Bozzo
Sent: Wednesday, August 01, 2018 1:32 PM
To: [email protected]
Subject: Re: Determining Text File Encoding

AFAIK there is no way to determine the exact encoding of a file with
certainty. You can write a "best effort" algorithm that tries to identify
it, but even Notepad++ sometimes fails to show the correct encoding.

That's why XML, HTML and some other markup languages carry an explicit
encoding="utf-8" or charset="utf-8" declaration (or similar): the encoding
must be stated explicitly so that the contents are not misread.

In the same way, when delivering text files to someone, the encoding must be
explicitly defined and agreed between the parties so that the contents are
not misinterpreted.

UTF-16 is a little strange to me and I have never dealt with it. Isn't it
used for double-byte characters, like Chinese or similar?

One idea that comes to mind: you can ask for a header indicating the
encoding (like XML does), or even ask for a predefined string (always the
same, like "Test header - áàä", with some special characters) which you can
compare to your own copy. If the start of the file does not match your
string encoded as UTF-16, then you can assume it's UTF-8, or re-check by
comparing against the same string encoded as UTF-8.
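
The comparison described above could look roughly like this. It is a sketch
in Python under stated assumptions, not a tested VFP routine: the function
name encoding_from_sentinel and the list of candidate encodings are
hypothetical, and the sentinel is just the example string from this message.

```python
import codecs

# The agreed first line of every file: "Test header - " followed by
# a-acute, a-grave and a-umlaut (written as escapes to keep this
# source file ASCII-only).
SENTINEL = "Test header - \u00e1\u00e0\u00e4"

# Encodings to try, in order. This list is an assumption; use whatever
# the two parties have actually agreed might be sent.
CANDIDATES = ("utf-16-le", "utf-16-be", "utf-8", "cp1252")


def encoding_from_sentinel(raw: bytes, sentinel: str = SENTINEL):
    """Return the candidate whose encoding of the sentinel starts the file."""
    # Skip a leading byte-order mark, if any, before comparing.
    for bom in (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        if raw.startswith(bom):
            raw = raw[len(bom):]
            break
    for name in CANDIDATES:
        if raw.startswith(sentinel.encode(name)):
            return name
    return None  # header missing, or in an encoding not on the list
```

The accented characters are what make this work: a plain ASCII header would
be byte-for-byte identical in UTF-8 and cp1252, so it could not tell those
two apart.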


Regards.-


2018-08-01 20:00 GMT+02:00 Paul H. Tarver <[email protected]>:

> Ok, this may be a dumb question, but is there a reliable and easy way to
> detect and determine the file encoding on simple text files?
>
>
>
> I have a client sending me files with UTF-16 Little Endian encoding. I have
> some code in place to try to determine if a file is UNICODE based on the
> first two or four characters once the file is loaded to memory and then
> convert it using STRCONV, but I'm concerned that although it works, it is a
> bit of a hack and maybe there is a better way.
>
>
>
> Any thoughts?
>
>
>
> Paul

