Hi Keith,

This might need input from the mothership, but I think if you've obtained the text from the browser widget's htmlText, it will probably be in the special 'internal' format. I'm not entirely sure what happens when you save that as text - I suspect it depends on the platform.

So for clarity (if you have the opportunity to re-save this material; and if it won't confuse things because existing files are in one format, and new ones another) it would probably be best to textEncode it into UTF-8, then save it as binfile. That way the files on disk should be UTF-8, which is something like a standard.

What I tend to do in this situation, where I have text files and I'm not sure what the format is (and I spend quite a lot of time messing with text files from various sources, some unknown and many not under my control), is use a good text editor to investigate the file. I use BBEdit on Mac; I'm not sure what suitable alternatives would be on Windows or Linux. BBEdit makes a guess when it opens the file, but allows you to try re-opening in different encodings, and then warns you if there are byte sequences that don't make sense with that encoding. So by doing this I can often figure out what the encoding of the file is - once you've got that, you're off to the races.
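The same "try an encoding and see whether the bytes make sense" check can be sketched in a few lines. This is Python, used purely for illustration; the function name and the candidate list are assumptions, not anything from BBEdit or LiveCode. Note that single-byte encodings accept almost any byte sequence, so a clean UTF-8 decode is the strongest signal; passing the other checks only means the encoding has not been ruled out.

```python
def plausible_encodings(data, candidates=("ascii", "utf-8", "cp1252", "mac_roman")):
    """Return the candidate encodings under which `data` decodes without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)  # strict decoding: raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok

# Sample bytes (made up for the example).
utf8_bytes = "caf\u00e9 \u2019".encode("utf-8")  # "café ’" stored as UTF-8
legacy_bytes = b"caf\xe9"                        # "café" stored in a single-byte encoding

print(plausible_encodings(utf8_bytes))    # UTF-8 is among the survivors
print(plausible_encodings(legacy_bytes))  # UTF-8 is ruled out here
```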

But if you have the opportunity to re-collect the whole set, then I *think* the above formula of textEncoding from LC's internal format to UTF-8, then saving as binary file; and reversing the process when you load them back in to process; and then doing the same again - possibly to a different format - when you output the CSV, should see you clear.
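A minimal sketch of that save-and-reload round trip, in Python for illustration only. In LiveCode the corresponding steps would be textEncode before writing to a binfile: URL and textDecode after reading one back. The helper names, file name and sample text below are made up for the example.

```python
import os
import tempfile

def save_text(path, text):
    """Encode to UTF-8 and write the raw bytes, with no platform text translation."""
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))

def load_text(path):
    """Read the raw bytes back and decode them from UTF-8."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8")

page = "<p>It\u2019s a caf\u00e9</p>"  # contains a curly apostrophe and an accented e
path = os.path.join(tempfile.mkdtemp(), "page.html")

save_text(path, page)
restored = load_text(path)
```

Because both directions go through bytes explicitly, the file on disk is UTF-8 regardless of the platform's native text encoding.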

HTH,

Ben


On 17/05/2021 15:58, Keith Clarke via use-livecode wrote:
Thanks Ben, that’s really interesting. It never occurred to me that these html 
files might be anything other than simple plain text files, as I’d worked with in 
Coda, etc., for years.

The local HTML files are storage of the HTML text pulled from the LiveCode 
browser widget, saved using the URL ‘file:’ option. I’d been working ‘live’ 
from the Browser widget’s html text until recently, when I’ve introduced these 
local files to split page ‘crawling’ and analysis activities without needing a 
database.

Reading the files back into LiveCode with the URL ‘file:’ option works quite happily with no 
text anomalies when put into a field to read. The problem seems to arise when I load the HTML 
text into a variable and then start to extract elements using LiveCode's text chunking. For 
example pulling the text between the offsets of say <p> & </p> tags is when 
these character anomalies have started to pop into the strings.

A quick test on reading in the local HTML files with the URL ‘binfile:’ option 
and then textDecode(tString, “UTF-8”) seems to reduce the frequency and size of 
anomalies, but some remain. So, I’ll see if re-crawling pages and saving the 
HTML text from the browser widget as binfiles reduces this further.
Thanks & regards,
Keith

On 17 May 2021, at 12:57, Ben Rubinstein via use-livecode 
<use-livecode@lists.runrev.com> wrote:

Hi Keith,

The thing with character encoding is that you always need to know where it's 
coming from and where it's going.

Do you know how the HTML documents were obtained? Saved from a browser, fetched 
by curl, fetched by LiveCode? Or generated on disk by something else?

If it was saved from a browser or fetched by curl, then the format is most 
likely to be UTF-8. In order to see it correctly in LiveCode, you'd need to do two 
things:
        - read it in as a binary file, rather than as text (e.g. use URL "binfile://..." or "open file ... for binary read")
        - convert it to the internal text format FROM UTF-8 - which means use textDecode(tString, "UTF-8"), rather than textEncode

If it was fetched by LiveCode, then it most likely arrived over the wire as 
UTF-8, but if it was saved by LiveCode as text (not binary) then it _may_ have 
got corrupted.
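One way this kind of corruption can arise is UTF-8 bytes being interpreted as if they were in the platform's native single-byte encoding, sometimes more than once. The sketch below is Python, for illustration only, and assumes the wrong encoding is MacRoman; that is a guess based on the "‚Äö" string reported in the original message, not something confirmed in this thread. The function name is made up.

```python
def misread_once(text, wrong_encoding="mac_roman"):
    """Encode as UTF-8, then wrongly decode the bytes as a single-byte encoding."""
    return text.encode("utf-8").decode(wrong_encoding)

original = "\u2019"          # a single curly apostrophe: ’
once = misread_once(original)
twice = misread_once(once)

print(once)   # ‚Äô
print(twice)  # ‚Äö√Ñ√¥  (note the "‚Äö" prefix)

# If the misreading happened exactly once, it can be reversed:
repaired = once.encode("mac_roman").decode("utf-8")
```

Each pass turns every non-ASCII character into two or three characters, so repeated passes can make strings grow very quickly.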

If you can see the text looking as you expect in LiveCode, you've solved half the 
problem. Then you need to consider where it's going: who (or what) is going to consume the 
CSV. This is the time to use textEncode, and then be sure to save it as a binary file. If 
the consumer will be something reasonably modern, then again UTF-8 is a good default. If 
it's something much older, you might need to use "CP1252" or similar.
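A rough sketch of the output side, again in Python for illustration only: build the CSV as text, then encode it explicitly to whatever the consumer expects and write the bytes as a binary file. The function name and sample rows are assumptions; the choice to replace unencodable characters with "?" is one option among several, not a recommendation from this thread.

```python
import csv
import io

def csv_bytes(rows, encoding="utf-8"):
    """Render rows as CSV text, then encode to bytes for the target consumer."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(rows)
    # errors="replace" substitutes "?" for characters the target encoding lacks.
    return buf.getvalue().encode(encoding, errors="replace")

rows = [["url", "summary"], ["https://example.com", "It\u2019s a caf\u00e9"]]

modern = csv_bytes(rows)             # UTF-8 for a reasonably modern consumer
legacy = csv_bytes(rows, "cp1252")   # single-byte output for an older consumer

# The bytes would then be written with a binary-mode open, e.g. open(path, "wb").
```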

HTH,

Ben


On 17/05/2021 09:28, Keith Clarke via use-livecode wrote:
Hi folks,
I’m using LiveCode to summarise text from HTML documents into csv summary files and 
am noticing that when I extract strings from html documents stored on disk - rather 
than visiting the sites via the browser widget & grabbing the HTML text - weird 
characters are being inserted in place of what appear to be ‘regular’ characters.
The number of characters inserted can run into the thousands per instance, 
making my csv ‘summary’ file run into gigabytes! Has anyone seen the following 
type of string before, happen to know what might be causing it and offer a fix?
‚Äö
I’ve tried deliberately setting UTF-8 on the extracted strings, with put 
textEncode(tString, "UTF-8") into tString. Currently I’m not attempting to 
force any text format on the local HTML documents.
Thanks & regards,
Keith
_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
