On Sat, 2021-12-18 at 20:40 -0500, Tom Horsley wrote:
> Just a (possibly) relevant note: I've seen many html pages with
> headers claiming they are UTF-8, but text that only displays
> correctly if you treat them as one of the windows code pages.
>
> Worse yet, some browsers have heuristics to det
On 19/12/2021 09:50, Michael D. Setzer II wrote:
%10.10s and
%20.20s both would cause the problem.
I believe those are both printf format specifiers, which is why I was wondering
if converting to plain text would be better,
because those would be removed (dealt with) during the conversion.
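
To illustrate one way a %10.10s field could cause this kind of damage, here is a minimal sketch. It assumes the formatting is done by a byte-oriented printf (as in C, where the precision of %s counts bytes, not characters); the thread does not say which tool is actually used, and some tools count characters in a UTF-8 locale. The helper names and the sample string are made up for illustration.

```python
def truncate_bytes(text: str, limit: int) -> bytes:
    """Mimic a byte-oriented %.Ns: keep only the first `limit` bytes."""
    return text.encode("utf-8")[:limit]


def truncate_chars(text: str, limit: int) -> bytes:
    """Cut on character boundaries first, then encode to UTF-8."""
    return text[:limit].encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample: 10 characters, but 12 bytes in UTF-8,
# because each "ñ" takes two bytes.
sample = "Señor Peña"

by_bytes = truncate_bytes(sample, 10)   # cut lands inside the second "ñ"
by_chars = truncate_chars(sample, 10)   # keeps whole characters

print(by_bytes, is_valid_utf8(by_bytes))
print(by_chars, is_valid_utf8(by_chars))
```

If the cut falls inside a multi-byte character, the output ends with a dangling lead byte, which is the sort of thing that can make `file` stop reporting the result as UTF-8.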
--
Did
On 19 Dec 2021 at 9:14, Ed Greshko wrote:
From: Ed Greshko
Date sent: Sun, 19 Dec 2021 09:14:37 +0800
Subject: Re: Having strange result on processing UTF-8 file
To: "Michael D. Setzer II",
Community support for Fedora users
Send reply to: Community s
On Sun, 19 Dec 2021 09:14:37 +0800
Ed Greshko wrote:
> Does it contain html?
Just a (possibly) relevant note: I've seen many html pages with headers
claiming they are UTF-8, but text that only displays correctly if you
treat them as one of the windows code pages.
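
A rough sketch of how one might cope with such pages: try strict UTF-8 first and only fall back to a Windows code page if that fails. The choice of cp1252 as the fallback is an assumption (the pages could use a different code page), and the function name is made up for illustration.

```python
from typing import Tuple


def decode_with_fallback(raw: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Decode `raw` as UTF-8 if possible, otherwise as `fallback`.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined, so replace those
        # rather than raising a second error.
        return raw.decode(fallback, errors="replace"), fallback


# Genuine UTF-8 decodes as UTF-8 ...
print(decode_with_fallback("café".encode("utf-8")))
# ... while a lone 0xE9 byte is not valid UTF-8 and falls back to cp1252.
print(decode_with_fallback(b"caf\xe9"))
```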
Worse yet, some browsers have he
On 19/12/2021 08:31, Michael D. Setzer II wrote:
But that could change if they add or remove records; there are currently 633. Some
lines in the file are over 25000 characters long. Total download is about 13M.
The actual lines I need for the data are just 256K, so it has lots of junk
(stuff I don't nee
On 19 Dec 2021 at 7:54, Ed Greshko wrote:
From: Ed Greshko
Date sent: Sun, 19 Dec 2021 07:54:31 +0800
Subject: Re: Having strange result on processing UTF-8 file
To: users@lists.fedoraproject.org
Send reply to: Community
On 19/12/2021 02:15, Michael D. Setzer II via users wrote:
Download 64 web pages into a single file using wget2. That is fine.
One more thing.
The single file you get is an html formatted file, yes? For the results that
you want, and how you want to
use it, do you really want html? If n
On 19/12/2021 02:15, Michael D. Setzer II via users wrote:
$ ./findnoascii2 allraw.uog
Think this is the issue, but no idea how to fix it.
$ file allraw.uog.out
allraw.uog.out: Non-ISO extended-ASCII text
I assume findnoascii2 is written by you? Without knowing what it does (source),
I think
I've spent a number of hours trying all kinds of things I've
found on the web, but I'm not getting anywhere. Probably
something simple.
Download 64 web pages into a single file using wget2.
That is fine.
file allraw.uog
allraw.uog: HTML document, UTF-8 Unicode text, with very long lines
File is about 13M