On 19 Dec 2021 at 9:14, Ed Greshko wrote:

From:   Ed Greshko <ed.gres...@greshko.com>
Date sent:      Sun, 19 Dec 2021 09:14:37 +0800
Subject:        Re: Having strange result on processing UTF-8 file
To:     "Michael D. Setzer II" <mi...@guam.net>,
        Community support for Fedora users <users@lists.fedoraproject.org>
Send reply to:  Community support for Fedora users 
<users@lists.fedoraproject.org>

> On 19/12/2021 08:31, Michael D. Setzer II wrote:
> 
>     But could change if they add more or remove some 
>     currently 633 records. Some lines in the file are over 
>     25000 characters?? Total download is about 13M.
>     The actual lines I need for the data are just 256K, so it 
>     has lots of junk (stuff I don't need for what I'm doing).
> 
> That 13M file. Does it contain html? If so, would it be easier 
> to work with if it was converted to plain text?

Yes, they are all html pages, but some of the UTF-8 
characters don't map to a plain text character, and they 
are in the name field. Did figure out the issue: %10.10s 
and %20.20s would both cause the problem, because the 
precision counts bytes, not characters, so it can cut a 
multibyte UTF-8 character in half. I used the head command 
to pull various numbers of lines until I found where the 
file went Non-ISO extended-ASCII.
Only a few lines caused the issue, and in each one the last 
byte of the substring was above 127.

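So I added these commands to copy 30 characters from that 
point, then work back from the end and change the last 
character to null for as long as it is >127:
strncpy(linex,&line[i],30);   /* at most 30 bytes, so linex cannot overflow */
linex[30]=0;
/* cast needed: plain char may be signed, and then >127 is never true */
while(linex[0] && (unsigned char)linex[strlen(linex)-1]>127) linex[strlen(linex)-1]=0;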

Then used %s and just printed linex. 
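
The trimming loop above removes every trailing byte above 127, so it 
also drops complete non-ASCII characters that happen to sit at the 
end of the 30 bytes. A possible alternative, shown here only as a 
sketch, is to cut on a UTF-8 character boundary instead. The function 
name utf8_copy_prefix is made up for this example, and it assumes the 
input is valid UTF-8 and that dst has room for max_bytes + 1 bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the original program.
 * Copies at most max_bytes bytes of src into dst without splitting a
 * multibyte UTF-8 character. Assumes src is valid UTF-8 and that dst
 * can hold max_bytes + 1 bytes. Returns the number of bytes copied. */
size_t utf8_copy_prefix(char *dst, const char *src, size_t max_bytes)
{
    size_t len = strlen(src);

    if (len > max_bytes) {
        len = max_bytes;
        /* src[len] is the first byte that would be dropped. If it is a
         * continuation byte (bit pattern 10xxxxxx), the cut falls inside
         * a character, so back up until the cut is in front of a lead
         * byte or an ASCII byte. */
        while (len > 0 && ((unsigned char)src[len] & 0xC0) == 0x80)
            len--;
    }

    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

With something like this, utf8_copy_prefix(linex, &line[i], 30) followed 
by a plain %s would keep a trailing ñ when all of its bytes fit, and 
drop it whole when they do not.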
218544 lines in allraw.uog
    1898 lines in allraw.uog.out (lines with utf-8)

The uog.csv has 633 lines, but only these 3 (131, 276 and 344) have utf-8:
  131    27 c3b1     [ña, Ph.D.;Crisostomo-Muña;Do]
  131    51 c3b1     [ña;Doreen;Professor of Accoun]
  276    14 c3a5     [åni" Isidro;Isidro;Jaevani;Ju]
  344    18 c381     [Álvarez-Piñer, Ph.D.;Madrid ]
  344    29 c3b1     [ñer, Ph.D.;Madrid Álvarez-Pi]
  344    48 c381     [Álvarez-Piñer;Carlos;Directo]
  344    59 c3b1     [ñer;Carlos;Director / Associa]
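
A listing like the one above can be produced by scanning each line 
for bytes that start a multibyte sequence. The following is only a 
sketch of that idea, not the program that generated the listing; the 
names utf8_seq_len and next_multibyte are invented for this example, 
and it does not validate the continuation bytes.

```c
#include <stddef.h>

/* Hypothetical helper: length of a UTF-8 sequence judged from its
 * first byte. Returns 0 for a byte that does not match a lead
 * pattern (for example a continuation byte, 10xxxxxx). */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;            /* plain ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx, e.g. c3 in c3b1 */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Hypothetical helper: offset of the first multibyte lead byte at or
 * after start, or -1 if there is none. start must not be past the
 * end of the string. */
long next_multibyte(const char *s, size_t start)
{
    for (size_t i = start; s[i] != '\0'; i++)
        if (utf8_seq_len((unsigned char)s[i]) > 1)
            return (long)i;
    return -1;
}
```

Calling next_multibyte repeatedly, each time starting just past the 
previous hit, gives every offset on a line; the bytes at that offset 
can then be printed with %02x along with the following text.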

Whole web page has a lot of other utf-8 characters.

Thanks again.


> --
> Did 황준호 die?
