On 2017-09-04 17:55, Charles Mills wrote:
I don't understand the problem.

That's correct.

Yes, ü is two bytes (not characters as you wrote!) in UTF-8.

You're correct again.

But if the translation is working correctly and the code page is specified correctly it should become one byte in EBCDIC, and assuming the report program treats it as a literal of some sort -- does not expect to deduce meaning from each byte -- it should be perfectly happy with S?d (pretending ? is an EBCDIC ü) as a district or whatever name. The report columns should be correct, and it should come back to UTF-8 land as ü, with the proper number of padding blanks.

It sounds like you are incorrectly translating ü to *two* EBCDIC characters,
and that is the root of your problem. See if you can't translate to an
EBCDIC code page that includes ü.
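As a rough illustration of the round trip described above, here is a minimal Python sketch. It assumes cp037 as the EBCDIC code page and a 7-byte report field; both are placeholders chosen for the example, not the values used by the actual FTP or IND$FILE transfer.

```python
# Sketch only: cp037 is an assumed EBCDIC code page, and FIELD_WIDTH is a
# hypothetical column width; the real transfer settings may differ.
EBCDIC = "cp037"
FIELD_WIDTH = 7


def to_host(utf8_bytes: bytes) -> bytes:
    """Translate UTF-8 from the PC side into single-byte EBCDIC."""
    return utf8_bytes.decode("utf-8").encode(EBCDIC)


def pad_field(host_bytes: bytes) -> bytes:
    """Pad with EBCDIC blanks (0x40), as a byte-oriented report would."""
    return host_bytes.ljust(FIELD_WIDTH, b"\x40")


def to_pc(host_bytes: bytes) -> bytes:
    """Translate the EBCDIC report field back into UTF-8."""
    return host_bytes.decode(EBCDIC).encode("utf-8")


pc_name = "Süd".encode("utf-8")      # 4 bytes: ü takes two bytes in UTF-8
host_name = to_host(pc_name)         # 3 bytes: ü is one byte in cp037
field = pad_field(host_name)         # 3 bytes of name + 4 EBCDIC blanks
back = to_pc(field).decode("utf-8")  # "Süd" followed by 4 spaces
```

With a code page that contains ü, the name occupies three bytes on the host, so byte-based padding yields the intended number of blanks after the trip back.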

I can probably find a pair of code pages that correctly translates the two-byte UTF-8 "ü" to a one-byte EBCDIC "ü", but how would that same pair translate the Polish "ł", the Danish "ø", the Baltic "ė", and the Greek "Θ", all of which appear in the same PC-side file, each to a single EBCDIC byte, and then back to the correct UTF-8 character?

Does that make the problem more understandable?
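The coverage problem can be checked with a short Python sketch. The list of EBCDIC code pages below is only a sample of those that ship with Python's standard codecs, picked for illustration; it is not the set available to FTP or IND$FILE on any given system.

```python
# Sample characters taken from the question above.
SAMPLE = "üłøėΘ"

# Assumed candidates: EBCDIC code pages that Python ships codecs for.
CANDIDATES = ["cp037", "cp500", "cp875", "cp1026"]


def unencodable(text: str, codec: str) -> list:
    """Return the characters of text that the given code page cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


for codec in CANDIDATES:
    print(codec, unencodable(SAMPLE, codec))
```

Each of these single-byte code pages leaves at least one of the sample characters unencodable, which is the point of the question: no one code page in this sample covers all five.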

Robert

Charles


-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[email protected]] On Behalf Of Robert Prins
Sent: Monday, September 4, 2017 12:34 PM
To: [email protected]
Subject: UTF-8 woes on z/OS, a solution - comments invited

OK, I solved the problem, but maybe someone here can come up with something
a bit more efficient...

There is a file in the non-z/OS world that used to be pure ASCII (actually CP437/850), but that has now been converted to UTF-8 because of further internationalisation requirements. The file was uploaded to z/OS and processed into a set of datasets containing various reports, and those reports were later downloaded to the non-z/OS world using the same method that was used for the upload, either IND$FILE or FTP.

Both FTP and IND$FILE uploads had (and still have) no problems with CP437/850/UTF-8 data, and although an ü might not have displayed as such on z/OS, it would have transferred back to the same ü. However, an ü in UTF-8 now consists of two characters, and that means that, replacing spaces with '=' characters, the original

|=Süd====| |=Nord===|

report lines now come out as

|=Süd===| |=Nord===|

when opened in the non-z/OS world with a UTF-8-aware application.
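The misalignment can be reproduced off the mainframe with a small Python sketch. The function names and the 7-position field are made up for the illustration, and the character-based variant is only one possible way to pad, not necessarily the solution referred to above.

```python
WIDTH = 7  # hypothetical field width, matching the example lines above


def pad_by_bytes(name: str, width: int) -> str:
    """Pad as a byte-oriented program would: count UTF-8 bytes."""
    raw = name.encode("utf-8")
    return (raw + b" " * (width - len(raw))).decode("utf-8")


def pad_by_chars(name: str, width: int) -> str:
    """Pad by counting characters (code points) instead of bytes."""
    return name.ljust(width)


def report_line(name: str, pad) -> str:
    """Build one report cell, showing blanks as '=' as in the example."""
    return "|" + (" " + pad(name, WIDTH)).replace(" ", "=") + "|"


print(report_line("Süd", pad_by_bytes))   # |=Süd===|   one blank short
print(report_line("Süd", pad_by_chars))   # |=Süd====|
print(report_line("Nord", pad_by_bytes))  # |=Nord===|
```

Counting code points is still an approximation: text that uses combining characters would need a display-width calculation rather than `len()`.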

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN



--
Robert AH Prins
robert(a)prino(d)org
