On 2017-09-04 17:55, Charles Mills wrote:
I don't understand the problem.
That's correct.
Yes, ü is two bytes (not characters as you wrote!) in UTF-8.
You're correct again.
But if the translation is working correctly and the code page is specified
correctly it should become one byte in EBCDIC, and assuming the report
program treats it as a literal of some sort -- does not expect to deduce
meaning from each byte -- it should be perfectly happy with S?d (pretending
? is an EBCDIC ü) as a district or whatever name. The report columns should
be correct, and it should come back to UTF-8 land as ü, with the proper
number of padding blanks.
It sounds like you are incorrectly translating ü to *two* EBCDIC characters,
and that is the root of your problem. See if you can't translate to an
EBCDIC code page that includes ü.
I can probably find a pair of code pages that correctly translates the two-byte
UTF-8 "ü" to a one-byte EBCDIC "ü", but how would that same pair translate the
Polish "ł", the Danish "ø", the Baltic "ė", and the Greek "Θ", which all appear
in the same PC-side file, to one single character each, and then back to the
correct UTF-8 character?
Does that make the problem more understandable?
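To illustrate the code page point, here is a minimal Python sketch. It only uses the EBCDIC codecs that happen to ship with Python (cp037, cp500, cp875, cp1026), which is an assumption of convenience: they are not necessarily the tables FTP or IND$FILE would use on z/OS, and `unencodable` is a hypothetical helper name. It lists, per code page, which of the five sample characters have no single-byte representation.

```python
# Sample characters from the PC-side file: German, Polish, Danish, Baltic, Greek.
SAMPLE = "üłøėΘ"

# EBCDIC codecs bundled with Python (assumption: stand-ins for the z/OS tables).
CODE_PAGES = ["cp037", "cp500", "cp875", "cp1026"]


def unencodable(text, code_page):
    """Return the characters of text that code_page cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(code_page)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


for cp in CODE_PAGES:
    print(cp, "cannot encode:", unencodable(SAMPLE, cp))
```

With these four codecs every code page misses at least one of the characters, so a single-byte translation cannot round-trip the whole file, even though "ü" alone survives a trip through cp500 as one byte.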
Robert
Charles
-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[email protected]] On Behalf Of Robert Prins
Sent: Monday, September 4, 2017 12:34 PM
To: [email protected]
Subject: UTF-8 woes on z/OS, a solution - comments invited
OK, I solved the problem, but maybe someone here can come up with something
a bit more efficient...
There is a file in the non-z/OS world, that used to be pure ASCII (actually
CP437/850), but that has now been converted to UTF-8, due to further
internationalisation requirements. Said file was uploaded to z/OS, processed
into a set of datasets containing various reports, and those reports were
later downloaded to the non-z/OS world, using the same process that was used
to upload them, which could be one of two, IND$FILE, or FTP.
Both FTP and IND$FILE uploads had (and still have) no problems with
CP437/850/UTF-8 data, and although an ü might not have displayed as such on
z/OS, it would have transferred back to the same ü. However, an ü in UTF-8
now consists of two characters, and that means that, replacing spaces with
'=' characters, the original
|=Süd====| |=Nord===|
report lines now come out as
|=Süd===| |=Nord===|
when opened in the non-z/OS world with a UTF-8-aware application.
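The effect can be reproduced off the mainframe with a small Python sketch. It assumes the report program pads each field to a fixed number of bytes (8 here, not counting the "|" delimiters); `pad_to_bytes` and `repad_to_chars` are hypothetical names for illustration, not the actual report code or the solution mentioned above.

```python
def pad_to_bytes(text, width, encoding="utf-8"):
    """Pad with blanks to a fixed BYTE width, as a byte-oriented program would."""
    raw = text.encode(encoding)
    return raw + b" " * max(0, width - len(raw))


def repad_to_chars(field, width, encoding="utf-8"):
    """Decode a byte-padded field and pad it again to a fixed CHARACTER width."""
    return field.decode(encoding).rstrip(" ").ljust(width)


sued = pad_to_bytes(" Süd", 8)   # 5 bytes of data, so only 3 padding blanks
nord = pad_to_bytes(" Nord", 8)  # 5 bytes of data, 3 padding blanks

# Byte-padded fields, viewed by a UTF-8 aware application:
print("|" + sued.decode("utf-8").replace(" ", "=") + "|")   # |=Süd===|
print("|" + nord.decode("utf-8").replace(" ", "=") + "|")   # |=Nord===|

# The same field padded again by character count:
print("|" + repad_to_chars(sued, 8).replace(" ", "=") + "|")  # |=Süd====|
```

Note that this counts code points, which is enough for precomposed characters such as "ü" but would still misalign text that uses combining marks or double-width characters.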
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
--
Robert AH Prins
robert(a)prino(d)org