Dennis Longnecker wrote:

>I see there is an extended ASCII table which has accented characters; like the 
>hex A2 which is an accented lower case O.

 

>Is there such a character in the ebcdic world?   All my google searches for 
>EBCDIC to ASCII conversions aren't showing accented characters in EBCDIC.

 

As others have noted, the short answer is "Yes", but you're dipping a toe into 
a deep and fast-flowing river here. I'm going to be fairly normative here, but 
none of it is aimed at you, so don't take it personally please.


First, "ASCII" is a surprisingly vague term. Most people mean "7-bit ASCII" 
when they use the term; that would not, of course, include "extended" 
characters past x'7F', like your x'A2'.

 

There are historical encodings that include characters past x'7F'; nowadays, 
it's pretty well all Unicode and UTF-8.

 

UTF-8 is an *encoding* scheme (there are also UTF-16 and UTF-32). It's also 
variable-length: in UTF-8 a given glyph (think "thing you would normally call a 
character") can be one to four bytes. Like many such schemes it uses the high 
bits of the first byte to indicate how long a character is: that is, one-byte 
characters always have the high bit off. (Endianness, big vs. little, only 
comes into play with UTF-16 and UTF-32; UTF-8's byte order is fixed.) So your 
normal a-z and friends are Just What We've Always Called ASCII.
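A quick illustration of that variable-length behavior (Python here, only because it makes the byte patterns easy to inspect):

```python
# How many bytes each character takes in UTF-8, and how the high bits
# of the first byte encode the sequence length.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001f600"]:  # A, e-acute, a CJK char, an emoji
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s), first byte {encoded[0]:08b}")

# One-byte characters (plain 7-bit ASCII) always start 0xxxxxxx;
# two-byte sequences start 110xxxxx, three-byte 1110xxxx, four-byte 11110xxx.
```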

 

Unicode comprises 17 "planes" of 65,536 code points each. The one we mostly use 
is the Basic Multilingual Plane, or BMP. That includes the traditional 7-bit 
ASCII characters and most other common languages, including Asian languages 
(which occupy the bulk of the space).
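To make "planes" concrete: a code point's plane is just its value divided by 65,536 (a shift right by 16 bits). A small sketch:

```python
# Which Unicode plane does a character live in? Plane = code point // 0x10000.
def plane(ch: str) -> int:
    return ord(ch) >> 16

print(plane("A"))           # 0: the BMP
print(plane("\u4e2d"))      # 0: common CJK ideographs are in the BMP too
print(plane("\U0001f600"))  # 1: emoji live in the Supplementary Multilingual Plane
```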

 

UTF-8 (-16, -32) is unambiguous (modulo normalization, see below), so you can 
have a Latin A next to a Latin A with an aigu (acute accent) next to a Cyrillic 
ya next to a Chinese glyph, all in the same string.

 

EBCDIC, OTOH, is (modulo DBCS) a hard-and-fast 8-bits-per-character encoding. 
Hint: that's 256 characters. Period. So the EBCDIC approach is to say "These 
characters ARE code page x", and that information is (hopefully) stored as 
metadata. That means that a given string is (FSVO "is") an English code page, 
or a French one, or Cyrillic, or Greek. Display it using the wrong code page 
and it'll be wrong: it will display characters, but not the right ones. A 
common example: code page 1047 vs. 037, which differ in only a handful of 
positions, most visibly the square brackets, which will be wrong in the "other" 
code page (we, and probably some others, have configuration data sets that use 
those characters, and handle either to make life simpler for our users).
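You can see the bracket problem without a mainframe. Python's standard codecs happen to lack cp1047, but cp037 and cp500 (two other EBCDIC code pages) show exactly the same effect, so this sketch substitutes those:

```python
# The same character lands at a different EBCDIC code point depending on
# the code page. (Using cp037 vs. cp500 here since Python's stdlib has no
# cp1047 codec; the 1047-vs-037 mismatch behaves the same way.)
print("[".encode("cp037").hex())  # ba
print("[".encode("cp500").hex())  # 4a

# Decode with the "wrong" code page and you get *a* character,
# just not the one you wanted: X'4A' is the cent sign in cp037.
print("[".encode("cp500").decode("cp037"))
```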

 

So the challenge is to move between Unicode and EBCDIC. The good news: ICONV 
and ICU are your friends here. These are more-or-less standard utilities, 
available on many platforms; on z/OS, ICONV is in USS by default, and is 
extended beyond most implementations to support EBCDIC better. So if your input 
is UTF-8 (most likely) or "plain" ASCII (the same thing at that level; 
remember, 7-bit ASCII is a subset of UTF-8), you can convert with ICONV to a 
specific EBCDIC code page.
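The transformation iconv performs is just decode-then-re-encode. A sketch using Python's built-in codecs (cp037 as the target EBCDIC page, since cp1047 isn't in Python's stdlib; on z/OS you'd invoke iconv itself, and code page name spellings like IBM-1047 vary by platform):

```python
# UTF-8 in, EBCDIC out: roughly what `iconv -f UTF-8 -t IBM-037` does,
# sketched with Python's codec machinery.
utf8_bytes = "Hello".encode("utf-8")
ebcdic_bytes = utf8_bytes.decode("utf-8").encode("cp037")
print(ebcdic_bytes.hex())  # c885939396 -- "Hello" in EBCDIC

# And the round trip back:
assert ebcdic_bytes.decode("cp037") == "Hello"
```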

 

HTH

 

.phsiii

 

P.S. Unicode normalization: there are a few "mistakes" in Unicode, that is, 
characters that are duplicated at different code points. E.g., the omega and 
the ohm symbol. These display the same, so Unicode has the concept of 
"normalization": it's considered legitimate to convert one of these to the 
other (with a specific target; that is, YOU don't decide which one you like: 
the normalization rules say the Greek omega is the right one). There are also 
combining characters: an a with aigu can be encoded as a single character, or 
as an a plus a "combining" aigu. Normalization converges these (in this case, 
under NFC, to the single precomposed character). This is vital for comparisons: 
otherwise I send you something with the "wrong" omega or a-aigu in it and your 
searches/comparisons fail.
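Both cases are a one-liner to demonstrate with Python's unicodedata module:

```python
import unicodedata

# The ohm sign (U+2126) normalizes to the Greek capital omega (U+03A9)...
ohm, omega = "\u2126", "\u03a9"
assert ohm != omega  # distinct code points that display identically
assert unicodedata.normalize("NFC", ohm) == omega

# ...and "a + combining acute" composes to the single precomposed a-aigu.
decomposed = "a\u0301"
assert unicodedata.normalize("NFC", decomposed) == "\u00e1"
assert len(decomposed) == 2 and len("\u00e1") == 1
```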

 

It gets worse: There are even multiply-combining code points for some 
languages, such as a "d" with dots above and below. There are both "d with a 
dot below" and "d with a dot above" characters, but there is no "d with both 
dots" character. Thus there may be four ways to represent this character:

1. d + combining-dot-above + combining-dot-below (three code points)

2. d + combining-dot-below + combining-dot-above (three code points)

3. d-with-dot-above + combining-dot-below (two code points)

4. d-with-dot-below + combining-dot-above (two code points)

 

Unicode normalization will convert any of the first three sequences to the 
fourth.
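The four sequences above can be checked directly (ḋ is U+1E0B, ḍ is U+1E0D, and the combining dots above and below are U+0307 and U+0323):

```python
import unicodedata

d_above, d_below = "\u1e0b", "\u1e0d"      # precomposed d-with-dot-above / -below
dot_above, dot_below = "\u0307", "\u0323"  # the combining marks

forms = [
    "d" + dot_above + dot_below,  # 1: three code points
    "d" + dot_below + dot_above,  # 2: three code points
    d_above + dot_below,          # 3: two code points
    d_below + dot_above,          # 4: two code points
]
# All four collapse under NFC to form 4: d-with-dot-below + combining dot above.
normalized = {unicodedata.normalize("NFC", f) for f in forms}
assert normalized == {d_below + dot_above}
```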

 

Note that this is why terms like "character" get fuzzy, and "glyph" or 
"grapheme" are better (and those are subtly different, though mostly you only 
care if you're talking actual fonts). A glyph can be one or more code points 
and one or more bytes, and "character" gets used imprecisely to refer to any of 
these three concepts (glyph, code point, byte), so beware!
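The three-way ambiguity is easy to see in code: one glyph, two code points, three bytes, all plausibly called "a character":

```python
# One glyph, two code points, three UTF-8 bytes.
s = "e\u0301"  # e + combining acute; displays as a single e-aigu glyph
print(len(s))                  # 2  (code points)
print(len(s.encode("utf-8")))  # 3  (bytes)
# Counting *graphemes* (what the eye sees) needs a segmentation library;
# plain len() only ever counts code points.
```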

 


----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
