Dennis Longnecker wrote:
>I see there is an extended ASCII table which has accented characters; like
>the hex A2 which is an accented lower case O.
>Is there such a character in the ebcdic world? All my google searches for
>EBCDIC to ASCII conversions aren't showing accented characters in EBCDIC.

As others have noted, the short answer is "Yes", but you're dipping a toe into a deep and fast-flowing river here. I'm going to be fairly normative here, but none of it is aimed at you, so please don't take it personally.

First, "ASCII" is a surprisingly vague term. Most people mean "7-bit ASCII" when they use the term; that would not, of course, include "extended" characters past x'7F', like your x'A2'. There are historical encodings that include characters past x'7F' (your x'A2' is an o-acute in PC code pages 437 and 850, but a cent sign in Latin-1); nowadays, it's pretty well all Unicode and UTF-8.

UTF-8 is an *encoding* scheme (there are also UTF-16 and UTF-32). It's also variable in length: in UTF-8, a given code point (think "thing you would normally call a character") can be one to four bytes. Like many such schemes it uses the high bits to indicate which length a character is: that is, one-byte characters always have the high bit off. (Endianness only matters for UTF-16 and UTF-32; UTF-8 is byte-oriented, and let's assume big-endian here and not get into it.) So your normal a-z and friends are Just What We've Always Called ASCII.

Unicode comprises 17 "planes" of 65,536 code points each. The one we mostly use is the Basic Multilingual Plane, or BMP. That includes the traditional 7-bit ASCII characters and most other common languages, including Asian languages (which occupy the bulk of the space). UTF-8 (-16, -32) is unambiguous (modulo normalization, see below), and so you can have a Latin A next to a Latin A with aigu next to a Cyrillic ya next to a Chinese glyph, all in the same string.

EBCDIC, OTOH, is (modulo DBCS) a hard-and-fast 8-bits-per-character encoding. Hint: that's 256 characters. Period. So the EBCDIC approach is to say "these characters ARE code page x", and that information is stored (hopefully) as metadata. That means that a given string is (FSVO "is") an English code page, or a French one, or Cyrillic, or Greek. Display it using the wrong code page and it'll be wrong: it will display characters, but not the right ones. A common example: code page 1047 vs. 037, which are identical except for the square brackets, which come out wrong in the "other" code page (we, and probably some others, have configuration data sets that use those characters, and handle either to make life simpler for our users).

So the challenge is to move between Unicode and EBCDIC. The good news: ICONV and ICU are your friends here. These are more-or-less standard utilities, available on many platforms; on z/OS, ICONV is in USS by default, and is extended beyond most implementations to support EBCDIC better. So if your input is UTF-8 (most likely) or "plain" ASCII (the same thing at that level; remember, 7-bit ASCII is a subset of UTF-8), you can convert with ICONV to a specific EBCDIC code page.
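If you want to see the one-to-four-bytes business for yourself, here's a quick sketch (Python, only because it's handy; the byte values are what any conforming UTF-8 encoder produces):

    # One code point -> one to four bytes, depending on where it lives.
    for ch in ("A",             # U+0041, 7-bit ASCII
               "\u00f3",        # U+00F3, o with acute
               "\u20ac",        # U+20AC, euro sign
               "\U0001f600"):   # U+1F600, emoji, outside the BMP
        utf8 = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(utf8)} byte(s): {utf8.hex(' ')}")

Note the one-byte case comes out as x'41' with the high bit off, exactly the rule above.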
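Same sketch style for the code-page point. One caveat: Python has no built-in 1047 codec, so cp037 vs. cp500 stands in here for the 037-vs-1047 bracket mismatch; the effect is the same (one byte, two code pages, two different characters):

    # The original x'A2': an accented o only under certain "extended
    # ASCII" code pages; under Latin-1 the same byte is a cent sign.
    print(b"\xa2".decode("cp850"))      # o-acute
    print(b"\xa2".decode("latin-1"))    # cent sign

    # EBCDIC carries accented characters too: CP037 covers the full
    # Latin-1 repertoire, so o-acute has an EBCDIC byte as well.
    print("\u00f3".encode("cp037").hex())

    # Same byte, different EBCDIC code page, different character:
    print(b"\x4a".decode("cp037"))      # cent sign
    print(b"\x4a".decode("cp500"))      # left square bracket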
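On z/OS the real tool is ICONV from USS, something along the lines of 'iconv -f UTF-8 -t IBM-1047 in.txt > out.txt' (check the codeset names your system actually knows). As a hedged off-platform sketch of the same conversion, using Python's cp037 codec and made-up file names:

    # UTF-8 in, one specific EBCDIC code page out.  errors="strict"
    # makes characters with no home in the target code page fail
    # loudly instead of being silently mangled.
    with open("in.txt", "rb") as f:
        text = f.read().decode("utf-8")

    with open("out.txt", "wb") as f:
        f.write(text.encode("cp037", errors="strict"))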
HTH

.phsiii

P.S. Unicode normalization: there are a few "mistakes" in Unicode, that is, characters that are duplicated at different locations. E.g., the Greek omega and the ohm symbol. These display the same, so Unicode has this concept of "normalization", which means that it's considered legitimate to convert one of these to the other (with a specific target; YOU don't decide which one you like, the normalization rules do: in this case, "the Greek omega is the right one"). There are also combining characters: an a with aigu can be encoded as a single precomposed character, or as an a plus a "combining" aigu. Normalization converges these (in the common composed form, NFC, to the single precomposed character).

This is vital for comparisons: otherwise I send you something with the "wrong" omega or a-aigu in it and your searches/comparisons fail.

It gets worse: some languages stack multiple combining marks on one base character, such as a "d" with dots above and below. There are both "d with a dot below" and "d with a dot above" characters, but there is no "d with both dots" character. Thus there are four ways to represent this character:

1. d + combining-dot-above + combining-dot-below (three code points)
2. d + combining-dot-below + combining-dot-above (three code points)
3. d-with-dot-above + combining-dot-below (two code points)
4. d-with-dot-below + combining-dot-above (two code points)

Unicode normalization (NFC) will convert any of the first three sequences to the fourth; there's a sketch of this at the end of this note.

Note that this is why terms like "character" get fuzzy, and "glyph" or "grapheme" are better (and those two are subtly different, though mostly you only care if you're talking actual fonts). A glyph can be one or more code points, and a code point one or more bytes; "character" gets used imprecisely to refer to any of these three concepts (glyph, code point, byte), so beware!
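Here's the promised sketch of the normalization story, using Python's unicodedata module (NFC is the composed form most systems exchange; NFD is the fully decomposed one):

    import unicodedata

    # a-aigu: precomposed vs. a + combining acute.  Naive comparison
    # fails; compare normalized forms instead.
    s1, s2 = "\u00e1", "a\u0301"
    print(s1 == s2)                                # False
    print(unicodedata.normalize("NFC", s1) ==
          unicodedata.normalize("NFC", s2))        # True

    # The omega/ohm duplication: the rules pick the Greek omega.
    print(hex(ord(unicodedata.normalize("NFC", "\u2126"))))  # 0x3a9

    # The d-with-both-dots example: all four spellings converge on
    # number 4 (d-with-dot-below + combining dot above).
    forms = ["d\u0307\u0323",      # 1. three code points
             "d\u0323\u0307",      # 2. three code points
             "\u1e0b\u0323",       # 3. two code points
             "\u1e0d\u0307"]       # 4. two code points
    print({unicodedata.normalize("NFC", f) for f in forms})  # one entry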