>> James K. Lowden wrote:
>>> > 1.  "National" support.  COBOL programs define the runtime encoding and
>>> > collation of each string (sometimes implicitly).  COBOL defines two
>>> > encodings: "alphanumeric" and "national".  Every alphanumeric (and
>>> > national) variable and literal has a defined runtime encoding that is
>>> > distinct from the compile-time and runtime locale, and from the
>>> > encoding of the source code.  This means
>>> >
>>> >     MOVE 'foo' TO FOO.
>>> >
>>> > may involve iconv(3) and
>>> >
>>> >     IF 'foo' = FOO
>>> >
>>> > is defined as true/false depending on the *characters* represented, not
>>> > their encoding.  That 'foo' could be CP1140 (single-byte EBCDIC) and
>>> > FOO could be UTF-16.
>>> > ...
>>> > Conversion is a solved problem.  Comparison is not.
>>
>> Comparison consists of two steps:
>>   1) Convert both operands to Unicode. (Can be UTF-8, UTF-16, or UTF-32,
>>      which one does not matter.)
>>   2) If a "closed world" assumption is valid:
>>        Compare the two Unicode strings.
>>      Otherwise:
>>        Convert the two Unicode strings to normalization form NFD, and
>>        compare the results.
>>
>> By "closed world" I mean: Unicode text exchanged between programs
>> is typically assumed to be in Unicode normalization form NFC. See
>> https://www.unicode.org/faq/normalization.html#2 . If this assumption
>> holds, you don't need the normalization step above. Whereas if it
>> does not hold, for example, because the program can read arbitrary
>> text files, you need this normalization step.
>>
>> Paul Koning wrote:
>>> Unicode comparison is addressed by the "stringprep" library.
>>
>> Careful: "stringprep" does extra steps, which drop characters. See
>> https://datatracker.ietf.org/doc/html/rfc3454#section-3
>>
>>> > 2) a limited amount
>>> > of Unicode evaluation is available in (IIRC) gnulib
>>
>> Correct. The comparison without normalization is available in
>> libunistring as functions u8_cmp, u16_cmp, u32_cmp
>> https://www.gnu.org/software/libunistring/manual/html_node/Comparing-Unicode-strings.html
>> or u8_strcmp, u16_strcmp, u32_strcmp:
>> https://www.gnu.org/software/libunistring/manual/html_node/Comparing-NUL-terminated-Unicode-strings.html
>> Whereas the comparison with normalization is available as
>> functions u8_normcmp, u16_normcmp, u32_normcmp:
>> https://www.gnu.org/software/libunistring/manual/html_node/Normalizing-comparisons.html
>>
>> In Gnulib, each of these functions is available as a Gnulib module:
>> https://www.gnu.org/software/gnulib/manual/html_node/How-to-use-libunistring.html
>> https://www.gnu.org/software/gnulib/manual/html_node/_003cunistr_002eh_003e-modules.html
>> https://www.gnu.org/software/gnulib/manual/html_node/_003cuninorm_002eh_003e-modules.html
>>
>> Jose Marchesi writes:
>>> It would be good to avoid duplicating that code though.
>>
>> Especially as Unicode normalization is a rather complicated algorithm,
>> that includes data tables that change with every Unicode version.
>> If you duplicate that code, upgrades to newer Unicode versions (that
>> are released once a year) don't come for free. Whereas if you use
>> libunistring or Gnulib, they do come for free.
>
> Of all the libunistring functions I have copied in libga68, these are
> the ones I had to adapt to support strides:
>
>   int _libga68_u32_cmp (const uint32_t *s1, size_t stride1,
>                         const uint32_t *s2, size_t stride2,
>                         size_t n);
>   int _libga68_u32_cmp2 (const uint32_t *s1, size_t n1, size_t stride1,
>                          const uint32_t *s2, size_t n2, size_t stride2);
>
>   uint8_t *_libga68_u32_to_u8 (const uint32_t *s, size_t n, size_t stride,
>                                uint8_t *resultbuf, size_t *lengthp);
>
> Should I pursue a libunistring patch adding stride-aware extra
> interfaces like these?

Never mind: Bruno pointed out in another email that such stride-aware
interfaces would be too specialized for a general-purpose library.
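
For readers wondering what "stride-aware" means here, a minimal sketch
follows. It is not the libga68 code: the name stride_u32_cmp is
hypothetical, and it assumes that strides are counted in uint32_t
elements (not bytes) and are at least 1. It compares code point by code
point, with no normalization, in the spirit of u32_cmp.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the libga68 implementation.
   Compare N UTF-32 code points taken from two strided sequences.
   ASSUMPTION: each stride is a count of uint32_t elements, >= 1.
   Returns a negative value, zero, or a positive value, like memcmp.
   No Unicode normalization is performed.  */
int
stride_u32_cmp (const uint32_t *s1, size_t stride1,
                const uint32_t *s2, size_t stride2,
                size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      /* Element i of each sequence lives i * stride slots from its base.  */
      uint32_t c1 = s1[i * stride1];
      uint32_t c2 = s2[i * stride2];
      if (c1 != c2)
        return c1 < c2 ? -1 : 1;
    }
  return 0;
}
```

With both strides equal to 1 this reduces to an ordinary contiguous
comparison, which is why such a wrapper can stay on the caller's side
rather than in the library.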
