Any takers? Updated diff after previous commit below.
martijn@

On Sun, 2021-04-18 at 23:54 +0200, Martijn van Duren wrote:
> On Sun, 2021-04-18 at 22:53 +0200, Martijn van Duren wrote:
> > On Sun, 2021-04-18 at 17:52 +0200, Martijn van Duren wrote:
> > > I'm always frustrated when a unicode character question comes up and I
> > > have to look up the UTF-8 byte sequence to reproduce it. When fixing \x
> > > I found the \u and \U escape sequences in gprintf, which seem mighty
> > > handy for this exact case.
> > >
> > > My implementation differs from gprintf in that leading zeroes can be
> > > omitted, but I kept \u and \U for both compatibility and for cases like
> > > \u00ebb, where I don't want to add 6 zeroes just to get my desired
> > > unicode character in front of an isxdigit(3) character.
> > >
> > > gprintf talks about "Unicode (ISO/IEC 10646)" in their manpage for the
> > > \u case and just Unicode for the \U case. I read that glibc uses 10646
> > > internally for wchar_t, but I have no idea how 10646 might differ from
> > > true unicode for >= 0 <= 0xffff, so I stuck with just the term unicode
> > > in the manpage part.
> > >
> > > gprintf prints the \u or \U form for characters > 0x7f and < 0x100 in
> > > the C locale, where this diff currently outputs these byte values.
> > > My previous diff[0] should fix this.
> > >
> > > OK after unlock?
> > >
> > > martijn@
> > >
> > > [0] https://marc.info/?l=openbsd-tech&m=161875718324367&w=2
> > >
> > Guenther asked me offlist to not ignore leading zeroes.
> > Our printf implementation is already lenient, so the diff below still
> > allows omitting them, but throws a warning and returns an error code on
> > exit, similar to other violations.
> >
> And of course the [1-3] and [1-7] in the manpage need to be scratched.
> And I had an off-by-one in the counter...
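
For anyone who wants to experiment with the digit counting discussed in the
thread above outside of printf(1), here is a minimal standalone sketch. The
helper name parse_unicode_escape() is hypothetical and not part of the diff;
it only illustrates the rule that \u takes up to 4 and \U up to 8 hex digits,
and that a short sequence can be detected by comparing the count.

```c
#include <assert.h>
#include <ctype.h>

/*
 * Hypothetical helper, not part of the diff: parse the hex digits that
 * follow \u (want == 4) or \U (want == 8).  Stores the value in *out and
 * returns the number of digits consumed, so the caller can compare the
 * result against `want` to detect a short sequence.
 */
static int
parse_unicode_escape(const char *s, int want, unsigned long *out)
{
	unsigned long v = 0;
	int ch, n;

	assert(want == 4 || want == 8);

	for (n = 0; n < want && isxdigit((unsigned char)s[n]); n++) {
		ch = tolower((unsigned char)s[n]);
		/* Shift in one nibble per hex digit. */
		v = (v << 4) + (isdigit(ch) ? ch - '0' : ch - 'a' + 10);
	}
	*out = v;
	return n;
}
```

With this sketch, parsing "00ebb" with want == 4 consumes four digits and
yields 0xeb, leaving the trailing "b" as ordinary text. Parsing "ebb" consumes
only three digits and yields 0xebb, which is the kind of short sequence a
caller could warn about.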
Index: printf.1
===================================================================
RCS file: /cvs/src/usr.bin/printf/printf.1,v
retrieving revision 1.35
diff -u -p -r1.35 printf.1
--- printf.1	7 May 2021 14:31:27 -0000	1.35
+++ printf.1	7 May 2021 14:39:07 -0000
@@ -108,6 +108,14 @@ Write an 8-bit character whose ASCII val
 the 1- or 2-digit hexadecimal number
 .Ar num .
+.It Cm \eu Ns Ar num
+Write a unicode character whose value is
+the 4-digit hexadecimal number
+.Ar num .
+.It Cm \eU Ns Ar num
+Write a unicode character whose value is
+the 8-digit hexadecimal number
+.Ar num .
 .El
 .Pp
 Each format specification is introduced by the percent
@@ -361,6 +369,19 @@ no argument is used.
 In no case does a non-existent or small field width cause truncation of
 a field; padding takes place only if the specified field width exceeds
 the actual width.
+.Sh ENVIRONMENT
+.Bl -tag -width LC_CTYPE
+.It Ev LC_CTYPE
+The character encoding
+.Xr locale 1 .
+It decides which unicode values can be output in the current character encoding.
+If a character can't be displayed in the current locale it falls back to the
+shortest full
+.Cm \eu Ns Ar num
+or
+.Cm \eU Ns Ar num
+presentation.
+.El
 .Sh EXIT STATUS
 .Ex -std printf
 .Sh EXAMPLES
@@ -389,6 +410,8 @@ were set.
 .Pp
 The escape sequences
 .Cm \ee ,
+.Cm \eu ,
+.Cm \eU ,
 .Cm \ex
 and
 .Cm \e' ,
Index: printf.c
===================================================================
RCS file: /cvs/src/usr.bin/printf/printf.c,v
retrieving revision 1.27
diff -u -p -r1.27 printf.c
--- printf.c	7 May 2021 14:31:27 -0000	1.27
+++ printf.c	7 May 2021 14:39:07 -0000
@@ -33,10 +33,12 @@
 #include <err.h>
 #include <errno.h>
 #include <limits.h>
+#include <locale.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#include <wchar.h>
 
 static int	 print_escape_str(const char *);
 static int	 print_escape(const char *);
@@ -79,6 +81,8 @@ main(int argc, char *argv[])
 	char convch, nextch;
 	char *format;
 
+	setlocale(LC_CTYPE, "");
+
 	if (pledge("stdio", NULL) == -1)
 		err(1, "pledge");
 
@@ -275,8 +279,10 @@ static int
 print_escape(const char *str)
 {
 	const char *start = str;
+	char mbc[MB_LEN_MAX + 1];
+	wchar_t wc = 0;
 	int value = 0;
-	int c;
+	int c, i;
 
 	str++;
 
@@ -344,6 +350,29 @@ print_escape(const char *str)
 	case 't':			/* tab */
 		putchar('\t');
 		break;
+
+	case 'U':
+	case 'u':
+		c = *str == 'U' ? 8 : 4;
+		str++;
+		for (; c-- && isxdigit((unsigned char)*str); str++) {
+			wc <<= 4;
+			wc += hextobin(*str);
+		}
+		if (c != -1) {
+			warnx("missing hexadecimal number in escape");
+			rval = 1;
+		}
+		if ((c = wctomb(mbc, wc)) == -1) {
+			printf("\\%c%0*X", wc > 0xffff ? 'U' : 'u',
+			    wc > 0xffff ? 8 : 4, wc);
+			wc = L'\0';
+			wctomb(NULL, wc);
+		} else {
+			for (i = 0; i < c; i++)
+				putchar(mbc[i]);
+		}
+		return str - start - 1;
 
 	case 'v':			/* vertical-tab */
 		putchar('\v');
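
The locale fallback described in the ENVIRONMENT text can be tried in
isolation with a small sketch like the one below. The helper name
format_codepoint() is hypothetical and not part of the diff, and it writes
into a caller-supplied buffer instead of stdout so the result is easy to
inspect. It assumes a platform where wchar_t holds a full code point, and
what wctomb(3) accepts depends on the LC_CTYPE locale in effect.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper, not part of the diff: render code point `wc` into
 * `buf` (of size `len`), either as the multibyte sequence for the current
 * LC_CTYPE locale or, if wctomb(3) rejects it, as a \uXXXX or \UXXXXXXXX
 * escape.  Returns the number of bytes written (excluding the NUL), or -1
 * if `buf` is too small.
 */
static int
format_codepoint(wchar_t wc, char *buf, size_t len)
{
	char mbc[MB_LEN_MAX];
	int n;

	assert(buf != NULL && len > 0);

	wctomb(NULL, L'\0');		/* reset any shift state */
	n = wctomb(mbc, wc);
	if (n == -1) {
		/* Not representable in this locale: emit the escape form. */
		n = snprintf(buf, len, "\\%c%0*lX",
		    wc > 0xffff ? 'U' : 'u',
		    wc > 0xffff ? 8 : 4, (unsigned long)wc);
		return (n < 0 || (size_t)n >= len) ? -1 : n;
	}
	if ((size_t)n >= len)
		return -1;
	memcpy(buf, mbc, (size_t)n);
	buf[n] = '\0';
	return n;
}
```

A program that never calls setlocale(3) runs in the "C" locale, where an
ASCII value such as L'A' comes back as the single byte "A", while a value
like 0x1F600 is expected to be rejected by wctomb(3) and come back in the
\U0001F600 form. After setlocale(LC_CTYPE, "") in a UTF-8 environment the
same call should produce the UTF-8 bytes instead.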
