On Sun, 2021-04-18 at 22:53 +0200, Martijn van Duren wrote:
> On Sun, 2021-04-18 at 17:52 +0200, Martijn van Duren wrote:
> > I'm always frustrated when a unicode character question comes up and I
> > have to look up the UTF-8 byte sequence to reproduce it. When fixing \x
> > I found the \u and \U escape sequences in gprintf, which seem mighty
> > handy for this exact case.
> >
> > My implementation differs from gprintf in that leading zeroes can be
> > omitted, but I kept \u and \U for both compatibility and for cases like
> > \u00ebb, where I don't want to add 6 zeroes just to get my desired
> > unicode character in front of an isxdigit(3) character.
> >
> > gprintf talks about "Unicode (ISO/IEC 10646)" in their manpage for the
> > \u case and just Unicode for the \U case. I read that glibc uses 10646
> > internally for wchar_t, but I have no idea how 10646 might differ from
> > true unicode in the range 0 to 0xffff, so I stuck with just the term
> > unicode in the manpage part.
> >
> > gprintf prints the \u or \U form for characters > 0x7f and < 0x100 in
> > the C locale, where this diff currently outputs these byte values.
> > My previous diff[0] should fix this.
> >
> > OK after unlock?
> >
> > martijn@
> >
> > [0] https://marc.info/?l=openbsd-tech&m=161875718324367&w=2
> >
> Guenther asked me offlist to not ignore leading zeroes.
> Our printf implementation is already lenient, so the diff below still
> allows omitting them, but prints a warning and returns an error code on
> exit, similar to other violations.
>
And of course the [1-3] and [1-7] in the manpage need to be scratched.
And I had an off-by-one in the counter...
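
For illustration only, the digit counting can be sketched on its own as
below. This is not the code from the diff: parse_unicode_escape() and
hexval() are hypothetical names, and the sketch reports a short sequence
through a flag instead of calling warnx() and setting rval.

```c
#include <ctype.h>

/* Hypothetical helper (not in the diff): value of one hex digit. */
static int
hexval(int ch)
{
	if (ch >= '0' && ch <= '9')
		return ch - '0';
	return tolower(ch) - 'a' + 10;
}

/*
 * Parse the hex digits that follow "\u" (ndigits = 4) or "\U"
 * (ndigits = 8).  Stores the value in *out, sets *is_short to 1 when
 * fewer than ndigits digits were present, and returns the number of
 * digits consumed.
 */
static int
parse_unicode_escape(const char *s, int ndigits, unsigned long *out,
    int *is_short)
{
	unsigned long v = 0;
	int n = 0;

	while (n < ndigits && isxdigit((unsigned char)s[n])) {
		v = (v << 4) | (unsigned long)hexval((unsigned char)s[n]);
		n++;
	}
	*out = v;
	*is_short = (n != ndigits);
	return n;
}
```

With this, "00ebb" after \u consumes four digits and yields 0xeb, leaving
the trailing 'b' as a literal character, while "ebb" consumes all three
digits, yields 0xebb, and is flagged as short.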
Index: printf.1
===================================================================
RCS file: /cvs/src/usr.bin/printf/printf.1,v
retrieving revision 1.34
diff -u -p -r1.34 printf.1
--- printf.1 16 Jan 2020 16:46:47 -0000 1.34
+++ printf.1 18 Apr 2021 21:53:47 -0000
@@ -103,6 +103,14 @@ Write a backslash character.
Write an 8-bit character whose ASCII value is
the 1-, 2-, or 3-digit octal number
.Ar num .
+.It Cm \eu Ns Ar num
+Write a unicode character whose value is
+the 4-digit hexadecimal number
+.Ar num .
+.It Cm \eU Ns Ar num
+Write a unicode character whose value is
+the 8-digit hexadecimal number
+.Ar num .
.El
.Pp
Each format specification is introduced by the percent
@@ -356,6 +364,19 @@ no argument is used.
In no case does a non-existent or small field width cause truncation of
a field; padding takes place only if the specified field width exceeds
the actual width.
+.Sh ENVIRONMENT
+.Bl -tag -width LC_CTYPE
+.It Ev LC_CTYPE
+The character encoding
+.Xr locale 1 .
+It decides which unicode values can be output in the current character encoding.
+If a character can't be displayed in the current locale it falls back to the
+shortest full
+.Cm \eu Ns Ar num
+or
+.Cm \eU Ns Ar num
+presentation.
+.El
.Sh EXIT STATUS
.Ex -std printf
.Sh EXAMPLES
@@ -383,7 +404,9 @@ and always operates as if
were set.
.Pp
The escape sequences
-.Cm \ee
+.Cm \ee ,
+.Cm \eu ,
+.Cm \eU
and
.Cm \e' ,
as well as omitting the leading digit
Index: printf.c
===================================================================
RCS file: /cvs/src/usr.bin/printf/printf.c,v
retrieving revision 1.26
diff -u -p -r1.26 printf.c
--- printf.c 18 Nov 2016 15:53:16 -0000 1.26
+++ printf.c 18 Apr 2021 21:53:47 -0000
@@ -33,10 +33,12 @@
#include <err.h>
#include <errno.h>
#include <limits.h>
+#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
+#include <wchar.h>
static int print_escape_str(const char *);
static int print_escape(const char *);
@@ -79,6 +81,8 @@ main(int argc, char *argv[])
char convch, nextch;
char *format;
+ setlocale(LC_CTYPE, "");
+
if (pledge("stdio", NULL) == -1)
err(1, "pledge");
@@ -275,8 +279,10 @@ static int
print_escape(const char *str)
{
const char *start = str;
+ char mbc[MB_LEN_MAX + 1];
+ wchar_t wc = 0;
int value;
- int c;
+ int c, i;
str++;
@@ -348,6 +354,29 @@ print_escape(const char *str)
case 't': /* tab */
putchar('\t');
break;
+
+ case 'U':
+ case 'u':
+ c = *str == 'U' ? 8 : 4;
+ str++;
+ for (; c-- && isxdigit((unsigned char)*str); str++) {
+ wc <<= 4;
+ wc += hextobin(*str);
+ }
+ if (c != -1) {
+ warnx("missing hexadecimal number in escape");
+ rval = 1;
+ }
+ if ((c = wctomb(mbc, wc)) == -1) {
+ printf("\\%c%0*X", wc > 0xffff ? 'U' : 'u',
+ wc > 0xffff ? 8 : 4, wc);
+ wc = L'\0';
+ wctomb(NULL, wc);
+ } else {
+ for (i = 0; i < c; i++)
+ putchar(mbc[i]);
+ }
+ return str - start - 1;
case 'v': /* vertical-tab */
putchar('\v');