Re: printf(1): support \u and \U

Martijn van Duren Fri, 07 May 2021 07:40:47 -0700

Any takers?

Updated diff after previous commit below.


martijn@

On Sun, 2021-04-18 at 23:54 +0200, Martijn van Duren wrote:
> On Sun, 2021-04-18 at 22:53 +0200, Martijn van Duren wrote:
> > On Sun, 2021-04-18 at 17:52 +0200, Martijn van Duren wrote:
> > > I'm always frustrated when a unicode character question comes up and I
> > > have to look up the UTF-8 byte sequence to reproduce it. When fixing \x
> > > I found the \u and \U escape sequences in gprintf, which seem mighty
> > > handy for this exact case.
> > > 
> > > My implementation differs from gprintf in that leading zeroes can be
> > > omitted, but I kept \u and \U for both compatibility and for cases like
> > > \u00ebb, where I don't want to add 6 zeroes just to get my desired
> > > unicode character in front of an isxdigit(3) character.
> > > 
> > > gprintf talks about "Unicode (ISO/IEC 10646)" in their manpage for the
> > > \u case and just Unicode for the \U case. I read that glibc uses 10646
> > > internally for wchar_t, but I have no idea how 10646 might differ from
> > > true unicode for values from 0 through 0xffff, so I stuck with just the
> > > term unicode in the manpage part.
> > > 
> > > gprintf prints the \u or \U form for characters > 0x7f and < 0x100 in
> > > the C locale, where this diff currently outputs these byte values.
> > > My previous diff[0] should fix this.
> > > 
> > > OK after unlock?
> > > 
> > > martijn@
> > > 
> > > [0] https://marc.info/?l=openbsd-tech&m=161875718324367&w=2
> > > 
> > Guenther asked me offlist to not ignore leading zeroes.
> > Our printf implementation is already lenient, so the diff below still
> > accepts fewer digits, but prints a warning and returns an error code on
> > exit, similar to other violations.
> > 
> And of course the [1-3] and [1-7] in the manpage need to be scratched.
> And I had an off-by-one in the counter...
> 

Index: printf.1
===================================================================
RCS file: /cvs/src/usr.bin/printf/printf.1,v
retrieving revision 1.35
diff -u -p -r1.35 printf.1
--- printf.1    7 May 2021 14:31:27 -0000       1.35
+++ printf.1    7 May 2021 14:39:07 -0000
@@ -108,6 +108,14 @@ Write an 8-bit character whose ASCII val
 the 1- or 2-digit hexadecimal
 number
 .Ar num .
+.It Cm \eu Ns Ar num
+Write a unicode character whose value is
+the 4-digit hexadecimal number
+.Ar num .
+.It Cm \eU Ns Ar num
+Write a unicode character whose value is
+the 8-digit hexadecimal number
+.Ar num .
 .El
 .Pp
 Each format specification is introduced by the percent
@@ -361,6 +369,19 @@ no argument is used.
 In no case does a non-existent or small field width cause truncation of
 a field; padding takes place only if the specified field width exceeds
 the actual width.
+.Sh ENVIRONMENT
+.Bl -tag -width LC_CTYPE
+.It Ev LC_CTYPE
+The character encoding
+.Xr locale 1 .
+It decides which unicode values can be output in the current character encoding.
+If a character can't be displayed in the current locale it falls back to the
+shortest full
+.Cm \eu Ns Ar num
+or
+.Cm \eU Ns Ar num
+presentation.
+.El
 .Sh EXIT STATUS
 .Ex -std printf
 .Sh EXAMPLES
@@ -389,6 +410,8 @@ were set.
 .Pp
 The escape sequences
 .Cm \ee ,
+.Cm \eu ,
+.Cm \eU ,
 .Cm \ex
 and
 .Cm \e' ,
Index: printf.c
===================================================================
RCS file: /cvs/src/usr.bin/printf/printf.c,v
retrieving revision 1.27
diff -u -p -r1.27 printf.c
--- printf.c    7 May 2021 14:31:27 -0000       1.27
+++ printf.c    7 May 2021 14:39:07 -0000
@@ -33,10 +33,12 @@
 #include <err.h>
 #include <errno.h>
 #include <limits.h>
+#include <locale.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#include <wchar.h>
 
 static int      print_escape_str(const char *);
 static int      print_escape(const char *);
@@ -79,6 +81,8 @@ main(int argc, char *argv[])
        char convch, nextch;
        char *format;
 
+       setlocale(LC_CTYPE, "");
+
        if (pledge("stdio", NULL) == -1)
                err(1, "pledge");
 
@@ -275,8 +279,10 @@ static int
 print_escape(const char *str)
 {
        const char *start = str;
+       char mbc[MB_LEN_MAX + 1];
+       wchar_t wc = 0;
        int value = 0;
-       int c;
+       int c, i;
 
        str++;
 
@@ -344,6 +350,29 @@ print_escape(const char *str)
        case 't':                       /* tab */
                putchar('\t');
                break;
+
+       case 'U':
+       case 'u':
+               c = *str == 'U' ? 8 : 4;
+               str++;
+               for (; c-- && isxdigit((unsigned char)*str); str++) {
+                       wc <<= 4;
+                       wc += hextobin(*str);
+               }
+               if (c != -1) {
+                       warnx("missing hexadecimal number in escape");
+                       rval = 1;
+               }
+               if ((c = wctomb(mbc, wc)) == -1) {
+                       printf("\\%c%0*X", wc > 0xffff ? 'U' : 'u',
+                           wc > 0xffff ? 8 : 4, wc);
+                       wc = L'\0';
+                       wctomb(NULL, wc);
+               } else {
+                       for (i = 0; i < c; i++)
+                               putchar(mbc[i]);
+               }
+               return str - start - 1;
 
        case 'v':                       /* vertical-tab */
                putchar('\v');
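
The fallback branch in the diff above prints the shortest full \u or \U form when wctomb(3) cannot encode the character. A minimal standalone sketch of just that formatting step follows. It assumes the same 0xffff threshold as the diff; format_unicode_fallback() is a hypothetical name used for illustration only, and it writes into a buffer rather than to stdout so it can be checked without a locale.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of the diff: write the escape form of
 * code point cp into buf. Values up to 0xffff use \u with 4 uppercase
 * hex digits, larger values use \U with 8.
 * Returns what snprintf(3) returns: the length of the formatted escape.
 */
int
format_unicode_fallback(char *buf, size_t len, unsigned long cp)
{
	assert(buf != NULL && len > 0);

	if (cp > 0xffff)
		return snprintf(buf, len, "\\U%08lX", cp);
	return snprintf(buf, len, "\\u%04lX", cp);
}
```

For example, 0xeb is formatted as the six characters \u00EB, while 0x1f600 needs the ten-character \U0001F600 form.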
