On Mon, 13 Jul 2020 13:02:44 +0200, Jan Stary wrote:

> This is current/amd64.
>
> On UTF input, awk segfaults when using a multi-character RS:
>
> $ cat /tmp/in
> č
>
> $ hexdump -C /tmp/in
> 00000000  c4 8d 0a                                          |...|
> 00000003
>
> $ cat /tmp/in | awk '{print$1}'
> č
>
> $ cat /tmp/in | awk -v RS=x '{print$1}'
> č
>
> $ cat /tmp/in | awk -v RS=xy '{print$1}'
> Segmentation fault (core dumped)

Nice catch.  The actual bug is caused by using a signed char as an
index into an array, resulting in a negative index.  Once debugged,
the fix is simple.

 - todd

diff --git a/b.c b/b.c
index c167b50..f7fbc0e 100644
--- a/b.c
+++ b/b.c
@@ -684,7 +684,7 @@ bool fnematch(fa *pfa, FILE *f, char **pbuf, int *pbufsize, int quantum)
 				FATAL("stream '%.30s...' too long", buf);
 			buf[k++] = (c = getc(f)) != EOF ? c : 0;
 		}
-		c = buf[j];
+		c = (unsigned char)buf[j];
 			/* assert(c < NCHARS); */
 
 		if ((ns = pfa->gototab[s][c]) != 0)
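
For anyone who wants to see the failure mode in isolation, here is a
minimal sketch.  It is not the awk code itself: the helper names
index_before(), index_after() and in_range() are made up for
illustration, and the value 256 for NCHARS is an assumption.  It models
a platform where plain char is signed, as on amd64.

```c
#include <stddef.h>

#define NCHARS 256	/* assumed table width, for illustration only */

/*
 * Index as the unpatched line would compute it where plain char is
 * signed: bytes >= 0x80 turn into negative values.
 */
int
index_before(const char *buf, size_t j)
{
	return (signed char)buf[j];
}

/*
 * Index as the patched line computes it: the cast maps every byte
 * to the range 0..255 before it is widened to int.
 */
int
index_after(const char *buf, size_t j)
{
	return (unsigned char)buf[j];
}

/* Returns 1 if idx is a valid subscript for an NCHARS-wide table. */
int
in_range(int idx)
{
	return idx >= 0 && idx < NCHARS;
}
```

With the first byte of the reported input, 0xc4, index_before() yields
-60 and index_after() yields 196, so only the latter is a usable
subscript.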