[Bug 243229] awk in base system does not work with UTF-8 strings correctly

bugzilla-noreply Thu, 09 Jan 2020 17:47:16 -0800

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243229


--- Comment #1 from Conrad Meyer <c...@freebsd.org> ---
I'm not sure it makes sense to compute length() on UTF-8 strings as Unicode
codepoints.  The POSIX awk specification is fairly clear that you're correct:


> LC_CTYPE
> Determine the locale for the interpretation of sequences of bytes of text
> data as characters (for example, single-byte as opposed to multi-byte
> characters in arguments and input files), the behavior of character classes
> within regular expressions, the identification of characters as letters, and
> the mapping of uppercase and lowercase characters for the toupper and
> tolower functions.

However, the resulting behavior around indexing is nutty: this implies that
index(), match(), etc., are measured in *characters*.  To do this efficiently
one probably has to convert non-ASCII strings to wchar_t and operate on those.
As you can imagine, that would immensely slow down awk as a fast stream
processing utility.
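
To make the indexing difference concrete, here is a small sketch in Python
(not awk, and not how any awk implements this).  The helper names index_bytes
and index_chars are made up for illustration; they mimic awk's 1-based index()
that returns 0 when there is no match, once counting UTF-8 bytes and once
counting codepoints.

```python
# Illustration only: the same search reported in bytes and in codepoints.
# index_bytes and index_chars are hypothetical helpers, not awk functions.

def index_bytes(s: str, t: str) -> int:
    """1-based position of t in s, counted in UTF-8 bytes (0 if absent)."""
    return s.encode("utf-8").find(t.encode("utf-8")) + 1


def index_chars(s: str, t: str) -> int:
    """1-based position of t in s, counted in codepoints (0 if absent)."""
    return s.find(t) + 1


# U+00EF (i with diaeresis) takes two bytes in UTF-8, so every offset after
# it differs by one between the two interpretations.
s = "na\u00efve awk"
print(index_bytes(s, "awk"))
print(index_chars(s, "awk"))
```

A script that feeds one of these offsets to substr() only works if substr()
counts in the same unit, which is why the two interpretations cannot be mixed.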

POSIX is more explicit about toupper() and tolower(), where taking locale into
consideration is easier.

I guess I'm not clear on what value there is in a length() function that
operates on codepoints rather than bytes.
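
As a rough illustration of that doubt, the Python sketch below (again not awk;
length_bytes and length_codepoints are hypothetical names) counts the same
visible letter both ways.  It assumes the input is valid UTF-8.  Note that the
codepoint count also changes with the normalization form, so it is not simply
"the number of characters a user sees".

```python
# Illustration only: byte length versus codepoint length of one visible letter.
import unicodedata


def length_bytes(s: str) -> int:
    """Length of s in UTF-8 bytes."""
    return len(s.encode("utf-8"))


def length_codepoints(s: str) -> int:
    """Length of s in Unicode codepoints."""
    return len(s)


composed = "\u00e9"      # e with acute accent as a single codepoint
decomposed = "e\u0301"   # plain e followed by a combining acute accent

for label, text in (("composed", composed), ("decomposed", decomposed)):
    print(label, length_bytes(text), length_codepoints(text))

# Both spellings are canonically equivalent after NFC normalization.
print(unicodedata.normalize("NFC", decomposed) == composed)
```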

_______________________________________________
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs