https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243229
--- Comment #1 from Conrad Meyer <c...@freebsd.org> ---
I'm not sure it makes sense to compute length() on UTF-8 strings as Unicode
codepoints.

POSIX awk is somewhat clear that you're correct:

> LC_CTYPE
> Determine the locale for the interpretation of sequences of bytes of text
> data as characters (for example, single-byte as opposed to multi-byte
> characters in arguments and input files), the behavior of character classes
> within regular expressions, the identification of characters as letters, and
> the mapping of uppercase and lowercase characters for the toupper and
> tolower functions.

However, the resulting behavior around indexing is nutty: this implies that
index(), match(), etc., are measured in *characters*. To do this efficiently
one probably has to convert non-ASCII strings to wchar_t and operate on those.
As you could imagine, that would immensely slow down awk as a fast stream
processing utility.

POSIX is more explicit about toupper() and tolower(), where taking locale into
consideration is easier.

I guess I'm not clear on what value there is in a length() function that
operates on codepoints rather than bytes.
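
To make the byte versus character distinction above concrete, here is a minimal sketch in Python (not awk, and not taken from any awk implementation). The helper names byte_length, char_length, byte_index and char_index are hypothetical and exist only for this illustration; the sample string is an assumption as well. Offsets are 1-based to mirror awk's index(), with 0 meaning "not found".

```python
# Illustration only: compares byte-oriented and codepoint-oriented
# length/index results for the same UTF-8 text.

def byte_length(s: str) -> int:
    """Length measured in UTF-8 bytes (what a byte-oriented awk reports)."""
    return len(s.encode("utf-8"))

def char_length(s: str) -> int:
    """Length measured in Unicode codepoints."""
    return len(s)

def byte_index(haystack: str, needle: str) -> int:
    """1-based offset of needle in haystack, counted in UTF-8 bytes."""
    return haystack.encode("utf-8").find(needle.encode("utf-8")) + 1

def char_index(haystack: str, needle: str) -> int:
    """1-based offset of needle in haystack, counted in codepoints."""
    return haystack.find(needle) + 1

# "naive cafe" with two accented letters, each 2 bytes in UTF-8.
text = "na\u00efve caf\u00e9"

print(byte_length(text), char_length(text))                # 12 10
print(byte_index(text, "caf\u00e9"), char_index(text, "caf\u00e9"))  # 8 7
```

The two measures agree for pure ASCII input and diverge as soon as a multi-byte sequence appears, so any offset returned by index() or match() has to be interpreted in the same unit that length() and substr() use.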