Steffen Nurpmeso wrote in
 <20241107013734.gC5CYhMl@steffen%sdaoden.eu>:
 |Jinsong Zhao wrote in
 | <09350f56-59c1-4a2f-b7cc-9063e0c24...@yeah.net>:
 ||I was trying to use st on a FreeBSD workstation, and my shell is csh.
 ||When I use backspace to delete the Chinese character, I observe strange
 ||behavior.
 ||
 ||On the first,
 ||zjs@freebsd:~ % 中文|
not to mention that possibly only the wcwidth(3) attributes of these
"So" (Symbol, other) Unicode entries are wrong.  That would then be
a bug in the locale tables of FreeBSD.

  ...
 ||This behavior is observed under bash, but not under sh.

Bash also uses wcwidth(3).  sh seems to use the BSD editline library
instead, and that surely runs the data back and forth through
myriads of successive mbtowc and wctomb calls, and likely keeps,
like eg ncurses, "index slots" instead of simple "character byte
data".  So when you backspace, all bytes making up an "index slot"
are removed, whereas st (and mksh fwiw) simply "synchronizes back"
on the "character byte data" until it finds a UTF-8 start byte.

That is: with Unicode combining characters etc multiple adjacent
such UTF-8 characters form a single "grapheme" in Unicode terms;
many languages have / know / require that.  Ie bash:

  master:lib/readline/rlmbutil.h:
  # define WCWIDTH(wc) ((_rl_utf8locale && UNICODE_COMBINING_CHAR(wc)) ? 0 : _rl_wcwidth(wc))

With that, backspace in reality has to skip over multiple adjacent
(UTF-8) characters (aka several multi-byte sequences).
For the simple line editor i have written for my MUA i use

  tc.tc_novis = (iswprint(wc) == 0);
  tc.tc_width = a_tty_wcwidth(wc);

(where it is not wcwidth() because ISO C did not standardize it).
I use cells aka index-slots, too.

Having said that, now i confused myself.  What is plain is that bash
on Linux (glibc 2.40) *can* handle these characters.  So likely the
character set data of the actual locale you are using on your
specific FreeBSD does not correctly describe the symbols you mention.
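
To make the "synchronize back" variant from above concrete, here is
a minimal sketch in C.  It is not taken from st, mksh or any other
real program; the names is_utf8_cont and utf8_backspace are made up
for this example.  It assumes the buffer holds valid UTF-8 and only
steps back to the previous start byte, so it knows nothing about
combining characters or display width.

```c
#include <assert.h>
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_utf8_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: the cursor sits at byte offset pos in buf.
 * Return the offset the cursor has after deleting one UTF-8 encoded
 * character to its left.  This walks back over continuation bytes
 * until it reaches a start byte (or offset 0); it does not group
 * combining characters with their base character.
 */
static size_t utf8_backspace(const char *buf, size_t pos)
{
    assert(buf != NULL);

    if (pos == 0)
        return 0;

    do {
        --pos;
    } while (pos > 0 && is_utf8_cont((unsigned char)buf[pos]));

    return pos;
}
```

For the six bytes of "中文" (E4 B8 AD E6 96 87) a call with pos 6
returns 3, and a second call with pos 3 returns 0.  For "e" followed
by U+0301 (65 CC 81) a call with pos 3 returns 1, so the accent goes
but the base letter stays; an editor that keeps cells would instead
have to decide whether both belong to one slot.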

Now it *must* be said that in the latest UnicodeData i have (from
2019, ooops), i see

  3197;IDEOGRAPHIC ANNOTATION MIDDLE MARK;So;0;L;<super> 4E2D;;;;N;KAERITEN TYUU;;;;
  32A5;CIRCLED IDEOGRAPH CENTRE;So;0;L;<circle> 4E2D;;;;N;CIRCLED IDEOGRAPH CENTER;;;;
  1F22D;SQUARED CJK UNIFIED IDEOGRAPH-4E2D;So;0;L;<square> 4E2D;;;;N;;;;;
  2F42;KANGXI RADICAL SCRIPT;So;0;ON;<compat> 6587;;;;N;;;;;
  3246;CIRCLED IDEOGRAPH SCHOOL;So;0;L;<circle> 6587;;;;N;;;;;

but *no* other occurrences of U+4E2D or U+6587, so maybe the
fallback for "unknown" code points is wrong.  My thing uses

  # ifdef mx_HAVE_WCWIDTH
     w = (wc == '\t' ? 1 : wcwidth(wc));
  # else
     if(wc == '\t' || iswprint(wc))
        w = 1 + (wc >= 0x1100u); /* S-CText isfullwidth() */
     else
        w = -1;
  # endif

which is very crude, but since both code points are above U+1100
the fallback branch treats them as fullwidth, aka of width 2.

  ...

Hope that helps .. :/

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
|
|And in Fall, feel "The Dropbear Bard"s ball(s).
|
|The banded bear
|without a care,
|Banged on himself fore'er and e'er
|
|Farewell, dear collar bear