14.03.2018 19:40 Pádraig Brady <[email protected]> wrote:
> [...]
> One can browse the abbreviations by length using:
>
>   locale -a | grep utf8 |
>   while read l; do LC_ALL=$l locale abmon; done |
>   tr ';' '\n' | sort -u | grep '.\{5,\}' |
>   while read mon; do
>     printf '%02d %s\n' "$(echo "$mon" | wc -L)" "$mon"
>   done |
>   sort -n | less
>
> That shows a couple of existing issues with the limit of 5.
> ln_CD.utf8 (Democratic Republic of the Congo) needs a length of 7 to be
> unambiguous,
> while Arabic needs 12!
> [...]
>
> $ LC_ALL=ln_CD.utf8 locale abmon
> sánzá1.;sánzá2.;sánzá3.;sánzá4.;sánzá5.;sánzá6.;sánzá7.;sánzá8.;sánzá9.;sánz10.;sánzá11.;sánzá12.

Nice script, thank you. :-)

The issue with ln_CD no longer exists; it was fixed in June/July 2017.
Please see the output on Fedora 28 (beta) with glibc 2.27:

$ LC_ALL=ln_CD.utf8 locale abmon
yan;fbl;msi;apl;mai;yun;yul;agt;stb;ɔtb;nvb;dsb

but it does not help because some Arabic locales still need 12.
Even worse, your script run on the same machine gives the following
output (only the final lines):

...
11 siakwa kati
11 yahbra kati
11 تشرين الأول
11 كانون الأول
12 kakamuk kati
12 pastara kati
12 waupasa kati
12 تشرين الثاني
12 كانون الثاني
15 lî wainhka kati
15 lih mairin kati
(END)

Those with 15 characters come from the miq_NI locale, which was
introduced in September 2017 (glibc 2.27, released Feb 1, 2018):

$ LC_ALL=miq_NI.utf8 locale abmon
siakwa kati;kuswa kati;kakamuk kati;lî wainhka kati;lih mairin kati;lî kati;pastara kati;sikla kati;wîs kati;waupasa kati;yahbra kati;trisu kati
$ LC_ALL=miq_NI.utf8 locale mon
siakwa kati;kuswa kati;kakamuk kati;lî wainhka kati;lih mairin kati;lî kati;pastara kati;sikla kati;wîs kati;waupasa kati;yahbra kati;trisu kati

But, as you can see, this locale data should be fixed because abmon and
mon are identical; at the very least, the " kati" suffix, which appears
in every name, can probably be removed. Also, truncating the strings to
12 characters probably still leaves them unambiguous.

While at it: I have not checked, but does your tests/ls/abmon-align.sh
script check for the length required to make all abbreviated month names
unambiguous (i.e., how far can we truncate them while they remain
unambiguous), or just for the longest abbreviated month name?

> $ LC_ALL=ar_SY.utf8 locale abmon | tr ';' '\n'
> [...]

This is still true, although again mon and abmon seem to be the same in
ar_SY, which is probably not the best we can have. I wish I could fix
it, if only I knew how. :) (BTW, other Arabic variants seem to have
shorter abbreviated month names.)

> [...]
> Given the increase in supported size should only impact relatively few
> languages
> it probably makes sense to increase to 12. The attached does that
> and also augments the test to find ambiguous cases.

12 is more than I asked for, but that certainly does no harm.
My only remark: please remove "Lingala" from the commit message because
it is no longer true. Otherwise the patch seems to be OK.

Thank you and best regards,

Rafal
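
To illustrate the difference between "longest abbreviated name" and
"length required to stay unambiguous" raised above, here is a minimal
bash sketch. The function name min_unambiguous_len is hypothetical and
is not taken from tests/ls/abmon-align.sh or from the attached patch. It
counts characters as bash sees them in the current locale, not display
columns, so it only approximates what ls would need for alignment.

```shell
#!/bin/bash
# Hypothetical helper (not part of coreutils): print the smallest N such
# that truncating every given name to N characters keeps all names
# distinct. Returns 1 if no such N exists (two names are identical).
min_unambiguous_len() {
  local max=0 n name distinct
  # The longest name bounds the search.
  for name in "$@"; do
    [ "${#name}" -gt "$max" ] && max=${#name}
  done
  for ((n = 1; n <= max; n++)); do
    # Count distinct prefixes of length n. LC_ALL=C makes sort compare
    # bytes, so strings that merely collate equal are not merged.
    distinct=$(for name in "$@"; do printf '%s\n' "${name:0:n}"; done |
               LC_ALL=C sort -u | wc -l)
    if [ "$distinct" -eq "$#" ]; then
      echo "$n"
      return 0
    fi
  done
  return 1
}

# Possible usage, assuming the locale is installed (a UTF-8 locale must
# also be active in the shell so that ${name:0:n} cuts whole characters):
#   IFS=';' read -ra months <<< "$(LC_ALL=miq_NI.utf8 locale abmon)"
#   min_unambiguous_len "${months[@]}"
```

A result smaller than the longest name would mean that truncation is
safe for that locale; a non-zero exit status would mean that two names
are identical and no truncation length can help.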
