On 14/03/18 15:53, Rafal Luzynski wrote:
> 14.03.2018 19:40 Pádraig Brady <[email protected]> wrote:
>> [...]
>> One can browse the abbreviations by length using:
>>
>> locale -a | grep utf8 |
>> while read l; do LC_ALL=$l locale abmon; done |
>> tr ';' '\n' | sort -u | grep '.\{5,\}' |
>> while read mon; do
>> printf '%02d %s\n' "$(echo "$mon" | wc -L)" "$mon"
>> done |
>> sort -n | less
>>
>> That shows a couple of existing issues with the limit of 5.
>> ln_CD.utf8 (Democratic Republic of the Congo) needs a length of 7 to be
>> unambiguous,
>> while Arabic needs 12!
>> [...]
>>
>> $ LC_ALL=ln_CD.utf8 locale abmon
>> sánzá1.;sánzá2.;sánzá3.;sánzá4.;sánzá5.;sánzá6.;sánzá7.;sánzá8.;sánzá9.;sánz10.;sánzá11.;sánzá12.
>
> Nice, script, thank you. :-) The issue with ln_CD is no longer
> true, it has been fixed in June/July 2017. Please see the output
> on Fedora 28 (beta) with glibc 2.27:
>
> $ LC_ALL=ln_CD.utf8 locale abmon
> yan;fbl;msi;apl;mai;yun;yul;agt;stb;ɔtb;nvb;dsb
>
> but it does not help because some Arabic languages still need 12.
> Even worse, your script ran at the same machine gives the following
> output (only the final lines):
>
> ...
> 11 siakwa kati
> 11 yahbra kati
> 11 تشرين الأول
> 11 كانون الأول
> 12 kakamuk kati
> 12 pastara kati
> 12 waupasa kati
> 12 تشرين الثاني
> 12 كانون الثاني
> 15 lî wainhka kati
> 15 lih mairin kati
> (END)
>
> Those with 15 characters come from miq_NI language which has been
> introduced in September 2017 (glibc 2.27, released Feb 1, 2018):
>
> $ LC_ALL=miq_NI.utf8 locale abmon
> siakwa kati;kuswa kati;kakamuk kati;lî wainhka kati;lih mairin kati;lî
> kati;pastara kati;sikla kati;wîs kati;waupasa kati;yahbra kati;trisu kati
> $ LC_ALL=miq_NI.utf8 locale mon
> siakwa kati;kuswa kati;kakamuk kati;lî wainhka kati;lih mairin kati;lî
> kati;pastara kati;sikla kati;wîs kati;waupasa kati;yahbra kati;trisu kati
>
> But, as you can see, this locale data should be fixed because abmon
> and mon are the same;
> at least " kati" which appears everywhere may
> be probably removed. Also truncating the string to 12 characters
> probably still makes it unambiguous.
>
> While at this, I have not checked but does your tests/ls/abmon-align.sh
> script check for the length required to make all abbreviated month
> names unambiguous (i.e., how many letters can we truncate to ensure
> that the month names are still unambiguous) or just the longest
> abbreviated month name?

It checks that 12 months for a few sample languages are unambiguous.

>
>> $ LC_ALL=ar_SY.utf8 locale abmon | tr ';' '\n'
>> [...]
>
> This is still true although again, mon and abmon seem to be the same
> in ar_SY which is probably not the best we can have. I wish I could
> fix it if I only knew how. :)

A patch to glibc would be most appreciated,
but as for content I don't know.
I see ICU has narrow, short, long variants,
but for ar_SY the narrow are ambiguous,
and the short are copies of the long ones:
http://demo.icu-project.org/icu-bin/locexp?d_=en&_=ar_SY

> (BTW, other Arabic variants seem to have
> the abbreviated month names shorter.)

Right, I see the long Arabic names are derived from Aramaic:
https://en.wikipedia.org/wiki/Arabic_names_of_calendar_months

>> [...]
>> Given the increase in supported size should only impact relatively few
>> languages
>> it probably makes sense to increase to 12. The attached does that
>> and also augments the test to find ambiguous cases.
>
> 12 is more than I asked for but that's definitely not destructive.
> My only remark is: please remove "Lingala" from the commit comment
> because it is no longer true. Otherwise the patch seems to be OK.

Given this is usually a deficiency in the locale rather than
inherent in the language, I'm definitely not going above 12.
I'd even drop it to 8 if there were apparent short abmons for all
languages, but will leave at 12 as this isn't the case for ar_SY at least.

cheers,
Pádraig
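
The question raised above, how far abbreviated month names can be truncated while staying unambiguous, could be explored per locale with a small helper. The following is only a rough sketch and not the check used in tests/ls/abmon-align.sh: the function name min_unambiguous_len is made up for illustration, and it assumes awk counts characters rather than bytes (true for gawk in a UTF-8 locale, not for every awk).

```shell
#!/bin/sh
# Hypothetical helper, not part of coreutils or glibc.
# Usage: min_unambiguous_len 'name1;name2;...'
# The argument is a ';'-separated list as printed by `locale abmon`.
# Prints the smallest prefix length at which all names are distinct,
# or 0 if at least two names are identical (no truncation can help).
min_unambiguous_len() {
  printf '%s\n' "$1" | tr ';' '\n' | awk '
    {
      name[NR] = $0
      if (length($0) > max) max = length($0)
    }
    END {
      for (n = 1; n <= max; n++) {
        split("", seen)            # portable way to empty an array
        dup = 0
        for (i = 1; i <= NR; i++) {
          p = substr(name[i], 1, n)
          if (p in seen) { dup = 1; break }
          seen[p] = 1
        }
        if (!dup) { print n; exit }
      }
      print 0                      # identical names: always ambiguous
    }'
}
```

It could be called as min_unambiguous_len "$(LC_ALL=ln_CD.utf8 locale abmon)". With a character-aware awk, the old ln_CD data quoted above should give 7, and a locale whose list contains two identical names gives 0.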
