bug#30814: Please increase the value of MAX_MON_WIDTH in ls.c

Pádraig Brady Fri, 16 Mar 2018 03:18:19 -0700

On 14/03/18 15:53, Rafal Luzynski wrote:
> 14.03.2018 19:40 Pádraig Brady <[email protected]> wrote:
>> [...]
>> One can browse the abbreviations by length using:
>>
>> locale -a | grep utf8 |
>> while read l; do LC_ALL=$l locale abmon; done |
>> tr ';' '\n' | sort -u | grep '.\{5,\}' |
>> while read mon; do
>> printf '%02d %s\n' "$(echo "$mon" | wc -L)" "$mon"
>> done |
>> sort -n | less
>>
>> That shows a couple of existing issues with the limit of 5.
>> ln_CD.utf8 (Democratic Republic of the Congo) needs a length of 7
>> to be unambiguous, while Arabic needs 12!
>> [...]
>>
>> $ LC_ALL=ln_CD.utf8 locale abmon
>> sánzá1.;sánzá2.;sánzá3.;sánzá4.;sánzá5.;sánzá6.;sánzá7.;sánzá8.;sánzá9.;sánz10.;sánzá11.;sánzá12.
> 
> Nice, script, thank you. :-) The issue with ln_CD is no longer
> true, it has been fixed in June/July 2017. Please see the output
> on Fedora 28 (beta) with glibc 2.27:
> 
> $ LC_ALL=ln_CD.utf8 locale abmon
> yan;fbl;msi;apl;mai;yun;yul;agt;stb;ɔtb;nvb;dsb
> 
> but it does not help because some Arabic languages still need 12.
> Even worse, your script ran at the same machine gives the following
> output (only the final lines):
> 
> ...
> 11 siakwa kati
> 11 yahbra kati
> 11 تشرين الأول
> 11 كانون الأول
> 12 kakamuk kati
> 12 pastara kati
> 12 waupasa kati
> 12 تشرين الثاني
> 12 كانون الثاني
> 15 lî wainhka kati
> 15 lih mairin kati
> (END)
> 
> Those with 15 characters come from miq_NI language which has been
> introduced in September 2017 (glibc 2.27, released Feb 1, 2018):
> 
> $ LC_ALL=miq_NI.utf8 locale abmon
> siakwa kati;kuswa kati;kakamuk kati;lî wainhka kati;lih mairin kati;lî kati;pastara kati;sikla kati;wîs kati;waupasa kati;yahbra kati;trisu kati
> $ LC_ALL=miq_NI.utf8 locale mon
> siakwa kati;kuswa kati;kakamuk kati;lî wainhka kati;lih mairin kati;lî kati;pastara kati;sikla kati;wîs kati;waupasa kati;yahbra kati;trisu kati
> 
> But, as you can see, this locale data should be fixed because abmon
> and mon are the same; at least " kati" which appears everywhere may
> be probably removed. Also truncating the string to 12 characters
> probably still makes it unambiguous.

> 
> While at this, I have not checked but does your tests/ls/abmon-align.sh
> script check for the length required to make all abbreviated month
> names unambiguous (i.e., how many letters can we truncate to ensure
> that the month names are still unambiguous) or just the longest
> abbreviated month name?

It checks that the 12 months are unambiguous for a few sample languages.
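
To illustrate the distinction asked about above (the longest name versus
the width needed to keep the names distinct), here is a minimal sketch.
The function name min_unambiguous_width is hypothetical and is not taken
from tests/ls/abmon-align.sh. It counts characters as awk sees them rather
than display columns, so it may not match how ls measures width.

```shell
# Hypothetical helper, not part of coreutils or its test suite.
# Input:  a ';'-separated list of names, as printed by `locale abmon`.
# Output: the smallest N such that truncating every name to N characters
#         leaves all names distinct. Exits non-zero if no such N exists
#         (for example when two names are identical).
# Note:   character counting depends on the awk implementation and locale;
#         some awks count bytes, so non-ASCII results are an approximation.
min_unambiguous_width() {
  printf '%s\n' "$1" | tr ';' '\n' | awk '
    { name[NR] = $0; if (length($0) > max) max = length($0) }
    END {
      for (n = 1; n <= max; n++) {
        split("", seen)            # portable way to empty the array
        clash = 0
        for (i = 1; i <= NR; i++) {
          key = substr(name[i], 1, n)
          if (key in seen) { clash = 1; break }
          seen[key] = 1
        }
        if (!clash) { print n; exit 0 }
      }
      exit 1
    }'
}

# Example use against an installed locale (output depends on the system):
#   min_unambiguous_width "$(LC_ALL=ar_SY.utf8 locale abmon)"
```

A result larger than MAX_MON_WIDTH for some locale would mean that
truncated names collide there, which is a different question from whether
the longest name merely exceeds the limit.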

> 
>> $ LC_ALL=ar_SY.utf8 locale abmon | tr ';' '\n'
>> [...]
> 
> This is still true although again, mon and abmon seem to be the same
> in ar_SY which is probably not the best we can have. I wish I could
> fix it if I only knew how. :)

A patch to glibc would be most appreciated, though I don't know what the
content should be.
I see ICU has narrow, short, long variants, but for ar_SY the narrow are
ambiguous, and the short are copies of the long ones:
http://demo.icu-project.org/icu-bin/locexp?d_=en&_=ar_SY

> (BTW, other Arabic variants seem to have
> the abbreviated month names shorter.)

Right, I see the long Arabic names are derived from Aramaic:
https://en.wikipedia.org/wiki/Arabic_names_of_calendar_months

>> [...]
>> Given the increase in supported size should only impact relatively few
>> languages it probably makes sense to increase to 12. The attached does
>> that and also augments the test to find ambiguous cases.
> 
> 12 is more than I asked for but that's definitely not destructive.
> My only remark is: please remove "Lingala" from the commit comment
> because it is no longer true. Otherwise the patch seems to be OK.

Given this is usually a deficiency in the locale rather than inherent
in the language, I'm definitely not going above 12.
I'd even drop it to 8 if there were apparent short abmons for
all languages, but will leave at 12 as this isn't the case for ar_SY at least.

cheers,
Pádraig
