On 2014-04-16 12.51, Kevin Bracey wrote:
> On 16/04/2014 07:48, Torsten Bögershausen wrote:
>> On 15.04.14 21:10, Peter Krefting wrote:
>>> Torsten Bögershausen:
>>>
>>>> diff --git a/utf8.c b/utf8.c
>>>> index a831d50..77c28d4 100644
>>>> --- a/utf8.c
>>>> +++ b/utf8.c
>>> Is there a script that generates this code from the Unicode database files,
>>> or did you hand-update it?
>>>
>> Some of the code points which have "0 length on the display" are called
>> "combining", others are called "vowels" or "accents".
>> E.g. 5BF is not marked any of them, but if you look at the glyph, it should
>> be combining (please correct me if that is wrong).
>
> Indeed it is combining (more specifically it has General Category
> "Nonspacing_Mark" = "Mn").
>
>>
>> If I could have found a file which indicates for each code point, what it
>> is, I could write a script.
>>
>
> The most complete and machine-readable data are in these files:
>
> http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
> http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
>
> The general categories can also be seen more legibly in:
>
> http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
>
> For docs, see:
>
> http://www.unicode.org/reports/tr44/
> http://www.unicode.org/reports/tr11/
> http://www.unicode.org/ucd/
>
> The existing utf8.c comments describe the attributes being selected from the
> tables (general categories "Cf","Mn","Me", East Asian Width "W", "F"). And
> they suggest that the combining character table was originally auto-generated
> from UnicodeData.txt with a "uniset" tool. Presumably this?
>
> https://github.com/depp/uniset
>
> The fullwidth-checking code looks like it was done by hand, although
> apparently uniset can process EastAsianWidth.txt.
>
> Kevin

Excellent, thanks for the pointers.
Running the script below shows that
"0X00AD SOFT HYPHEN" should have zero length (and some others too).
I wonder if that is really the case, and which one of the last 2 lines in the script is the right one.
What does this mean for us:
"Cf Format a format control character"

#!/bin/sh
if ! test -f UnicodeData.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
fi &&
if ! test -f EastAsianWidth.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
fi &&
if ! test -f DerivedGeneralCategory.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
fi &&
if ! test -d uniset; then
  git clone https://github.com/tboegi/uniset.git
fi &&
(
  cd uniset &&
  if ! test -x uniset; then
    autoreconf -i &&
    ./configure --enable-warnings=-Werror CFLAGS='-O0 -ggdb'
  fi &&
  make
) &&
UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn,Cf
#UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn
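
To see exactly which code points the "cat:Me,Mn,Cf" variant adds over
"cat:Me,Mn", the categories can also be read straight from UnicodeData.txt,
where the third semicolon-separated field is the General Category. The
following is only a rough sketch, independent of uniset; the function name
list_by_category is made up for illustration and is not part of uniset or git:

```sh
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not from uniset or git).
# Reads lines in UnicodeData.txt format on stdin and prints
# "codepoint;category;name" for every code point whose General Category
# (field 3) appears in the comma-separated list given as $1.
list_by_category () {
	awk -F';' -v cats="$1" '
		BEGIN {
			# Turn "Me,Mn,Cf" into a lookup table of wanted categories.
			n = split(cats, a, ",")
			for (i = 1; i <= n; i++)
				want[a[i]] = 1
		}
		# Field 1 = code point, field 2 = name, field 3 = General Category.
		($3 in want) { print $1 ";" $3 ";" $2 }
	'
}
```

Running "list_by_category Cf <UnicodeData.txt" would then list only the format
control characters, which is the set that decides between the last two lines
of the script above; 00AD SOFT HYPHEN should be among them.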