Re: [PATCH] Unicode: update of combining code points

Torsten Bögershausen Wed, 16 Apr 2014 12:59:26 -0700

On 2014-04-16 12.51, Kevin Bracey wrote:
> On 16/04/2014 07:48, Torsten Bögershausen wrote:
>> On 15.04.14 21:10, Peter Krefting wrote:
>>> Torsten Bögershausen:
>>>
>>>> diff --git a/utf8.c b/utf8.c
>>>> index a831d50..77c28d4 100644
>>>> --- a/utf8.c
>>>> +++ b/utf8.c
>>> Is there a script that generates this code from the Unicode database files, 
>>> or did you hand-update it?
>>>
>> Some of the code points which have "0 length on the display" are called
>> "combining", others are called "vowels" or "accents".
>> E.g. 5BF is not marked any of them, but if you look at the glyph, it should
>> be combining (please correct me if that is wrong).
> 
> Indeed it is combining (more specifically it has General Category 
> "Nonspacing_Mark" = "Mn").
> 
>>
>> If I could have found a file which indicates for each code point, what it
>> is, I could write a script.
>>
> 
> The most complete and machine-readable data are in these files:
> 
> http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
> http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
> 
> The general categories can also be seen more legibly in:
> 
> http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
> 
> For docs, see:
> 
> http://www.unicode.org/reports/tr44/
> http://www.unicode.org/reports/tr11/
> http://www.unicode.org/ucd/
> 
> The existing utf8.c comments describe the attributes being selected from the 
> tables (general categories "Cf","Mn","Me", East Asian Width "W", "F"). And 
> they suggest that the combining character table was originally auto-generated 
> from UnicodeData.txt with a "uniset" tool. Presumably this?
> 
> https://github.com/depp/uniset
> 
> The fullwidth-checking code looks like it was done by hand, although 
> apparently uniset can process EastAsianWidth.txt.
> 
> Kevin
Excellent, thanks for the pointers.
Running the script below shows that U+00AD SOFT HYPHEN (and some other
code points) would be treated as having zero width.
I wonder whether that is really correct, and which of the last two lines
of the script is the right one.


What does this category description mean for us?
"Cf     Format  a format control character"


#!/bin/sh

if ! test -f UnicodeData.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
fi &&
if ! test -f EastAsianWidth.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
fi &&
if ! test -f DerivedGeneralCategory.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
fi &&
if ! test -d uniset; then
  git clone https://github.com/tboegi/uniset.git
fi &&
(
  cd uniset &&
  if ! test -x uniset; then 
    autoreconf -i &&
    ./configure --enable-warnings=-Werror CFLAGS='-O0 -ggdb'
  fi &&
  make
) &&
UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn,Cf
#UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn











--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
