Hi,

I would like to forward the issue below, reported by Panu Kalliokoski
in 2012 (better late than never!). I think the correct category is
Mark-nonspacing (Mn), although I am not very familiar with Unicode.

It still occurs in grep 3.1. In this case, the input is "a" followed by
U+0301 (combining acute accent), i.e. the decomposed form of "á":

 $ echo árbol | grep -o '[[:alpha:]]*'
 a
 rbol
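
For reference, here is a small Python sketch (my own addition, not part of
the original report) that uses only the standard unicodedata module to show
which general category the accent belongs to. It says nothing about how grep
classifies the character internally; it only checks the Unicode side, and
the helper name describe() is just for illustration.

```python
import unicodedata

# Decomposed input: 'a' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "a\u0301rbol"


def describe(text):
    """Return (code point, general category, name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


# Inspect the first two code points of the test string.
for row in describe(decomposed[:2]):
    print(*row)
# U+0061 Ll LATIN SMALL LETTER A
# U+0301 Mn COMBINING ACUTE ACCENT

# NFC normalization composes the pair into the single letter U+00E1.
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))
# 6 5
```

So the accent is reported as Mn (Mark, nonspacing) rather than Lm, and
the split only shows up when the input is in decomposed form.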

Cheers,

 -- Santiago

On Mon, 05 Mar 2012 13:08:43 +0200 "Panu A. Kalliokoski" <ate...@sange.fi> 
wrote:
> Package: grep
> Version: 2.6.3-3
> Severity: normal
> 
> 
> It seems that grep misclassifies combining letters (unicode class Lm) as
> punctuation, when they should be letters.  For instance:
> 
> $ echo d̪ʌ̀lì | grep -o '[[:alpha:]]*'
> d
> ʌ
> li
> 
> As a consequence, combining accents are not seen as "word-constituent":
> 
> $ echo d̪ʌ̀lì | grep -o '\w*'
> d
> ʌ
> li
> 
> This also causes false positives on word-boundary conditions, such as
> the following:
> 
> $ echo d̪ʌ̀lì | grep -w ʌ
> d̪ʌ̀lì
> 
> I suggest that combining letters should be part of [:alpha:] instead of
> [:punct:].


