Hi,

I would like to forward the issue below, reported by Panu Kalliokoski in 2012 (better late than never!). I think the correct category is Mark-nonspacing (Mn), although I am not very familiar with Unicode.
It still occurs in grep 3.1. In this case, using the U+0301 acute accent:

$ echo árbol | grep -o '[[:alpha:]]*'
a
rbol

Cheers,

-- Santiago

On Mon, 05 Mar 2012 13:08:43 +0200 "Panu A. Kalliokoski" <ate...@sange.fi> wrote:

> Package: grep
> Version: 2.6.3-3
> Severity: normal
>
>
> It seems that grep misclassifies combining letters (unicode class Lm) as
> punctuation, when they should be letters. For instance:
>
> $ echo d̪ʌ̀lì | grep -o '[[:alpha:]]*'
> d
> ʌ
> li
>
> As a consequence, combining accents are not seen as "word-constituent":
>
> $ echo d̪ʌ̀lì | grep -o '\w*'
> d
> ʌ
> li
>
> This causes also false positives on word-boundary conditions, such as
> the below:
>
> $ echo d̪ʌ̀lì | grep -w ʌ
> d̪ʌ̀lì
>
> I suggest that combining letters should be part of [:alpha:] instead of
> [:punct:].
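
For reference, the Unicode general category of the characters involved can be checked independently of grep. The following is only a sketch using Python's standard unicodedata module. The helper name describe is illustrative, and the escapes assume the sample strings are stored in decomposed form (base letter followed by a combining mark), which may not match exactly what was typed in the reports above.

```python
import unicodedata


def describe(text):
    """Return (code point, general category, name) for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# "arbol" with a combining acute accent (U+0301) after the first letter,
# assumed to match the grep 3.1 reproduction.
for row in describe("a\u0301rbol"):
    print(*row)

# Assumed decomposed spelling of the sample from the 2012 report:
# d + U+032A, U+028C + U+0300, l, i + U+0300.
for row in describe("d\u032a\u028c\u0300li\u0300"):
    print(*row)

# For contrast, a character that really is in class Lm (Modifier Letter).
print(*describe("\u02b0")[0])
```

U+0301 is reported as Mn (Nonspacing Mark), while Lm (Modifier Letter) covers characters such as U+02B0, so Mn looks like the category that matters for the accents in these examples.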