Re: [racket-users] Are Regular Expression classes Unicode aware?

Peter W A Wood Fri, 10 Jul 2020 03:29:32 -0700

Dear Ryan

Thank you very much for the kind, detailed explanation which I will study 
carefully. It was not my intention to reply to you off-list. I hope I have 
correctly addressed this reply to appear on-list.


Peter

> On 10 Jul 2020, at 15:47, Ryan Culpepper <rmculpepp...@gmail.com> wrote:
> 
> (I see this went off the mailing list. If you reply, please consider CCing 
> the list.)
> 
> Yes, I understood your goal of trying to capture the notion of Unicode 
> "alphabetic" characters with a regular expression.
> 
> As far as I know, Unicode doesn't have a notion of "alphabetic", but it does 
> assign every code point to a "General category", consisting of a main 
> category and a subcategory. There is a category called "Letter", which seems 
> like one reasonable generalization of "alphabetic".
> 
> In Racket, you can get the code for a character's category using 
> `char-general-category`. For example:
> 
>   > (char-general-category #\A)
>   'lu
>   > (char-general-category #\é)
>   'll
>   > (char-general-category #\ß)
>   'll
>   > (char-general-category #\7)
>   'nd
> 
> The general category for "A" is "Letter, uppercase", which has the code "Lu", 
> which Racket turns into the symbol 'lu. The general category of "é" is 
> "Letter, lowercase", code "Ll", which becomes 'll. The general category of 
> "7" is "Number, decimal digit", code "Nd".
> 
> In Racket regular expressions, the \p{category} syntax lets you recognize 
> characters from a specific category. For example, \p{Lu} recognizes 
> characters with the category "Letter, uppercase", and \p{L} recognizes 
> characters with the category "Letter", which is the union of "Letter, 
> uppercase", "Letter, lowercase", and so on.
> 
> So the regular expression #px"^\\p{L}+$" recognizes sequences of one or more 
> Unicode letters. For example:
> 
>   > (regexp-match? #px"^\\p{L}+$" "héllo")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "straße")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "二の句")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "abc123")
>   #f ;; No, contains numbers
> 
> There are still some problems to watch out for, though. For example, accented 
> characters like "é" can be expressed as a single pre-composed code point or 
> "decomposed" into a base letter and a combining mark. You can get the 
> decomposed form by converting the string to "decomposed normal form" (NFD), 
> and the regexp above won't match that string.
> 
>   > (map char-general-category (string->list (string-normalize-nfd "é")))
>   '(ll mn)
>   > (regexp-match? #px"^\\p{L}+$" (string-normalize-nfd "héllo"))
>   #f
> 
> One fix would be to call `string-normalize-nfc` first, but some 
> letter-modifier pairs don't have pre-composed versions. Another fix would be 
> to expand the regexp to include modifiers. You'd have to decide which is 
> better based on your application.
> 
> Ryan
> 
> 
> 
> On Fri, Jul 10, 2020 at 2:10 AM Peter W A Wood <peterwaw...@gmail.com> wrote:
> Ryan
> 
> > On 9 Jul 2020, at 22:52, Ryan Culpepper <rmculpepp...@gmail.com> wrote:
> > 
> > If you want a regular expression that does match the example string, you 
> > can use the \p{property} notation. For example:
> > 
> >   > (regexp-match? #px"^\\p{L}+$" "h\uFFC3\uFFA9llo")
> >   #t
> > 
> > The "Regexp Syntax" docs have a grammar for regular expressions with links 
> > to examples.
> > 
> > Ryan
> 
> Thanks. I used héllo as an example. I was wondering if there was a way of 
> specifying a regular expression group for Unicode “alphabetic” characters. 
> 
> On reflection, it seems a somewhat esoteric requirement that is almost 
> impossible to satisfy. By way of example, would 
> “Straße" be considered alphabetic? Would “二の句” be considered alphabetic?
> 
> Strangely, Python considered the Japanese characters as being alphabetic but 
> will not accept “Straße” as a valid string. (I suspect this is due to some 
> problem relating to Locale..
> 
>  >>> "二の句".isalpha()
> True
> >>> “Straße".isalpha()
>   File "<stdin>", line 1
>     “Straße".isalpha()
>           ^
> SyntaxError: invalid character in identifier
> 
> Clearly, trying to identify “Unicode” alphabetic characters is far from 
> straightforward, though it may well be useful in processing some language 
> texts.
> 
> Peter
> 
> 

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/BC855B5D-80BF-458B-A2D2-9570B0436646%40gmail.com.

Re: [racket-users] Are Regular Expression classes Unicode aware?

Reply via email to