Re: Regular expression for latin1 letters?

Erik Sat, 13 May 2006 04:19:47 -0700

Dr.Ruud wrote:

Erik schreef:

I need to recognize latin1 letters in a regexp. How is it done? The
reason is that I want to fix a program called code2html, which is
written in Perl. I have the following regular expression for Ada
identifiers:
\\b[a-zA-Z](_?[a-zA-Z0-9])*\\b

but this is wrong because Ada identifiers include any latin1 letters,
not just a-z and A-Z. Does anyone know how?


Thank you for your help!

For [a-zA-Z] you can use [[:alpha:]], see `perldoc perlre`.

OK, it seems to work the same.

Why are the double backslashes there? I assume the \b's are meant as
word boundaries.

I do not know. That is just the way the file was written by someone else, see:

http://code2html.cvs.sourceforge.net/code2html/code2html/code2html?revision=1.13&view=markup

 /\b[[:alpha:]](?:_?[[:alpha:]0-9])*\b/

The (?:...) is a non-capturing group.

OK, a performance improvement that does not change the output I suppose.
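
To check that supposition, here is a minimal sketch comparing the capturing and non-capturing forms on the sample string 'ab_c'. The helper name match_info is made up for this illustration and is not part of code2html.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the matched text and the value of $1
# (undef if the pattern has no capturing group).
sub match_info {
    my ($re, $text) = @_;
    return unless $text =~ $re;
    my ($whole, $capture) = ($&, $1);
    return ($whole, $capture);
}

# Capturing group: $1 holds the last repetition of the group.
my ($m1, $c1) = match_info(qr/\b[a-zA-Z](_?[a-zA-Z0-9])*\b/, 'ab_c');

# Non-capturing group: same overall match, but $1 is not set.
my ($m2, $c2) = match_info(qr/\b[a-zA-Z](?:_?[a-zA-Z0-9])*\b/, 'ab_c');

print "capturing:     match=$m1 \$1=$c1\n";
print "non-capturing: match=$m2 \$1=", (defined $c2 ? $c2 : 'undef'), "\n";
```

Both patterns match the same text; only the side effect on $1 differs, so the output stays the same as long as the surrounding code never reads $1.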

If you don't want to allow "a_", you can make it

 /\b[[:alpha:]]+(?:_?[[:alpha:]0-9]+)*\b/

That does not seem to be necessary, because the previous expression already seems to exclude "a_".

It is a good idea to put an

 use encoding 'latin1';

in the top of your source, because the default is utf8 (which is the
Perl-variant of UTF-8).

I tried it but then I get a lot of errors when I run the script:

<FILENAME>: Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xe5) in pattern match (m//) at <FILENAME> line 841.

That "use" further limits what [:alpha:] and [:digit:] or \d will match,
so you can even change your regexp to:

 /\b[[:alpha:]]+(?:_?[[:alpha:]\d]+)*\b/

and then to:

 /\b[[:alpha:]]+(?:_?[^\W_]+)*\b/

OK, [^\W_] means "match a character that is not a (nonword-character or _)". Tricky with that double negation, but it seems to be the shortest expression.
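
A small sketch of what the double negation accepts, tried on a few sample characters. The helper name is_letter_or_digit is hypothetical, and the utf8::upgrade call is an assumption about how to get character semantics for bytes above 0x7F; it is not what code2html does.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a single character matches [^\W_],
# i.e. it is a word character other than the underscore.
sub is_letter_or_digit {
    my ($char) = @_;
    utf8::upgrade($char);   # assumption: force character semantics for 0x80-0xFF
    return $char =~ /\A[^\W_]\z/ ? 1 : 0;
}

# 'a' and '9' are word characters, '_' is excluded explicitly,
# '-' is a nonword character, and 0xE5 is the Latin-1 letter a-ring.
for my $char ('a', '9', '_', '-', "\xE5") {
    printf "0x%02X %s\n", ord($char),
        is_letter_or_digit($char) ? 'match' : 'no match';
}
```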

If an ending "_" is no problem, you can make it

 /\b[[:alpha:]]\w*/

That would be a problem since it is illegal for identifier names. This seems to be the shortest correct expression (that is optimized with a non-capturing group):

\b[[:alpha:]](?:_?[^\W_])*\b
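
One possible way to try this expression on Latin-1 input without touching the source encoding, offered only as a sketch: decode the bytes into a character string first, so that [[:alpha:]], \W and \b follow Unicode rules. The helper name ada_identifiers and the sample line are invented for the illustration; code2html itself may read its input differently.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, not part of code2html: returns every identifier
# found in a line of Latin-1 encoded source text.
sub ada_identifiers {
    my ($bytes) = @_;
    my $text = decode('ISO-8859-1', $bytes);  # bytes -> characters
    utf8::upgrade($text);                     # make sure Unicode rules apply
    return $text =~ /\b[[:alpha:]](?:_?[^\W_])*\b/g;
}

# Sample line: "a_" and "b__c" are not legal identifiers and yield no match.
my @ids = ada_identifiers("r\xE4ksm\xF6rg\xE5s_2 := a_ + b__c + x1;");

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @ids;
```

With this input the sketch finds only the first and last names, since neither a trailing underscore nor a double underscore leaves a word boundary where the pattern needs one.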

Now I just have to figure out how to use encoding 'latin1'.
