Dr.Ruud wrote:
Erik wrote:
I need to recognize latin1 letters in a regexp. How is it done? The
reason is that I want to fix a program called code2html, which is
written in Perl. I have the following regular expression for Ada
identifiers:
\\b[a-zA-Z](_?[a-zA-Z0-9])*\\b
but this is wrong because Ada identifiers may include any latin1 letter,
not just a-z and A-Z. Does anyone know how?
Thank you for your help!
For [a-zA-Z] you can use [[:alpha:]], see `perldoc perlre`.
OK, it seems to work the same.
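A small check of the difference between the two character classes, as a sketch only. It assumes Perl 5.14 or later so that the /u modifier is available; /u asks for Unicode rules, so code points 128..255 are classified as letters where appropriate. The identifier is made up for illustration.

```perl
use strict;
use warnings;

# "\xE5" is the Latin-1 code point for a-ring. The name is hypothetical.
my $name = "\xE5r";

# ASCII-only class: the a-ring is not in a-zA-Z.
my $ascii_only = $name =~ /\A[a-zA-Z]+\z/ ? 1 : 0;

# POSIX class under Unicode rules (/u): the a-ring counts as a letter.
my $posix = $name =~ /\A[[:alpha:]]+\z/u ? 1 : 0;

print "ascii_only=$ascii_only posix=$posix\n";
```

Without /u, a locale, or a decoded string, [[:alpha:]] may fall back to ASCII rules for bytes above 127, which would explain why it "seems to work the same" as [a-zA-Z].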
Why are the double backslashes there? I assume the \b's are meant as
word boundaries.
I do not know. That is just the way the file was written by someone
else, see:
http://code2html.cvs.sourceforge.net/code2html/code2html/code2html?revision=1.13&view=markup
/\b[[:alpha:]](?:_?[[:alpha:]0-9])*\b/
The (?:...) is a non-capturing group.
OK, a performance improvement that does not change the output I suppose.
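To illustrate that the two forms match the same text and differ only in what they store, here is a minimal sketch. The sample identifier and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $text = "Max_Value2";    # hypothetical identifier, for illustration only

# Capturing group: the match succeeds and $1 keeps the last repetition.
my ($whole_cap, $last_cap);
if ($text =~ /\b[[:alpha:]](_?[[:alpha:]0-9])*\b/) {
    ($whole_cap, $last_cap) = ($&, $1);
}

# Non-capturing group: the same text matches, but $1 is not set.
my ($whole_nc, $last_nc);
if ($text =~ /\b[[:alpha:]](?:_?[[:alpha:]0-9])*\b/) {
    ($whole_nc, $last_nc) = ($&, $1);
}

printf "capturing:     %s, \$1 = %s\n", $whole_cap, $last_cap;
printf "non-capturing: %s, \$1 = %s\n", $whole_nc,
    defined $last_nc ? $last_nc : "(undef)";
```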
If you don't want to allow "a_", you can make it
/\b[[:alpha:]]+(?:_?[[:alpha:]0-9]+)*\b/
That does not seem to be necessary, because the previous expression
already seems to exclude "a_".
It is a good idea to put a
use encoding 'latin1';
at the top of your source, because the default is utf8 (which is the
Perl-variant of UTF-8).
I tried it but then I get a lot of errors when I run the script:
<FILENAME>: Malformed UTF-8 character (unexpected non-continuation byte
0x00, immediately after start byte 0xe5) in pattern match (m//) at
<FILENAME> line 841.
That "use" further limits what [:alpha:] and [:digit:] or \d will match,
so you can even change your regexp to:
/\b[[:alpha:]]+(?:_?[[:alpha:]\d]+)*\b/
and then to:
/\b[[:alpha:]]+(?:_?[^\W_]+)*\b/
OK, [^\W_] means "match a character that is not a (nonword-character or
_)". Tricky with that double negation, but it seems to be the shortest
expression.
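One way to convince yourself that the double negation is equivalent to the longer class, at least for latin1, is to compare the two over every code point from 0 to 255. This is only a sketch; it assumes Perl 5.14 or later for the /u modifier, and it says nothing about characters outside that range.

```perl
use strict;
use warnings;

# Collect every code point in 0..255 where the two classes disagree.
my @differ = grep {
    my $c     = chr($_);
    my $short = $c =~ /\A[^\W_]\z/u        ? 1 : 0;
    my $long  = $c =~ /\A[[:alpha:]\d]\z/u ? 1 : 0;
    $short != $long;
} 0 .. 255;

print "code points that differ: ", scalar(@differ), "\n";
```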
If an ending "_" is no problem, you can make it
/\b[[:alpha:]]\w*/
That would be a problem, since a trailing "_" is illegal in identifier
names. This seems to be the shortest correct expression (optimized with
a non-capturing group):
\b[[:alpha:]](?:_?[^\W_])*\b
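A quick way to try the expression against a few names, as a sketch. The helper is_identifier and the sample names are made up for this example, the \A and \z anchors are added only so that the whole string is tested, and /u (Perl 5.14 or later) is assumed so that latin1 letters count as letters.

```perl
use strict;
use warnings;

# The expression from this thread, compiled once with Unicode rules.
my $ident = qr/\b[[:alpha:]](?:_?[^\W_])*\b/u;

# Hypothetical helper: true if the whole string is one identifier.
sub is_identifier {
    my ($s) = @_;
    return $s =~ /\A$ident\z/ ? 1 : 0;
}

# Sample names; "\xE5" is the latin1 a-ring.
for my $name ("a", "Max_Value2", "\xE5r_2", "a_", "a__b", "_a", "2a") {
    (my $shown = $name) =~ s/\xE5/<a-ring>/g;    # keep the output ASCII
    printf "%-12s %s\n", $shown, is_identifier($name) ? "ok" : "rejected";
}
```

The first three names should be accepted and the last four rejected: a trailing underscore, a double underscore, a leading underscore and a leading digit.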
Now I just have to figure out how to use encoding 'latin1'.
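An alternative that avoids the pragma altogether (it was deprecated in later Perl releases) is to decode the input explicitly with the core Encode module and match on the decoded text. This is a sketch under assumptions, not how code2html does it: the input line is made up, and /u needs Perl 5.14 or later.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw latin1 bytes, as they might be read from an Ada source file.
# "\xE5" is the a-ring; the line itself is hypothetical.
my $bytes = "\xE5r_2 := 1;";

# Turn the bytes into a character string before matching.
my $line = decode('ISO-8859-1', $bytes);

# Pull out the first identifier with the expression from this thread.
my ($id) = $line =~ /(\b[[:alpha:]](?:_?[^\W_])*\b)/u;

print "identifier length: ", length($id), "\n";
```

When reading from a file, the same decoding can be requested at open time with the layer '<:encoding(ISO-8859-1)'.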