Erik wrote:
> Dr.Ruud:
>> Erik:

>>> I need to recognize latin1 letters in a regexp. How is it done? The
>>> reason is that I want to fix a program called code2html, which is
>>> written in Perl. I have the following regular expression for Ada
>>> identifiers:
>>>   \\b[a-zA-Z](_?[a-zA-Z0-9])*\\b
>>>
>>> but this is wrong because Ada identifiers include any latin1
>>> letters, not just a-z and A-Z. Does anyone know how?
>>
>> For [a-zA-Z] you can use [[:alpha:]], see `perldoc perlre`.
>
> OK, it seems to work the same.

Yes, but [:alpha:] matches a few thousand extra characters in utf8-mode.

>> Why are the double backslashes there? I assume the \b's are meant as
>> word boundaries.
>
> I do not know. That is just the way the file was written by someone
> else, see:
> http://code2html.cvs.sourceforge.net/code2html/code2html/code2html?revision=1.13&view=markup

OK, maybe there is a later interpolation step that eats the extra
backslashes.

>>   /\b[[:alpha:]](?:_?[[:alpha:]0-9])*\b/
>> The (?:...) is a non-capturing group.
>
> OK, a performance improvement that does not change the output I
> suppose.

Right. That code2html is full of those superfluous capture groups, so I
guess there is no real need to use (?:...).

>> If you don't want to allow "a_", you can make it
>>   /\b[[:alpha:]]+(?:_?[[:alpha:]0-9]+)*\b/
>
> That does not seem to be necessary, because the previous expression
> already seems to exclude "a_".

Oops, you're right, the [a-zA-Z](_?[a-zA-Z0-9])* did it already. I
inserted the '+' there because often more than one letter follows an
underscore.

>> It is a good idea to put an
>>   use encoding 'latin1';
>> in the top of your source, because the default is utf8 (which is the
>> Perl-variant of UTF-8).
>
> I tried it but then I get a lot of errors when I run the script:
> <FILENAME>: Malformed UTF-8 character (unexpected non-continuation
> byte 0x00, immediately after start byte 0xe5) in pattern match (m//)
> at <FILENAME> line 841.

Which version of Perl did you use? That code2html source uses \0 in a
few places. Maybe these error messages can be silenced by a "no encoding"
in the affected block.

I used perl, v5.8.6 built for i386-freebsd-64int, and inserted a
"use encoding 'latin1';" at the top, and then I made it colorize itself:

  $ perl code2html.pl code2html.pl code2html.html
  code2html.pl: language mode not given. guessing...
  code2html.pl: using 'perl'
  $

And the resulting html-file seems OK.
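
The point above about "a_" can be checked with a short script. This is only a sketch: the two patterns are copied from this thread, but the sample strings are made up for illustration and are ASCII-only, so no encoding question arises yet.

```perl
use strict;
use warnings;

# The two patterns discussed above, copied verbatim from the thread.
my $plain = qr/\b[[:alpha:]](?:_?[[:alpha:]0-9])*\b/;
my $plus  = qr/\b[[:alpha:]]+(?:_?[[:alpha:]0-9]+)*\b/;

# Hypothetical sample inputs (ASCII only).
my @samples = qw(a a_ a_b a__b abc_def1 9a _a);

# For each sample, record whether each pattern finds a match anywhere.
my %result;
for my $s (@samples) {
    $result{$s} = [ ($s =~ $plain ? 1 : 0), ($s =~ $plus ? 1 : 0) ];
    printf "%-10s plain=%d plus=%d\n", $s, @{ $result{$s} };
}
```

Both patterns reject "a_" because \b cannot match between a letter and an underscore: both count as word characters.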

>> That "use" further limits what [:alpha:] and [:digit:] or \d will
>> match, so you can even change your regexp to:
>>   /\b[[:alpha:]]+(?:_?[[:alpha:]\d]+)*\b/
>> and then to:
>>   /\b[[:alpha:]]+(?:_?[^\W_]+)*\b/
>
> OK, [^\W_] means "match a character that is not a (nonword-character
> or _)". Tricky with that double negation, but it seems to be the
> shortest expression.

Yeah.

>> If an ending "_" is no problem, you can make it
>>   /\b[[:alpha:]]\w*/
>
> That would be a problem since it is illegal for identifier names. This
> seems to be the shortest correct expression (that is optimized with
> non-capturing group):
>   \b[[:alpha:]](?:_?[^\W_])*\b
>
> Now I just have to figure out how to use encoding 'latin1'.

With the x-modifier, you can use whitespace (and comments) to make the
regexp more readable:

  m/                  # match
    \b                # word boundary (zero width)
    [[:alpha:]]       # alphabeticals (ISO-8859-1 encoding presumed)
    (?:               # start a non-capturing group
      _?              # optional underscore
      [[:alnum:]]+    # alphanumericals (ISO-8859-1 encoding presumed)
    )*                # zero or more times
    \b                # word boundary (zero width)
  /x                  # (see `perldoc perlre`)

These comments might seem excessive, but look at the code2html source:
a lot of improvement is possible (at least on the regex side) by
sparking the interest of the people who read that source.

-- 
Affijn, Ruud

"Gewoon is een tijger."
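
As a possible alternative to "use encoding 'latin1';", the input can be decoded explicitly before matching. The sketch below uses decode() from the core Encode module, which is a different technique from the pragma discussed above. The helper name is_ada_identifier, the \A and \z anchors, and the sample strings are assumptions made for illustration; none of them come from code2html.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Erik's final pattern from the thread.
my $ident = qr/\b[[:alpha:]](?:_?[^\W_])*\b/;

# Hypothetical helper (not part of code2html): takes ISO-8859-1 octets
# and returns 1 if the whole string is an identifier, else 0.
sub is_ada_identifier {
    my ($octets) = @_;
    # Turn the Latin-1 bytes into Perl characters, so that [[:alpha:]]
    # and \W are applied to characters rather than raw bytes.
    my $text = decode('ISO-8859-1', $octets);
    # \A and \z are an addition: they force the whole string to match.
    return $text =~ /\A$ident\z/ ? 1 : 0;
}

# Hypothetical samples, given as [ ASCII label, Latin-1 octets ].
my @samples = (
    [ 'a-ring + ngstr + o-umlaut + m', "\xE5ngstr\xF6m" ],
    [ 'l + a-umlaut + nge_2',          "l\xE4nge_2"     ],
    [ 'a_',                            "a_"             ],
    [ 'x__y',                          "x__y"           ],
    [ '2fast',                         "2fast"          ],
);

for my $sample (@samples) {
    my ($label, $octets) = @$sample;
    printf "%-32s %s\n", $label,
        is_ada_identifier($octets) ? 'identifier' : 'rejected';
}
```

Because the text was decoded from Latin-1 octets, only Latin-1 characters can occur in it, so the wider Unicode meaning of [[:alpha:]] does not admit anything from outside that range here.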