Re: Regular expression for latin1 letters?

Dr.Ruud Sat, 13 May 2006 06:07:52 -0700

Erik wrote:
> Dr.Ruud:
>> Erik:

>>> I need to recognize latin1 letters in a regexp. How is it done? The
>>> reason is that I want to fix a program called code2html, which is
>>> written in Perl. I have the following regular expression for Ada
>>> identifiers:
>>> \\b[a-zA-Z](_?[a-zA-Z0-9])*\\b
>>>
>>> but this is wrong because Ada identifiers include any latin1
>>> letters, not just a-z and A-Z. Anyone knows how?
>>
>> For [a-zA-Z] you can use [[:alpha:]], see `perldoc perlre`.
>
> OK, it seems to work the same.


Yes, but [:alpha:] matches a few thousand extra characters in utf8-mode.
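
A minimal sketch of that difference, assuming Perl 5.14 or later (for the /u modifier). The explicit character class is my own assumption about which Latin-1 code points count as letters (A-Z, a-z, U+00C0-U+00D6, U+00D8-U+00F6, U+00F8-U+00FF); check it against the Ada reference manual before relying on it.

```perl
use 5.014;
use warnings;

# Assumed set of Latin-1 letters: the ASCII letters plus the accented
# ranges, skipping U+00D7 (multiplication sign) and U+00F7 (division sign).
my $latin1_letter = qr/[A-Za-z\xC0-\xD6\xD8-\xF6\xF8-\xFF]/;

# 'a', a-ring, multiplication sign, Greek small alpha
for my $cp (0x61, 0xE5, 0xD7, 0x3B1) {
    my $ch = chr $cp;
    printf "U+%04X  [[:alpha:]]: %-3s  explicit class: %s\n",
        $cp,
        ($ch =~ /[[:alpha:]]/u ? 'yes' : 'no'),
        ($ch =~ $latin1_letter ? 'yes' : 'no');
}
```

Under Unicode rules the Greek alpha is accepted by [[:alpha:]] but rejected by the explicit class, which is the kind of extra match meant above.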


>> Why are the double backslashes there? I assume the \b's are meant as
>> word boundaries.
>
> I do not know. That is just the way the file was written by someone
> else, see:
>
> http://code2html.cvs.sourceforge.net/code2html/code2html/code2html?revision=1.13&view=markup

OK, maybe a later interpolation step eats the extra backslashes.
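
One way that could happen, as a sketch only: I have not checked how code2html actually stores its patterns, so the variable $stored below is hypothetical. In a single-quoted Perl string, each \\ collapses to one backslash before the regex engine ever sees it.

```perl
use strict;
use warnings;

# Hypothetical: the pattern is kept as a single-quoted string.
my $stored = '\\b[a-zA-Z](_?[a-zA-Z0-9])*\\b';

# The string itself now holds single backslashes.
print "$stored\n";

# Compiled as a regex, \b is therefore an ordinary word boundary.
my $re = qr/$stored/;
print "matches Foo_Bar\n" if 'Foo_Bar' =~ $re;
```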


>>   /\b[[:alpha:]](?:_?[[:alpha:]0-9])*\b/
>> The (?:...) is a non-capturing group.
>
> OK, a performance improvement that does not change the output I
> suppose.

Right. That code2html is full of those superfluous capture groups, so I
guess there is no real need to use (?:...).
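
For what it is worth, the visible difference between the two forms is only in what gets captured. A small sketch (the sample identifier is made up):

```perl
use strict;
use warnings;

my $id = 'Foo_Bar';

# In list context a successful match returns the captured substrings,
# or (1) when the pattern has no capture groups.
my @with_capture    = $id =~ /\b[[:alpha:]](_?[[:alpha:]0-9])*\b/;
my @without_capture = $id =~ /\b[[:alpha:]](?:_?[[:alpha:]0-9])*\b/;

print "capturing:     (@with_capture)\n";     # last repetition of the group
print "non-capturing: (@without_capture)\n";  # just a true value
```

Both patterns accept the same strings; the capturing one additionally remembers the last repetition of the group in $1.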


>> If you don't want to allow "a_", you can make it
>>   /\b[[:alpha:]]+(?:_?[[:alpha:]0-9]+)*\b/
>
> That does not seem to be necessary, because the previous expression
> already seems to exclude "a_".

Oops, you're right, the [a-zA-Z](_?[a-zA-Z0-9])* did it already.
I inserted the '+' there, because often more than one letter follows an
underscore.
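
A quick check of that, as a sketch with made-up test strings. Both variants should accept and reject the same candidates; the trailing \b is what rules out "a_", because there is no word boundary between a letter and an underscore.

```perl
use strict;
use warnings;

my $original  = qr/\b[a-zA-Z](?:_?[a-zA-Z0-9])*\b/;
my $with_plus = qr/\b[a-zA-Z]+(?:_?[a-zA-Z0-9]+)*\b/;

for my $candidate (qw(a a_ a_b a__b foo_bar1)) {
    printf "%-9s original: %-3s  with '+': %s\n",
        $candidate,
        ($candidate =~ $original  ? 'yes' : 'no'),
        ($candidate =~ $with_plus ? 'yes' : 'no');
}
```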


>> It is a good idea to put an
>>    use encoding 'latin1';
>> in the top of your source, because the default is utf8 (which is the
>> Perl-variant of UTF-8).
>
> I tried it but then I get a lot of errors when I run the script:
> <FILENAME>: Malformed UTF-8 character (unexpected non-continuation
> byte 0x00, immediately after start byte 0xe5) in pattern match (m//)
> at <FILENAME> line 841.

Which version of Perl did you use?
That code2html source uses \0 in a few places. Maybe these error
messages can be silenced by a "no encoding" in the block.

I used perl, v5.8.6 built for i386-freebsd-64int, and inserted a "use
encoding 'latin1';" at the top, and then I made it colorize itself:

$ perl code2html.pl code2html.pl code2html.html
code2html.pl: language mode not given. guessing...
code2html.pl: using 'perl'
$

And the resulting html-file seems OK.


>> That "use" further limits what [:alpha:] and [:digit:] or \d will
>> match, so you can even change your regexp to:
>>   /\b[[:alpha:]]+(?:_?[[:alpha:]\d]+)*\b/
>> and then to:
>>   /\b[[:alpha:]]+(?:_?[^\W_]+)*\b/
>
> OK, [^\W_] means "match a character that is not a (nonword-character
> or _)". Tricky with that double negation, but it seems to be the
> shortest expression.

Yeah.
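
To make the double negation concrete, here is a small sketch comparing \w with [^\W_] on a few ASCII characters (the sample characters are arbitrary):

```perl
use strict;
use warnings;

for my $ch ('a', 'Z', '5', '_', '-', ' ') {
    printf "'%s'  \\w: %-3s  [^\\W_]: %s\n",
        $ch,
        ($ch =~ /\w/     ? 'yes' : 'no'),
        ($ch =~ /[^\W_]/ ? 'yes' : 'no');
}
```

The only row where the two columns differ is the underscore: \w accepts it, [^\W_] does not.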


>> If an ending "_" is no problem, you can make it
>>   /\b[[:alpha:]]\w*/
>
> That would be a problem since it is illegal for identifier names. This
> seems to be the shortest correct expression (that is optimized with
> non-capturing group):
>     \b[[:alpha:]](?:_?[^\W_])*\b
>
> Now I just have to figure out how to use encoding 'latin1'.

With the x-modifier, you can use whitespace (and comments) to make the
regexp more readable:

  m/                    # match
    \b                  # word boundary (zero width)
    [[:alpha:]]         # one alphabetic (ISO-8859-1 encoding presumed)
    (?:                 # start a non-capturing group
       _?               #   optional underscore
       [[:alnum:]]+     #   alphanumerics (ISO-8859-1 encoding presumed)
    )*                  # zero or more times
    \b                  # word boundary (zero width)
  /x                    # (see `perldoc perlre`)
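
A sketch of how such a commented pattern could be used on Latin-1 input without the encoding pragma, assuming Perl 5.14 or later. The sample line, the variable names, and the choice to decode with the core Encode module are my own assumptions, not something taken from code2html.

```perl
use 5.014;
use warnings;
use Encode qw(decode);

my $ada_identifier = qr{
    \b                  # word boundary (zero width)
    [[:alpha:]]         # one alphabetic character
    (?:                 # start a non-capturing group
       _?               #   optional underscore
       [[:alnum:]]+     #   one or more alphanumerics
    )*                  # zero or more times
    \b                  # word boundary (zero width)
}xu;                    # /u: Unicode rules, so e-acute counts as a letter

# Hypothetical Ada-like line, given as raw ISO-8859-1 bytes.
my $bytes = "R\xE9sum\xE9_Count := Total_1 + x_;";
my $line  = decode('ISO-8859-1', $bytes);

my @ids = $line =~ /($ada_identifier)/g;

binmode STDOUT, ':encoding(UTF-8)';
say for @ids;
```

The trailing "x_" is not reported, since an identifier may not end in an underscore. Note that with /u the class [[:alpha:]] is not limited to Latin-1; that limit only holds here because the input was decoded from ISO-8859-1.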

These comments might seem excessive, but look at the code2html source:
there is a lot of room for improvement (at least on the regex side), and
comments like these may spark the interest of the people who read that
source.

-- 
Affijn, Ruud

"Gewoon is een tijger."


