Exposing regexp engine & compiled regexp's

Filipe Brandenburger Fri, 05 Jan 2001 14:25:31 -0800

Hello.

I have some ideas (actually a wishlist) for the regular expression
subsystem (that's what it'll be, right?). I would appreciate if
perl6's regexp engine exposes the maximum possible interfaces, and
I'll expose why I think this would be nice.


1.  I think it should be possible to have ``incomplete matches''.

    Regexp's are interpreted by a NFA, that is a state machine.
    I think it would be nice if, when I try to match a regexp against
    a string, and the string ends before the regexp matches, it
    would be possible to find out in which state the NFA has stopped,
    and it would also be possible to start another match from
    that state on.



Suppose I'm reading from a file on a per-line basis. Suppose that
I want to strip off XML-style comments from the file. What I need
is to match the /<!--.*?-->/ regular expression and substitute it
for, say, the empty string. The problem is if the file contains
multi-line comments, where the ``<!--'' is in one line and the ``-->''
is in another posterior line, this method will fail. If, when I try
the match, I see that I have an incomplete match, I could take
the state of the regexp engine, and start matching with it in the
next line, and so on, until the engine ends the match.

Other possibility is to tokenize a file `a la lex'. For example,
if I want to parse a C source code file, I break it in identifiers,
constants, reserved-words, comments, etc. If I have the whole file
in a string, I can break it using the /\G.../gc regular expressions,
but reading a whole file in memory can be a problem sometimes. If
we have incomplete matches, we can read a block of data from the file,
try to match it against the regexp's and, if necessary, read more
blocks of data from the file until a match is done.

Now I had another idea with the exposing of states of the NFA. It
would be possible to join several /\G.../gc regular expressions into
one, using |, and, after the match, use the state the NFA stopped to
determine which regexp was matched.

Other possibility is to implement the equivalent of awk's RS. awk has
a RS variable similar to perl5's $/, but awk's RS can handle regular
expressions instead of $/'s strings. With NFA's states exposed it would
be possible to implement the same behaviour of awk's RS.


The possibilities are many. I think it wouldn't be difficult to expose
NFA's states as I imagine it. Afterall, probably the C code inside of
perl5 has them exposed. I have yet worked with XS and the fact that the
regular expression engine isn't exposed in the XS interface let's me
a little disappointed. There were times when I wanted to do things I
say here and I realised that even with XS I would have to dig for perl's
regexp engine or even use an external regexp engine to do the job.



2.  I think compiled regexp's should be analysable by perl code.

    Regexp's in their compiled states are (a little simplification here)
    a tree. I think it would be nice if there were ways of traversing
    this tree, to find properties of the regular expressions.



This would help several things. For example, consider a module that
executes searches in a database table, matching people's names
against regular expressions. Suppose now that the database column
where the names are stored is indexed by prefix and suffix. If it
were possible to traverse regular expressions, the module would
be able to optimize searches like /^John/ and /Smith$/, using the
indexes to reduce the number of records that the regexp should match.
Of course it would have to search through all the records for a middle
name, like /F\./, but the possibility to optimize some cases by
inspecting the compiled regexp is a big win, at least for me.


Traversing a compiled regexp also leads to implementing custom regexp
engines, for example, a engine that calculates the difference or distance
between the string and what would match the regexp. As with the first
case, the possibilities are many.



I actually don't know perl5's regexp engine and compiler, but I believe
it doesn't escape from my model (NFA with states & a tree). I agree with
the fact that code talks louder than speech, but I would like to know
if there's something going in this regexp area and if there really is
interest in this. Anyway, if there's something going here, I would like
very much to join the team, if I can contribute a bit to it. If there's
nothing going on and the list considers my idea good, I will try to
`kind of' implement it as a perl5 extension (oh, boy! that'll be hard!!!)
If I do it, I hope I get help from you in doing it!

Thank's.

Branden
Exposing regexp engine & compiled regexp's

Reply via email to