On Sep-01, Dan Sugalski wrote:
>
> This is a list of the semantics that I see as needed for a regex
> engine. When we have 'em, we'll map them to string ops, and may well
> add in some special-case code for faster access.
>
> *) extract substring
> *) exact string compare
> *) find string in string
> *) find first character of class X in string
> *) find first character not of class X in string
> *) find boundary between X and not-X
> *) Find boundary defined by arbitrary code (mainly for word breaks)

Huh? What do you mean by "semantics"? The only semantics needed are the minimum necessary to answer the question "is the fred at offset i equal to the fred X?" (Sorry, not sure if fred is actually character or codepoint or whatever, and is probably all of them at different levels.)

We also almost certainly need to be able to do character class comparisons, although if you assume that you can always transcode to what the regex was compiled with, then you don't even need that -- instead, you need to be able to convert to something like a difference list of numbered freds. But if we're talking about semantics, then yes you need the character class manipulation.

Everything else in this list sounds like optimizations to me, and probably not the right optimizations (I don't think it's possible to predict what will be useful yet.) For other things that parrot will be used for, I suspect that the first 3 will be needed.

I'm curious as to how you came up with that list; it seems to imply a particular way of implementing the grammar engine. I would expect all of that, barring certain optimizations, to be done directly with existing pasm instructions. There will be a need for saving a stack of former values of hypothetical variables, which can also be done with pasm ops but might interact with overloaded assignment or something wacky like that.
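
One way to read the argument above, as a minimal sketch only: it is in Python rather than pasm, purely for readability, and it assumes a "fred" is simply one element of a Python string. Every name in it (fred_eq, fred_in_class, find_string, find_first_in_class, find_boundary, Trail) is hypothetical and is not an existing Parrot op. It shows the search operations from the quoted list built on top of the two primitives, plus a stack of former values for hypothetical variables that gets unwound on backtracking.

```python
# Illustrative sketch; all names are invented, none are Parrot ops.

_UNSET = object()  # marker for "variable had no previous value"


def fred_eq(s, i, x):
    """The one required primitive: is the fred at offset i equal to x?"""
    return 0 <= i < len(s) and s[i] == x


def fred_in_class(s, i, cls):
    """Character-class test: is the fred at offset i a member of cls?"""
    return 0 <= i < len(s) and s[i] in cls


def find_string(s, needle, start=0):
    """'find string in string', using nothing but fred_eq."""
    for pos in range(start, len(s) - len(needle) + 1):
        if all(fred_eq(s, pos + k, needle[k]) for k in range(len(needle))):
            return pos
    return -1


def find_first_in_class(s, cls, start=0, negate=False):
    """First offset whose fred is (or, with negate, is not) in cls."""
    for pos in range(start, len(s)):
        if fred_in_class(s, pos, cls) != negate:
            return pos
    return -1


def find_boundary(s, cls, start=0):
    """First offset p >= start where s[p-1] and s[p] differ in membership."""
    for pos in range(max(start, 1), len(s)):
        if fred_in_class(s, pos - 1, cls) != fred_in_class(s, pos, cls):
            return pos
    return -1


class Trail:
    """Stack of former values of hypothetical variables."""

    def __init__(self):
        self.vars = {}
        self._saved = []

    def mark(self):
        """Remember the current depth so we can backtrack to it later."""
        return len(self._saved)

    def let(self, name, value):
        """Hypothetically assign, saving the former value first."""
        self._saved.append((name, self.vars.get(name, _UNSET)))
        self.vars[name] = value

    def undo_to(self, mark):
        """Backtrack: restore former values down to the given mark."""
        while len(self._saved) > mark:
            name, old = self._saved.pop()
            if old is _UNSET:
                self.vars.pop(name, None)
            else:
                self.vars[name] = old
```

A matcher would call mark() before trying an alternative and undo_to() when that alternative fails. Note that this sketch assigns through a plain dict, so it sidesteps the overloaded-assignment interaction mentioned above rather than addressing it.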