On 1 March 2016 at 20:35, Marc André Tanner <m...@brain-dump.org> wrote:
> As an example, swapping two words with
>
> ,x/[a-zA-Z]+/{
> g/fred/ v/...../ c/jim/
> g/jim/ v/..../ c/fred/
> }
>
> which is mentioned in both the sam tutorial and the cheatsheet would
> no longer work.
That's true; I suppose it depends how useful that is in general. It might
be nice to have both behaviours really, one 'seq' and one 'par' (cf. ';'
and '&' in sh). The former would suffice for most use cases, and would be
much faster. The drawback is that we'd still have to implement the
parallel version anyway (though we could leave that for later).

On 1 March 2016 at 20:56, Marc André Tanner <m...@brain-dump.org> wrote:
> Misunderstanding on my part I guess. These are just limitations of
> the regex(3) API but have no particular relevance for *structural*
> regular expressions. I.e. it is possible to build a structural
> regexp library on top of this crippled interface, it probably just
> won't be particularly efficient.

Fair enough. It has those limitations in common with every other regular
expression library API I know of, though. Of course, it would be
theoretically possible to build structural regular expressions on top of
regex(3), but it would be an abomination.

> For others wanting to take a look, I guess you are referring to
> the following papers:
>
> - Two-pass greedy regular expression parsing
> - Optimally Streaming Greedy Regular Expression Parsing

Yes indeed.

> What is, in your opinion, the current state-of-the-art technique/algorithm?

Well, I think the 'two-pass' approach, described in the first of those
papers, is particularly interesting.

There has been a big resurgence of interest in Thompson's VM-style
approach to regular expressions, probably mostly because of Russ Cox's
excellent write-ups on the subject. That's fantastic, because it's
clearly superior to the backtracking approach. But all of those benefits
are thrown out of the window as soon as you start looping over the
matches of a sufficiently nondeterministic expression.

Consider the following two structural regular expressions, which have
identical matches in this file:

  $ time ssam 'x/./' rc.1 >/dev/null

  real    0m0.020s
  user    0m0.008s
  sys     0m0.004s

  $ time ssam 'x/(.*\n)+\n|./' rc.1 >/dev/null

  real    0m9.730s
  user    0m9.672s
  sys     0m0.048s

The matches are identical, and yet the latter takes vastly longer --
quadratic in the size of the file, in fact. I was originally going to
use my dictionary file as the example, but it never finished and I had
to kill sam. Yet, using the 'two-pass' approach, this could be done in
linear time.

The 'lean log' they use in their paper would take a bit more space,
about one bit per byte for the above example, but I think if we did it
per block we could cut that down massively, perhaps to as little as one
bit per state per block, which would be much more manageable.

I haven't thought so much about the 'optimally streaming' part yet. I
don't think it would be necessary to go to the lengths they do in their
paper, since we don't actually need strict optimality, but it would be
nice if we could reduce the amount of data we have to read in order to
get a match.

One nice benefit of this is that it would let us strike off the bug
noted in ssam(1): "Ssam consumes all of standard input before running
the script." If we were to stream data through a suitably lazy regular
expression engine, this should be fairly easy to fix in a future ssam
implementation.

Thanks,
cls
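
P.S. To put the timings above in context outside of sam: the program
below is only a sketch (plain regex(3) against a made-up buffer of "x\n"
lines; it has nothing to do with sam's actual matcher, and the size is
arbitrary), but it shows why no amount of cleverness inside a single-shot
match call rescues the x/(.*\n)+\n|./ loop. To rule the first alternative
out, the engine has to look at everything after the current offset, so
each one-character match costs a scan of the rest of the buffer.

#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Illustrative only: "x\n" repeated, with no blank line anywhere,
       so "(.*\n)+\n" can never match and every match is a single
       character.  (regex(3) '.' also matches newline here, unlike
       sam's, but that doesn't change the point.)  Double n and the
       run time roughly quadruples. */
    size_t n = 1 << 15;
    char *buf = malloc(n + 1);
    if (buf == NULL)
        return 1;
    for (size_t i = 0; i < n; i++)
        buf[i] = (i % 2) ? '\n' : 'x';
    buf[n] = '\0';

    regex_t re;
    if (regcomp(&re, "(.*\n)+\n|.", REG_EXTENDED) != 0)
        return 1;

    /* What an x/.../ loop boils down to with regex(3): match, step past
       the match, repeat.  Each regexec() call has to examine the whole
       tail of the buffer before it can give up on the first alternative,
       so the loop does O(n^2) work even though the matches it reports
       are the same single characters that x/./ finds in O(n). */
    size_t off = 0, nmatch = 0;
    regmatch_t m;
    while (off < n && regexec(&re, buf + off, 1, &m, 0) == 0) {
        nmatch++;
        off += m.rm_eo > 0 ? m.rm_eo : 1;   /* guard against empty matches */
    }
    printf("%zu matches\n", nmatch);

    regfree(&re);
    free(buf);
    return 0;
}

An engine that shares that scanning between matches (which is what the
two-pass machinery would let us do) gets the same answers in linear
time; regex(3) can't, because every call starts from scratch.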
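
P.P.S. To put rough numbers on the 'lean log' point (the block size and
automaton size here are just illustrative guesses on my part): at one bit
per input byte, a 1 MiB buffer needs about 128 KiB of log. If instead we
only recorded the live state set at block boundaries, say a 16-state
automaton checkpointed every 4 KiB, that is one bit per state per block,
or 2 bytes per block: about 512 bytes for the same 1 MiB, presumably at
the price of re-running the forward simulation inside a block whenever
the backward pass needs it.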