G'day,

I'm in the throes of a massive rewrite of "hstbm", which combined a very-quick-and-dirty lexer, no parser, and an optimised Boyer-Moore-style search to which I had made some incremental improvements. The only release is at savannah.nongnu.org:
        https://savannah.nongnu.org/projects/hstbm

It's been over six years since the first and only release; Lua fans will note that I have had another project that has been active over that time, intended to help people use a scientific/technical toolkit on a range of GNU/Linux machines.

I'm now in the process of trialling an all-singing, all-dancing lexer, with the philosophy that it tries to capture the pattern syntax and semantics without resorting to parser constructs such as an AST.

[I'm currently at a hairy point where the meaning of characters such as "^" can vary, based on constructs such as "(" (start-of-group) and/or "|" (alternation)... where does lexing stop and parsing start?!]

One thing that the lexer captures is predicates, e.g. those relating to IS_WORD: "IS_WORD_YES" (0x01), "IS_WORD_NO" (0x10), and "IS_WORD_MAYBE" (0x11).

I've found some patterns containing word start/end boundary checks that are impossible to match in practice, e.g.:

        a\<b
        [abc]\>[def]

Grep does not recognise these cases, and so spends time ploughing through the text for a match that can never occur.

My lexing code, in contrast, sees the sequence "IS_WORD_YES" "\<" "IS_WORD_YES" (or, equivalently, pairs of "IS_WORD_NO"), and arranges the lexical token stream such that the very first token is (effectively) MATCH_FAILED -- without any effort to inspect the haystack buffer. This can reduce runtimes for large haystack input from seconds to milliseconds.

While this is not a terribly common case, it's an easy item to check for; it's possible that, in the future, patterns may become less hand-crafted and more machine-crafted, and so this case may become more relevant.

cheers,

sur-behoffski (Brenton Hoff)
programmer, Grouse Software

["sur-" means "meta-"; it's a commentary on a peculiar Australian event: see "Tony Abbott" + "Captain's Pick" + "Prince Philip". Absolutely no disrespect is intended to Honour-receivers at any level; I am grateful for your service, and how you have enriched society.]
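
To illustrate the impossible-boundary check described above, here is a minimal Python sketch. It is not hstbm's code. The flag values are the ones quoted in the post, but the token layout (tuples of kind and characters), the helper names (word_class, neighbour_class, can_never_match) and the ASCII-only definition of a word character are all assumptions made for the example. Anything the sketch cannot classify, such as the edge of the pattern, is conservatively treated as IS_WORD_MAYBE.

```python
import string

# Flag values as quoted in the post; MAYBE is the bitwise OR of YES and NO.
IS_WORD_YES = 0x01
IS_WORD_NO = 0x10
IS_WORD_MAYBE = IS_WORD_YES | IS_WORD_NO

# Assumption: ASCII letters, digits and underscore count as word characters.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")

# Hypothetical token kinds: ("set", chars) matches one character from chars;
# ("bow", "") is \< (start of word); ("eow", "") is \> (end of word).
# Each boundary needs (class of left neighbour, class of right neighbour).
BOUNDARY_NEEDS = {
    "bow": (IS_WORD_NO, IS_WORD_YES),
    "eow": (IS_WORD_YES, IS_WORD_NO),
}


def word_class(chars):
    """Return the IS_WORD_* flags covering every character in chars."""
    flags = 0
    for c in chars:
        flags |= IS_WORD_YES if c in WORD_CHARS else IS_WORD_NO
    return flags


def neighbour_class(tokens, i):
    """Classify token i; pattern edges and non-set tokens are MAYBE."""
    if 0 <= i < len(tokens) and tokens[i][0] == "set":
        return word_class(tokens[i][1])
    return IS_WORD_MAYBE


def can_never_match(tokens):
    """True if some word boundary is contradicted by its fixed neighbours."""
    for i, (kind, _) in enumerate(tokens):
        if kind not in BOUNDARY_NEEDS:
            continue
        need_left, need_right = BOUNDARY_NEEDS[kind]
        left = neighbour_class(tokens, i - 1)
        right = neighbour_class(tokens, i + 1)
        # A neighbour with no overlap with the needed class rules out a match.
        if not (left & need_left) or not (right & need_right):
            return True
    return False


# The two patterns from the post: a\<b and [abc]\>[def]
pattern_1 = [("set", "a"), ("bow", ""), ("set", "b")]
pattern_2 = [("set", "abc"), ("eow", ""), ("set", "def")]
print(can_never_match(pattern_1), can_never_match(pattern_2))
```

A lexer that gets True back here could emit its MATCH_FAILED token straight away and never touch the haystack. Note that the check only looks at immediate neighbours; groups, alternation and repetition would need more care than this sketch gives them.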