G'day,

I'm in the throes of a massive rewrite of "hstbm", which combined
a very-quick-and-dirty lexer, no parser, and an optimised
Boyer-Moore-style search where I had made some incremental
improvements.  The only release is at savannah.non-gnu.org:

      https://savannah.nongnu.org/projects/hstbm

It's been over six years since the first and only release; Lua
fans will note that I have had another project that has been
active over that time, intended to help people use a
scientific/technical toolkit on a range of GNU/Linux machines.

--

I'm now in the process of trialling an all-singing, all-dancing
lexer, with the philosophy that it tries to capture the pattern
syntax and semantics, without resorting to parser constructs
such as an AST.  [I'm currently at a hairy point where the meaning
of characters such as "^" can vary, based on constructs such as
"(" (start-of-group) and/or "|" (alternation)... where does lexing
stop and parsing start?!]

One thing that is captured is predicates e.g. relating to IS_WORD:
         "IS_WORD_YES"   (0x01),
         "IS_WORD_NO"    (0x10), and
         "IS_WORD_MAYBE" (0x11).

I've found some patterns containing word start/end boundary checks
that are impossible to match in practice, e.g.:

        a\<b
        [abc]\>[def]

Grep does not recognise these cases, and so spends time ploughing
through the text for a match that can never occur.  My lexing code,
in contrast, sees the "IS_WORD_YES" "\<" "IS_WORD_YES" (or,
equivalently, pairs of "IS_WORD_NO"), and arranges the lexical
token stream such that the very first token is (effectively)
MATCH_FAILED -- without any effort to inspect the haystack buffer.
This can reduce runtimes for large haystack input from seconds to
milliseconds.

While this is not a terribly common case, it's an easy item to
check for; it's possible that, in the future, patterns may become
less hand-crafted and more machine-crafted, and so this case may
become more relevant.

cheers,

sur-behoffski (Brenton Hoff)
programmer, Grouse Software

["sur-" means "meta-", it's a commentary on a peculiar Australian
event:  See "Tony Abbott" + "Captain's Pick" + "Prince Philip".
Absolutely no disrespect is intended to Honour-receivers at any level;
I am grateful for your service, and how you have enriched society.]



Reply via email to