On 18 Sep 2010, at 09:15, Hans Lodder wrote:
I am currently performing Search Engine Optimization (SEO) on the HTML
pages of my web site (on Win XP Home SP3). To do that, it is important
to know which 3 words are used most frequently on the page, so I wrote
a cross-referencer (in C) to find them. The 2nd step is to find the 3
most frequently used word groups consisting of 2 words. The results of
both should be combined.
Now I have several possibilities. It is easy to do this in C as
well. Alternatives are using flex, or the combination of flex and
bison.
To have Flex identify a word is easy:
[-0-9A-Za-z]+
So is the identification of 2 words:
[-0-9A-Za-z]+[ \t][-0-9A-Za-z]+
The easiest way to implement this is to write 2 programs and combine
the results manually.
Now my question is: can both be combined in 1 Flex program, or in 1
Flex and Bison program? Flex will try to satisfy the longest match, so
it will not find the single word. Does this imply that I should
introduce some functionality like a 'Moving Average Filter'? Are
there better solutions?
In a common Flex/Bison setup, the lexer finds the identifiers, which
are handed over to the parser, though one may do it otherwise should
the need arise. Translated to your case, the Flex-generated lexer would
find the words and hand them over to the Bison-generated parser.
But you might want to be able to identify word sequences with overlap:
in a sequence of words w_1, w_2, w_3, ..., finding both w_1 w_2
and w_2 w_3. The parser/lexer combination consumes the input without
backtracking, so you need to do this in the code of the actions.
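As a rough illustration of doing the overlap in the action code, here
is a minimal sketch in plain C. It matches one word at a time and
remembers the previous word, so every word is counted once on its own
and once as the second half of a pair. The names Table, table_add,
table_get and count_words_and_pairs, the fixed sizes, and the linear
search are all assumptions made for brevity, not part of Flex or Bison.
It also assumes that any run of non-word characters separates two
words, which is looser than the single space or tab in the pattern
above.

```c
#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define MAX_ENTRIES 1024  /* hypothetical fixed capacity, for brevity */
#define MAX_KEY     128   /* room for two 63-char words plus a space */

typedef struct {
    char key[MAX_KEY];
    int  count;
} Entry;

typedef struct {
    Entry entries[MAX_ENTRIES];
    int   size;
} Table;

/* Increment the count for key, adding it if it is new (linear search).
   New keys are silently dropped once the table is full. */
void table_add(Table *t, const char *key)
{
    for (int i = 0; i < t->size; i++) {
        if (strcmp(t->entries[i].key, key) == 0) {
            t->entries[i].count++;
            return;
        }
    }
    if (t->size < MAX_ENTRIES) {
        snprintf(t->entries[t->size].key, MAX_KEY, "%s", key);
        t->entries[t->size].count = 1;
        t->size++;
    }
}

/* Return the count stored for key, or 0 if it was never added. */
int table_get(const Table *t, const char *key)
{
    for (int i = 0; i < t->size; i++)
        if (strcmp(t->entries[i].key, key) == 0)
            return t->entries[i].count;
    return 0;
}

/* Same character class as [-0-9A-Za-z], assuming the "C" locale. */
int is_word_char(int c)
{
    return isalnum((unsigned char)c) || c == '-';
}

/* Count single words and overlapping two-word groups in one pass. */
void count_words_and_pairs(const char *text, Table *words, Table *pairs)
{
    char prev[MAX_KEY / 2] = "";  /* previous word; empty before the first */
    char cur[MAX_KEY / 2];        /* current word, truncated if too long */
    char pair[MAX_KEY];
    const char *p = text;

    while (*p) {
        while (*p && !is_word_char(*p))   /* skip separators */
            p++;
        size_t n = 0;
        while (*p && is_word_char(*p)) {  /* collect one word */
            if (n < sizeof cur - 1)
                cur[n++] = *p;
            p++;
        }
        if (n == 0)
            break;                        /* only separators were left */
        cur[n] = '\0';

        table_add(words, cur);            /* the word itself */
        if (prev[0] != '\0') {            /* the pair "previous current" */
            snprintf(pair, sizeof pair, "%s %s", prev, cur);
            table_add(pairs, pair);
        }
        strcpy(prev, cur);                /* slide the window by one word */
    }
}
```

In a Flex scanner, the part of the loop after the word has been
collected could serve as the action of the single-word rule, with
yytext taking the place of cur, so no second pattern for word pairs
would be needed. Picking the 3 most frequent entries from each table
is then an ordinary search over the counts.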
Since the parsing is very simple, you might be better off doing
it all in a high-level language, for example Haskell (using Hugs or
GHC/GHCi) or perhaps Perl, which would cut development time.
_______________________________________________
help-bison@gnu.org http://lists.gnu.org/mailman/listinfo/help-bison