On 23 Nov 2005, at 14:30, Alan Chandler wrote:
> 1) The Analyser
First you'll have to spell it the US English way :)
> Since the body has some special syntax, I assume I have to extend the
> analyser to skip the special symbols etc. Has anyone done this already?
> Is there a standard place to look? If not, do I have to start again
> from scratch, or can I just "configure" an existing one? (In
> particular, I have a routine which will take a textile input string and
> produce an html output string - so could I use the HTMLParser in the
> demo - alternatively JavaCC - is that something I could use? - just
> came across it whilst writing this mail)
I don't know of a Textile analyzer - it looks like you could simply
configure all of its special symbols as a list of stop words and hand
it to StandardAnalyzer's constructor. You could go to the trouble
of converting to HTML and then parsing that, but that would be overkill
and of course slower.
> I ultimately want to put a summary of the text on the front portion of
> my web site. In order to calculate where the split is, and therefore
> how many articles to place it would be useful as I am analysing it to
> get some statistics like where is the end of the first paragraph. Is
> there a "hook" that I can plug into to get that information out (I
> scanned the javadocs, but I can't find anything obvious).
No, there is nothing special in an analyzer to help with this. It'd
probably be best to create a parser for Textile that can give you
back the raw text without the markup and also give you back the first
paragraph.
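
As a rough illustration of that suggestion, here is a minimal sketch of
such a parser in plain Java. It is not part of Lucene, and the class name
TextileText and its two methods are hypothetical. It assumes only a small
subset of Textile (block prefixes like "h1. " and "p. ", list markers,
*strong* / _emphasis_, and "label":url links) and that paragraphs are
separated by blank lines; a real Textile parser would need to cover much
more.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips a small subset of Textile markup.
final class TextileText {
    // Block signatures such as "h1. ", "p. " or "bq. " at the start of a line.
    private static final Pattern BLOCK_PREFIX =
            Pattern.compile("(?m)^(h[1-6]|p|bq)\\. ");
    // Links written as "label":http://example.com ; only the label is kept.
    private static final Pattern LINK =
            Pattern.compile("\"([^\"]+)\":\\S+");
    // List markers ("* " or "# ", possibly nested) at the start of a line.
    private static final Pattern LIST_MARKER =
            Pattern.compile("(?m)^[*#]+ ");
    // Inline *strong* and _emphasis_, not touching snake_case_words.
    private static final Pattern INLINE =
            Pattern.compile("(?<!\\w)([*_])([^*_\\n]+)\\1(?!\\w)");

    private TextileText() {
    }

    // Returns the text with the supported markup removed.
    static String stripMarkup(String textile) {
        String text = BLOCK_PREFIX.matcher(textile).replaceAll("");
        text = LINK.matcher(text).replaceAll("$1");
        text = LIST_MARKER.matcher(text).replaceAll("");
        text = INLINE.matcher(text).replaceAll("$2");
        return text.trim();
    }

    // Returns the first paragraph of the stripped text, i.e. everything
    // up to the first blank line (an assumption about the input layout).
    static String firstParagraph(String textile) {
        String plain = stripMarkup(textile);
        return plain.split("\\r?\\n\\s*\\n", 2)[0].trim();
    }
}
```

The output of stripMarkup() is what you would hand to the analyzer for
indexing, and firstParagraph() gives you the piece for the front-page
summary, so both needs are met outside of the analysis step.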
> 2) Use of different field types.
> I am struggling to understand what field types I need for my different
> fields.
It really all depends on your searching and results display needs.
> For instance, I will want to index all the body of the article, so that
> the words it contains show up in searches, and I will also want to
> output the snippet around where the text is on a search page. However I
> can easily retrieve the article from the database given its ID. Would I
> therefore make the ID of the article a keyword, and the body of it
> unstored? and would I build a special space separated string of the
> (undetermined number of) categories and make them normal.
All of those options are possible and there is no Lucene "best way"
to do it. You could easily use Lucene itself as the entire blog
storage mechanism if you like, even :)
As for categories - it depends on how you need them to be
incorporated into the search. You may want to index them
individually (multiple per document, if desired) as Field.Keyword()
so they aren't analyzed.
Erik