On 23 Nov 2005, at 14:30, Alan Chandler wrote:
1) The Analyser

First you'll have to spell it the US English way :)

Since the body has some special syntax, I assume I have to extend the analyser to skip the special symbols etc. Has anyone done this already? Is there a standard place to look? If not, do I have to start from scratch, or can I just "configure" an existing one? In particular, I have a routine which takes a Textile input string and produces an HTML output string, so could I use the HTMLParser in the demo? Alternatively, is JavaCC something I could use? (I just came across it whilst writing this mail.)

I don't know of a Textile analyzer. It looks like you could simply collect all of its special symbols into a list of stop words and hand that list to StandardAnalyzer's constructor. You could go to the trouble of converting to HTML and then parsing that, but that would be overkill and of course slower.

I ultimately want to put a summary of the text on the front portion of my web site. In order to calculate where the split is, and therefore how many articles to place, it would be useful, as I am analysing the text, to get some statistics such as where the end of the first paragraph is. Is there a "hook" that I can plug into to get that information out? (I scanned the javadocs, but I can't find anything obvious.)

No, there is nothing special in an analyzer to help with this. It'd probably be best to create a parser for Textile that can give you back the raw text without the markup and also give you back the first paragraph.
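
As a rough illustration of that suggestion, here is a minimal sketch in plain Java (no Lucene involved). The class and method names (TextileText, stripMarkup, firstParagraph) are made up for this example, and it only assumes a small subset of Textile: "h1."-style block signatures, "*" and "#" list markers, "text":url links, and *strong* / _emphasis_ spans. A real parser would need to cover far more of the syntax.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: handles only a small Textile subset.
class TextileText {
    // Block signatures at the start of a line, e.g. "h1. ", "bq. ", "p. ".
    private static final Pattern BLOCK_PREFIX =
            Pattern.compile("(?m)^(?:h[1-6]|bq|p)\\.[ \\t]+");
    // A heading signature at the very start of a block.
    private static final Pattern HEADING = Pattern.compile("h[1-6]\\.[ \\t]+");
    // List markers at the start of a line: "* item" or "# item".
    private static final Pattern LIST_MARKER = Pattern.compile("(?m)^[*#]+[ \\t]+");
    // Links written as "text":url; only the link text is kept.
    private static final Pattern LINK = Pattern.compile("\"([^\"]+)\":\\S+");
    // Inline *strong* and _emphasis_ spans that are not inside a word.
    private static final Pattern INLINE =
            Pattern.compile("(?<!\\w)([*_])(\\S(?:.*?\\S)?)\\1(?!\\w)");

    // Returns the text with the supported markup removed.
    static String stripMarkup(String textile) {
        String s = textile.replace("\r\n", "\n");
        s = BLOCK_PREFIX.matcher(s).replaceAll("");
        s = LIST_MARKER.matcher(s).replaceAll("");
        s = LINK.matcher(s).replaceAll("$1");
        s = INLINE.matcher(s).replaceAll("$2");
        return s.trim();
    }

    // Returns the first blank-line-separated block that is not a heading,
    // with its markup removed, or "" if there is none.
    static String firstParagraph(String textile) {
        String normalized = textile.replace("\r\n", "\n").trim();
        for (String block : normalized.split("\\n\\s*\\n")) {
            block = block.trim();
            if (HEADING.matcher(block).lookingAt()) {
                continue; // skip headings such as "h2. Title"
            }
            String plain = stripMarkup(block);
            if (!plain.isEmpty()) {
                return plain;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String post = "h2. My Post\n\n"
                + "This is the *first* paragraph with a \"link\":http://example.com here.\n\n"
                + "Second paragraph.";
        System.out.println(firstParagraph(post));
        System.out.println(stripMarkup(post));
    }
}
```

The result of stripMarkup is what you would hand to the analyzer for the body field, and firstParagraph gives you the text for the front-page summary without the analyzer having to know anything about Textile.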

2) Use of different field types.

I am struggling to understand what field types I need for my different fields.

It really all depends on your searching and results display needs.

For instance, I will want to index the whole body of the article, so that the words it contains show up in searches, and I will also want to output the snippet around where the text is on a search page. However, I can easily retrieve the article from the database given its ID. Would I therefore make the ID of the article a keyword, and the body of it unstored? And would I build a special space-separated string of the (undetermined number of) categories and make them normal?

All of those options are possible and there is no Lucene "best way" to do it. You could easily use Lucene itself as the entire blog storage mechanism if you like, even :)

As for categories - it depends on how you need them to be incorporated into the search. You may want to index them individually (multiple per document, if desired) as Field.Keyword() so they aren't analyzed.

        Erik

