Re: Searching Textile Documents

Erik Hatcher Wed, 23 Nov 2005 12:30:36 -0800


On 23 Nov 2005, at 14:30, Alan Chandler wrote:

1) The Analyser


First you'll have to spell it the US English way :)

Since the body has some special syntax, I assume I have to extend the analyser to skip the special symbols etc. Has anyone done this already? Is there a standard place to look? If not, do I have to start again from scratch, or can I just "configure" an existing one? (In particular, I have a routine which will take a textile input string and produce an html output string - so could I use the HTMLParser in the demo - alternatively JavaCC - is that something I could use? - just came across it whilst writing this mail)

I don't know of a Textile analyzer - it looks like you could simply configure all of its special symbols as a list of stop words and hand it to StandardAnalyzer's constructor. You could go to the trouble of converting to HTML and then parse that, but that would be overkill and of course slower.
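
To make the stop-word idea concrete, here is a small plain-Java sketch of the effect it is meant to have. It is not Lucene code: the class name, the method name and the list of Textile block signatures are all made up for illustration, and the simple split on non-alphanumeric characters only approximates whatever tokens StandardAnalyzer would really produce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: mimics "tokenize, lowercase, drop stop words".
final class StopWordSketch {
    // Hypothetical stop list: Textile block signatures such as "h1." or "bq."
    // once their punctuation has been stripped by the tokenizer.
    static final Set<String> TEXTILE_STOP_WORDS = new HashSet<>(
        Arrays.asList("h1", "h2", "h3", "h4", "h5", "h6", "bq", "p"));

    // Splits on runs of non-letter, non-digit characters, so symbols like
    // '*' and '_' never become tokens, then filters out the stop words.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty() && !TEXTILE_STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }
}
```

With a real analyzer the same list would be passed in as its stop words instead of being filtered by hand. One caveat of this approach is that a genuine word that happens to equal a signature (a lone "p", say) is dropped as well.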

I ultimately want to put a summary of the text on the front portion of my website. In order to calculate where the split is, and therefore how many articles to place, it would be useful as I am analysing it to get some statistics like where is the end of the first paragraph. Is there a "hook" that I can plug into to get that information out (I scanned the javadocs, but I can't find anything obvious).

No, there is nothing special in an analyzer to help with this. It'd probably be best to create a parser for Textile that can give you back the raw text without the markup and also give you back the first paragraph.
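
A very rough sketch of what such a parser could look like, in plain Java with regular expressions. The class and method names are invented for this example, and it only covers a handful of Textile constructs (block signatures like "h1." and "bq.", list markers, quoted links, and the *strong*, _emphasis_ and @code@ modifiers), so treat it as a starting point under those assumptions rather than a complete Textile parser.

```java
import java.util.regex.Pattern;

// Illustration only: strips a small subset of Textile markup.
final class TextileText {
    // Block signatures at the start of a line: "h1. ", "p. ", "bq. "
    private static final Pattern BLOCK_SIGNATURE =
        Pattern.compile("(?m)^(h[1-6]|p|bq)\\. ");
    // List markers at the start of a line: "* item", "## item"
    private static final Pattern LIST_MARKER = Pattern.compile("(?m)^[*#]+ ");
    // Links written as "link text":http://... keep only the link text
    private static final Pattern LINK = Pattern.compile("\"([^\"]+)\":\\S+");
    // Inline modifiers around a phrase; the lookarounds leave
    // identifiers such as snake_case_name alone
    private static final Pattern INLINE =
        Pattern.compile("(?<!\\w)([*_@])(\\S(?:.*?\\S)?)\\1(?!\\w)");

    // Returns the text with the supported markup removed.
    static String stripMarkup(String textile) {
        String text = BLOCK_SIGNATURE.matcher(textile).replaceAll("");
        text = LIST_MARKER.matcher(text).replaceAll("");
        text = LINK.matcher(text).replaceAll("$1");
        text = INLINE.matcher(text).replaceAll("$2");
        return text.trim();
    }

    // Returns the first block (everything up to the first blank line),
    // with markup removed. If the article opens with a heading, that
    // heading is what comes back.
    static String firstParagraph(String textile) {
        String[] blocks = textile.trim().split("\\r?\\n\\s*\\n", 2);
        return stripMarkup(blocks[0]);
    }
}
```

The output of stripMarkup() is what would be handed to the indexer, while firstParagraph() gives the text for a front-page summary. Note that the link pattern swallows everything up to the next whitespace, so punctuation directly after a URL goes with it.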

2) Use of different field types.
I am struggling to understand what field types I need for my different fields.


It really all depends on your searching and results display needs.

For instance, I will want to index all the body of the article, so that the words it contains show up in searches, and I will also want to output the snippet around where the text is on a search page. However I can easily retrieve the article from the database given its ID. Would I therefore make the ID of the article a keyword, and the body of it unstored? and would I build a special space separated string of the (undetermined number of) categories and make them normal.

All of those options are possible and there is no Lucene "best way" to do it. You could easily use Lucene itself as the entire blog storage mechanism if you like, even :)

As for categories - it depends on how you need them to be incorporated into the search. You may want to index them individually (multiple per document, if desired) as Field.Keyword() so they aren't analyzed.


        Erik


