Highlighting of original documents

Oystein Reigem Tue, 13 Mar 2007 06:59:50 -0800

Hi,

I want to implement fulltext search on a collection of documents. I am trying to figure out which system is the better choice: eXist, Lucene, or some combination of the two. I have some knowledge of eXist, but don't know much about Lucene.

I'd like to display the result of a search as a list of excerpts/snippets with highlighted search words. When the user clicks an item in the result list to bring up the document in full, I'd like to have search words highlighted in the full document as well.

The document collection is very diverse. There are pure text documents and well-formed XML and HTML documents, but unfortunately also HTML documents that are not quite well-formed, Word documents and PDFs. Many of the formats go beyond what eXist and Lucene can handle, and I realise some conversion, or text extraction, is necessary. As far as I know, Lucene can only index and search pure text (and fields), so the documents must be run through appropriate filters that extract the text (and field values). After that, fulltext search is possible.
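
As an illustration of the extraction step for the "not quite well-formed HTML" case, here is a minimal sketch using only Python's standard library `html.parser`, which tolerates unclosed tags. The names `TextExtractor` and `extract_text` are made up for this example, and nothing here is part of eXist or Lucene. Word documents and PDFs would need other filters entirely.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, skipping script/style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Return the plain text of `html` with whitespace collapsed.

    Text pieces are joined with a space, so a word interrupted by an
    inline tag (e.g. <b>H</b>ello) would be split; good enough for a sketch.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    sloppy = "<html><body><p>Hello <b>world<p>second para<script>var x=1;</script></body>"
    print(extract_text(sloppy))
```

The extracted string is what would be handed to the indexer. Note that it no longer carries any information about where each word sat in the original markup.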

But what about highlighting? I know it is possible to get highlighting in the pure text version, but what about the original document, when the original document is something other than pure text, e.g. a simple XML document? Is it at all possible to get the search words tagged in the XML document?


I assume not, but ask anyway. :-)
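
For reference, the kind of tagging meant here can be sketched for the simplest case: a small, well-formed XML document, processed outside the search engine. This is only a sketch under assumptions. It uses Python's standard `xml.etree.ElementTree`, the `hit` element name and the function names are hypothetical, and it uses neither the eXist nor the Lucene API, so it does not answer whether either system can do this itself. It also matches on the literal search words rather than on the engine's analysed terms, and the default parser drops comments and rewrites namespace prefixes.

```python
import re
import xml.etree.ElementTree as ET


def _split(text, pattern, tag):
    """Split `text` on matches; return (leading text, list of hit elements)."""
    if not text:
        return text, []
    hits, pos, lead = [], 0, None
    for m in pattern.finditer(text):
        before = text[pos:m.start()]
        if hits:
            hits[-1].tail = before   # text between two hits
        else:
            lead = before            # text before the first hit
        hit = ET.Element(tag)
        hit.text = m.group(0)
        hits.append(hit)
        pos = m.end()
    if not hits:
        return text, []
    hits[-1].tail = text[pos:]       # text after the last hit
    return lead, hits


def _rebuild(elem, pattern, tag):
    """Copy `elem`, wrapping matches in its text and in its children's tails."""
    new = ET.Element(elem.tag, elem.attrib)
    new.text, hits = _split(elem.text, pattern, tag)
    new.extend(hits)
    for child in elem:
        new_child = _rebuild(child, pattern, tag)
        new_child.tail, hits = _split(child.tail, pattern, tag)
        new.append(new_child)
        new.extend(hits)
    return new


def highlight_xml(xml_text, words, tag="hit"):
    """Return `xml_text` with whole-word, case-insensitive matches wrapped in <tag>."""
    if not words:
        return xml_text
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    root = _rebuild(ET.fromstring(xml_text), pattern, tag)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    doc = "<doc><p>Lucene and <b>eXist</b> index text; lucene again.</p></doc>"
    print(highlight_xml(doc, ["lucene"]))
```

The open part of the question is whether the hit positions found in the extracted pure text can be carried back to the original document, rather than re-matching the words against the original as this sketch does.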

Cheers,

- Øystein -


--
Øystein Reigem, The department of culture, language and information technology (Aksis),
Allegt 27, N-5007 Bergen, Norway. Tel: +47 55 58 32 42. Fax: +47 55 58 94 70.
E-mail: <[EMAIL PROTECTED]>. Home tel: +47 56 14 06 11. Mobile: +47 97 16 96 64.
Home e-mail: <[EMAIL PROTECTED]>. Aksis home page: <www.aksis.uib.no>.

