Hi,
I want to implement fulltext search on a collection of documents. I am
trying to work out which system is the better choice: eXist, Lucene, or
some combination of the two. I have some knowledge of eXist, but don't
know much about Lucene.
I'd like to display the result of a search as a list of
excerpts/snippets with highlighted search words. When the user clicks an
item in the result list to bring up the document in full, I'd like to
have search words highlighted in the full document as well.
The document collection is very diverse. There are pure text documents
and well-formed XML and HTML documents, but unfortunately also HTML
documents that are not quite well-formed, Word documents and PDFs. Many
of the formats go beyond what eXist and Lucene can handle, and I realise
some conversion, or text extraction, is necessary. As far as I know
Lucene can only index and search pure text (and fields), so the
documents must be run through appropriate filters extracting the text
(and field values). Afterwards fulltext search is possible.
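To make the extraction step concrete, here is a minimal sketch of such a
filter for the not-quite-well-formed HTML case. It is written in Python
using only the standard library, and it is not part of eXist or Lucene.
The names TextExtractor and extract_fields, and the field names "title"
and "body", are made up for illustration. Word and PDF documents would
need other tools entirely.

```python
from html.parser import HTMLParser

# Tags after which a space is inserted so that words in adjacent blocks
# do not run together. The list is illustrative, not exhaustive.
BLOCK_TAGS = {"p", "br", "div", "li", "tr", "td", "th",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Hypothetical filter: collects a title field and a body field.

    HTMLParser is lenient, so unclosed or badly nested tags do not
    raise errors; they are simply reported as they are encountered.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._title = []
        self._body = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._body.append(" ")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._body.append(" ")

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and stylesheet content
        (self._title if self._in_title else self._body).append(data)

    def fields(self):
        # Collapse runs of whitespace into single spaces.
        def squeeze(parts):
            return " ".join("".join(parts).split())
        return {"title": squeeze(self._title), "body": squeeze(self._body)}


def extract_fields(html):
    """Return a dict of field name -> plain text, ready to be indexed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields()
```

The resulting dictionary is what would then be handed to the indexer,
one entry per field.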
But what about highlighting? I know it is possible to get highlighting
in the pure text version, but what about the original document, when it
is something other than pure text, e.g. a simple XML document? Is it at
all possible to get the search words tagged in the XML document itself?
I assume not, but ask anyway. :-)
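To show what I mean by "tagged", here is a rough sketch of doing it by
hand, outside the search engine, in Python with only the standard
library. It assumes a made-up element name, hit, and plain
case-insensitive whole-word matching, so it knows nothing about
whatever analysis (stemming, stop words) the search engine applied at
index time. The function names are hypothetical as well. Note that
ElementTree does not preserve comments or the DOCTYPE by default.

```python
import re
import xml.etree.ElementTree as ET


def _split(text, pattern):
    """Split text into (leading_text, [(matched_word, text_after), ...])."""
    pieces = pattern.split(text)  # the pattern has exactly one group
    return pieces[0], list(zip(pieces[1::2], pieces[2::2]))


def _highlight(elem, pattern, tag):
    # Recurse into the original children first, so that the marker
    # elements inserted below are never visited themselves.
    for child in list(elem):
        _highlight(child, pattern, tag)

    # Text following each child (its "tail"). Walk backwards so that
    # inserting markers does not shift the indices still to be visited.
    for index in range(len(elem) - 1, -1, -1):
        child = elem[index]
        if child.tail:
            lead, hits = _split(child.tail, pattern)
            child.tail = lead
            for offset, (word, after) in enumerate(hits, start=1):
                mark = ET.Element(tag)
                mark.text, mark.tail = word, after
                elem.insert(index + offset, mark)

    # Text before the first child.
    if elem.text:
        lead, hits = _split(elem.text, pattern)
        elem.text = lead
        for offset, (word, after) in enumerate(hits):
            mark = ET.Element(tag)
            mark.text, mark.tail = word, after
            elem.insert(offset, mark)


def highlight_xml(xml_text, words, tag="hit"):
    """Return xml_text with each occurrence of a search word wrapped in <tag>."""
    if not words:
        return xml_text
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    root = ET.fromstring(xml_text)
    _highlight(root, pattern, tag)
    return ET.tostring(root, encoding="unicode")
```

A stylesheet could then render the hit elements as highlighted text.
What I would like to know is whether eXist or Lucene can do something
like this for me.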
Cheers,
- Øystein -
--
Øystein Reigem, The department of culture, language and information
technology (Aksis), Allegt 27, N-5007 Bergen, Norway.
Tel: +47 55 58 32 42. Fax: +47 55 58 94 70. E-mail: <[EMAIL PROTECTED]>.
Home tel: +47 56 14 06 11. Mobile: +47 97 16 96 64.
Home e-mail: <[EMAIL PROTECTED]>. Aksis home page: <www.aksis.uib.no>.