preserving markup of content ?

Sulman Sarwar Wed, 07 Apr 2010 20:04:38 -0700

Hi All,

I am working on some language data and i need to index/search it. I
have used lucene for indexing plain text documents before as well (no
fancy tricks, just plain text indexing). The data that i have now is
transcribed text and is heavily marked up. (Its mostly conversations
and interviews). I can easily remove the markup and extract the text
and feed it to a lucene indexer but i need to preserve some important
markups so that at the time of retrieval the text can make some sense.
Now if i leave the required markup intact and index the documents, i
fear the markup will get tokenized too and will become searchable. I
dont want the markup to be searched but i need to keep it somehow
attached with the actual text to make the retrieval process easy. Can
you suggest me what/how to do it? Correct me if i am wrong. :)


Thanks for the help.

Sulman.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

preserving markup of content ?

Reply via email to