Hi, I would like to associate information (or labels) with each word or a range of words in a document. Information such as this word is a noun, that word is a verb, this period marks the end of a sentence, "kick the bucket" is a contiguous phrase, "white house" is a location and so on. I am seeking a good representation for such information so that they can be easily stored as additional fields in a lucene document, and easily recovered after a search. For the more technically inclined, this would allow me to store part-of-speech tags, chunk tags, sentence boundary markers and parse trees for every indexed document.
These additional information will enable Lucene to perform additional post-processing on retrieved documents for various purposes such as information extraction, summarization, question answering, etc... Is there any available api? If not, I would appreciate any suggestions and tips on how such information can best be stored in a Lucene document. Regards, Dave.