Lucene 4 - POS and Syntactic Tagging

Mark McGuire Wed, 14 Mar 2012 09:38:54 -0700

I'm working on a project where I need to tag both the part of speech and other syntactic information on tokens so that this information is searchable. I have read the threads on the mailing list regarding part of speech tagging here <http://mail-archives.apache.org/mod_mbox/lucene-java-user/201105.mbox/%3cbanlktimwqcq_gf2pxe8hyc_r75ncwdr...@mail.gmail.com%3E> and the many responses to similar questions. To me, inserting 0 increment tokens seems rather clunky, especially when TypeAttributes appear to be what one would want to use. Does Lucene do anything extra when the Type is set to or not set to its default, "word"? Is it possible to write a search that uses multiple attributes from TokenAttributes (i.e., a search that searches for CharTermAttribute "dog" followed by a TypeAttribute of verb)?

Also, if I were to use 0 increment tokens for tagging, would data like document length or sumTotalTermFreq be different from a document indexed without these tags? How would I counteract these differences, if any occur?
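
To make the second question concrete, here is a minimal sketch of what I understand 0 increment tagging to look like. It is plain Java with no Lucene classes: `Token`, `tag`, `positions`, `countAll`, `countNonOverlapping`, and the `_pos_` prefix are all hypothetical names made up for illustration. Each word is emitted with a position increment of 1 and its tag with an increment of 0, so the tag lands on the same position as the word.

```java
import java.util.ArrayList;
import java.util.List;

public class ZeroIncrementSketch {

    // Hypothetical token model: a term plus its position increment.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Emit each word with increment 1, then its tag with increment 0,
    // so the tag is stacked on the same position as the word.
    static List<Token> tag(String[] words, String[] tags) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(new Token(words[i], 1));
            out.add(new Token("_pos_" + tags[i], 0)); // "_pos_" prefix is an assumption
        }
        return out;
    }

    // Absolute positions are the running sum of increments, starting before 0.
    static int[] positions(List<Token> tokens) {
        int[] pos = new int[tokens.size()];
        int current = -1;
        for (int i = 0; i < tokens.size(); i++) {
            current += tokens.get(i).positionIncrement;
            pos[i] = current;
        }
        return pos;
    }

    // Count every token, including the stacked tag tokens.
    static int countAll(List<Token> tokens) {
        return tokens.size();
    }

    // Count only tokens that advance the position (increment > 0).
    static int countNonOverlapping(List<Token> tokens) {
        int n = 0;
        for (Token t : tokens) {
            if (t.positionIncrement > 0) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<Token> tokens = tag(new String[] {"dog", "barks"}, new String[] {"NN", "VBZ"});
        int[] pos = positions(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(pos[i] + "\t" + tokens.get(i).term);
        }
        System.out.println("all tokens: " + countAll(tokens));
        System.out.println("non-overlapping tokens: " + countNonOverlapping(tokens));
    }
}
```

In this sketch the tagged stream has twice as many tokens as the untagged one, while the number of distinct positions stays the same. Which of these two counts ends up in document length and sumTotalTermFreq is what I am trying to find out.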


Thanks,
Mark McGuire
