I am using a custom analyzer:
    public TokenStream tokenStream(String fieldName, Reader reader) {
        StandardTokenizer tokenStream = new StandardTokenizer(reader);
        tokenStream.setMaxTokenLength(maxTokenLength);
        TokenStream result = new ASCIIFoldingFilter(tokenStream);
Let's say we have 4 users: U1, U2, U3 and U4. Each user has a title and a set
of documents created by him/her. Using this info, we can come up with a term
vector (an interest vector) that would contain the set of top terms (those
that appeared in his/her docs) along with their frequencies. So conceptually, we get
"token: "+t));
t = ts.next();
}
But I would need to enhance it to include:
- Splitting on hyphens, semicolons, etc.
- Stemming (Porter)
- Synonyms
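
For illustration, two of the items above (splitting on punctuation and synonym mapping) can be sketched in plain Java without any Lucene classes. This is only a rough sketch: `TokenNormalizer`, `addSynonym`, and `normalize` are made-up names, the synonym table is something you would have to supply yourself, and stemming is left out entirely (Lucene does ship a `PorterStemFilter` that could cover that part).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: splits on punctuation and maps synonyms.
class TokenNormalizer {
    // Assumed synonym table: variant -> canonical term.
    private final Map<String, String> synonyms = new HashMap<>();

    void addSynonym(String variant, String canonical) {
        synonyms.put(variant.toLowerCase(Locale.ROOT),
                     canonical.toLowerCase(Locale.ROOT));
    }

    // Splits on whitespace, hyphens, semicolons and other common punctuation,
    // lowercases each token, then replaces known variants with the canonical term.
    List<String> normalize(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.split("[\\s\\-;:,.!?()]+")) {
            if (raw.isEmpty()) {
                continue; // split() yields an empty first element on a leading delimiter
            }
            String token = raw.toLowerCase(Locale.ROOT);
            out.add(synonyms.getOrDefault(token, token));
        }
        return out;
    }
}
```

With a synonym `automobile -> car` registered, `normalize("state-of-the-art; Automobile")` would give the tokens `state`, `of`, `the`, `art`, `car`.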
Thanks
joe_coder wrote:
>
> Grant, thanks for responding.
>
> My issue is that I am not planning to use lucene ( as I d
alize that I would need to do some
preprocessing to remove stopwords, stem words and also check for synonyms.
So I am wondering whether such code already exists in Lucene (or any other
project) that I can use directly.
Thanks!
Grant Ingersoll-6 wrote:
>
>
> On Aug 13, 2009, at 7:40
I was wondering if there is any way to directly use the Lucene API to extract
terms from a given string. My requirement is that I have a text document for
which I need a term frequency vector (after stemming, stopword removal
and synonym checks). The result needs to be the terms and their frequencies.
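
To make the requirement concrete, here is a minimal plain-Java sketch of the kind of output being asked for: a map from term to frequency with stopwords dropped. It does not use Lucene at all; `TermFrequency`, `count`, and the tiny stopword list are illustrative assumptions, and stemming and synonym checks are not handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class TermFrequency {
    // Tiny illustrative stopword list; a real one would be much longer.
    private static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    // Returns term -> count in order of first appearance, skipping stopwords.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String term = raw.toLowerCase(Locale.ROOT);
            if (STOPWORDS.contains(term)) {
                continue;
            }
            freq.merge(term, 1, Integer::sum);
        }
        return freq;
    }
}
```

For example, `TermFrequency.count("The cat and the dog chased the cat.")` would yield `cat` with a count of 2, and `dog` and `chased` with a count of 1 each.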
I