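For example, I am able to do the following:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

      Analyzer analyzer = new StandardAnalyzer(); // or any other analyzer
      TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
      Token t = ts.next(); // next() may throw IOException
      while (t != null) {
        System.out.println("token: " + t);
        t = ts.next();
      }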
  
But I would need to extend it to also handle:
- splitting on hyphens, semicolons, etc.
- stemming (Porter)
- synonyms
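
To make the requirement concrete, here is a rough plain-Java sketch of the pipeline I have in mind. It does not use Lucene at all. The class name, the stopword set, the synonym map and the naiveStem helper are all made up for illustration, and naiveStem only strips a trailing "s", so it is a stand-in for a real Porter stemmer, not an implementation of one.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

class TermFrequencySketch {
    // Hypothetical sample data; a real list would be much longer.
    static final Set<String> STOPWORDS = Set.of("a", "an", "and", "the", "is", "of", "to");
    // Maps a term to the canonical term it should be counted under.
    static final Map<String, String> SYNONYMS = Map.of("automobile", "car", "auto", "car");

    // Placeholder stemmer: drops a single trailing "s". This is NOT Porter stemming.
    static String naiveStem(String term) {
        boolean plural = term.length() > 3 && term.endsWith("s") && !term.endsWith("ss");
        return plural ? term.substring(0, term.length() - 1) : term;
    }

    // Lowercases, splits on anything that is not a letter or digit
    // (so hyphens and semicolons separate terms), drops stopwords,
    // stems, maps synonyms, and counts what is left.
    static Map<String, Integer> termFrequencies(String text, UnaryOperator<String> stemmer) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty() || STOPWORDS.contains(raw)) {
                continue;
            }
            String stemmed = stemmer.apply(raw);
            String canonical = SYNONYMS.getOrDefault(stemmed, stemmed);
            freq.merge(canonical, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        String text = "The red-car; an automobile and two cars";
        // prints {red=1, car=3, two=1}
        System.out.println(termFrequencies(text, TermFrequencySketch::naiveStem));
    }
}
```

The stemmer is passed in as a function so that a proper Porter implementation could be swapped in later. Synonyms are looked up after stemming here, which means the synonym map has to hold stemmed forms; doing it the other way round would be just as reasonable.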


Thanks


joe_coder wrote:
> 
> Grant, thanks for responding.
> 
> My issue is that I am not planning to use Lucene (as I don't need any
> search capability, at least not yet). All I have is a text document, and I
> need to extract keywords and their frequencies (which could be a simple
> split on spaces plus a running count). But I realize that I would need to do
> some preprocessing to remove stopwords, stem words, and check for synonyms.
> So I am wondering if such code already exists in Lucene (or any other
> project) that I can use directly.
> 
> Thanks!
> 
> 
> 
> Grant Ingersoll-6 wrote:
>> 
>> 
>> On Aug 13, 2009, at 7:40 AM, joe_coder wrote:
>> 
>>>
>>> I was wondering if there is any way to directly use the Lucene API to
>>> extract terms from a given string. My requirement is that I have a text
>>> document for which I need a term frequency vector (after stemming,
>>> removing stopwords, and synonym checks). The result needs to be the
>>> terms and their frequencies.
>> 
>> IndexReader.getTermFreqVector(), assuming you have indexed using Term Vectors.
>> 
>> 
>>>
>>> Is it possible to get this using any Lucene API? (As I see it, Lucene
>>> also needs to stem, remove stopwords, handle synonyms, etc. before
>>> indexing.) Or is there any Java project that would help me with this?
>>> -- 
>>> View this message in context:
>>> http://www.nabble.com/Term-Extraction-tp24953406p24953406.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>> 
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>> 
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Term-Extraction-tp24953406p24954264.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

