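For example, I am able to do:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    Analyzer analyzer = new StandardAnalyzer(); // or any other analyzer
    TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
    Token t = ts.next();
    while (t != null) {   // next() returns null once the stream is exhausted
        System.out.println("token: " + t);
        t = ts.next();
    }

But I would need to enhance it to include:
- splitting on hyphens, semicolons, etc.
- stemming (Porter)
- synonyms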
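
To make the list above concrete, here is a minimal sketch of the requested preprocessing in plain Java, with no Lucene dependency. Everything in it is an assumption made for illustration: the class name TermFrequencySketch, the tiny STOPWORDS and SYNONYMS tables, and the stem() helper are hypothetical, and stem() is a crude suffix stripper standing in for a real Porter stemmer, not an implementation of one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TermFrequencySketch {
    // Hypothetical, deliberately tiny stopword list (an assumption, not Lucene's list).
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "of", "to", "is", "in", "on", "for"));

    // Hypothetical synonym table mapping a variant to its canonical term.
    static final Map<String, String> SYNONYMS = new LinkedHashMap<>();
    static {
        SYNONYMS.put("automobile", "car");
        SYNONYMS.put("quick", "fast");
    }

    // Crude suffix stripping; a placeholder for a real Porter stemmer.
    static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 4 && word.endsWith("ed")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Lowercase, split on whitespace/hyphen/semicolon/punctuation, drop stopwords,
    // stem, map synonyms to a canonical term, then count occurrences.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[\\s\\-;,.:!?()\"]+");
        for (String raw : tokens) {
            if (raw.isEmpty() || STOPWORDS.contains(raw)) {
                continue;
            }
            String term = stem(raw);
            term = SYNONYMS.getOrDefault(term, term);
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "The quick cars; a fast-moving automobile passed the cars.";
        System.out.println(termFrequencies(text));
    }
}
```

The order of the steps is a design choice in this sketch: stopwords are dropped before stemming so the list can hold surface forms, and synonyms are looked up after stemming so one table entry covers plural and singular forms. In Lucene the same stages would normally be chained as token filters inside a custom Analyzer; which filter classes are available depends on the Lucene version in use.
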

Thanks

joe_coder wrote:
>
> Grant, thanks for responding.
>
> My issue is that I am not planning to use Lucene (as I don't need any
> search capability, at least not yet). All I have is a text document and I
> need to extract keywords and their frequency (which could be a simple split
> on space and tracking the count). But I realize that I would need to do some
> preprocessing to remove stopwords, stem words and also check for synonyms.
> So I am wondering if there is already such code present in Lucene (or any
> other project) that I can use directly.
>
> Thanks!
>
>
> Grant Ingersoll-6 wrote:
>>
>> On Aug 13, 2009, at 7:40 AM, joe_coder wrote:
>>
>>> I was wondering if there is any way to directly use the Lucene API to
>>> extract terms from a given string. My requirement is that I have a text
>>> document for which I need a term frequency vector (after stemming,
>>> removing stopwords and synonym checks). The result needs to be the
>>> terms and their frequencies.
>>
>> IndexReader.getTermFreqVector(), assuming you have indexed using Term
>> Vectors.
>>
>>> Is it possible to get this using any Lucene API? (As I see it, Lucene
>>> also needs to stem, remove stopwords, handle synonyms etc. before
>>> indexing.) Or is there any Java project that would help me with this?
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Term-Extraction-tp24953406p24953406.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>

--
View this message in context: http://www.nabble.com/Term-Extraction-tp24953406p24954264.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.