Re: How to tune Analyzer for Text Extraction

Grant Ingersoll Wed, 12 Aug 2009 04:32:16 -0700


On Aug 11, 2009, at 5:27 PM, xs2Abhishek wrote:

Hi,
I am trying to decide whether or not I can use Lucene for my
requirements, which mainly include data tagging. I have to be able to parse
or index a .txt file and then be able to extract text accordingly. For
example, if the input document has some text like "Location: New York", then
for this input I should be able to extract "New York" if the keyword
Location is present. I am trying to learn about Lucene and looked into
"tokensFromAnalysis(analyzer, text)", but I'm still not sure how I could
extract data using Lucene. Can I use queries to extract this piece of
information?

You will likely need to write your own TokenFilter that can do the extraction. It is feasible to plug in something like OpenNLP or other extraction toolkits into the Analysis stream and then provide these capabilities. That, combined with the Tee/Sink Tokenizer/TokenFilter capabilities, can make for some lightweight but still powerful extraction capabilities. You might also look at UIMA, which is in the Apache Incubator.
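
As a rough illustration of the extraction step itself, here is a minimal sketch in plain Java. It assumes a line-oriented "Key: value" format like the "Location: New York" example. The names KeyValueExtractor and extract are hypothetical and this is not Lucene API; it only shows the kind of logic a custom TokenFilter would have to carry out inside the analysis chain.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
class KeyValueExtractor {

    // Returns the text that follows "<key>:" on the same line, if the key is present.
    // Assumption: the key starts a line (leading spaces or tabs are allowed).
    static Optional<String> extract(String text, String key) {
        Pattern p = Pattern.compile(
            "(?m)^[ \\t]*" + Pattern.quote(key) + "[ \\t]*:[ \\t]*(.+?)[ \\t]*$");
        Matcher m = p.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // prints "New York"
        System.out.println(extract("Location: New York", "Location").orElse("(none)"));
    }
}
```

A regular expression is enough for fixed "Key: value" lines; free-form text is where a toolkit such as OpenNLP would come in.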

Any help on this would be appreciated.

Thanks,
Abhishek
--
View this message in context: 
http://www.nabble.com/How-to-tune-Analyzer-for-Text-Extraction-tp24926082p24926082.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search


