On Aug 11, 2009, at 5:27 PM, xs2Abhishek wrote:
Hi,
I am trying to decide whether or not I can use Lucene for my
requirements, which mainly involve data tagging. I have to be able to
parse or index a .txt file and then extract text from it. For example,
if the input document contains text like "Location: New York", I should
be able to extract "New York" when the keyword "Location" is present.
I am trying to learn about Lucene and looked into
"tokensFromAnalysis(analyzer, text)", but I'm still not sure how I
could extract data using Lucene. Can I use queries to extract this
piece of information?
You will likely need to write your own TokenFilter to do the
extraction. It is feasible to plug OpenNLP or another extraction
toolkit into the analysis stream to provide these capabilities.
Combined with the Tee/Sink Tokenizer/TokenFilter classes, that can
make for lightweight but still powerful extraction. You might also
look at UIMA, which is in the Apache Incubator.
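
For the simple "Location: New York" case, the extraction step itself
can be sketched in plain Java before any Lucene code is written. This
is only a rough illustration, not a Lucene TokenFilter: the class name
LabelExtractor and the method extract are hypothetical, and it assumes
each label starts its own line and is followed by a colon. The same
logic could later be moved into a custom TokenFilter.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: pulls the value that follows
// "label:" on a line of plain text.
final class LabelExtractor {

    // Returns the trimmed text after "label:" on the first line that starts
    // with the label, or Optional.empty() if no such line has a value.
    // Assumption: one label per line, e.g. "Location: New York".
    static Optional<String> extract(String text, String label) {
        // Pattern.quote keeps any regex metacharacters in the label literal.
        // (?m) makes ^ and $ match at line boundaries.
        Pattern p = Pattern.compile(
                "(?m)^[ \\t]*" + Pattern.quote(label) + "[ \\t]*:[ \\t]*(.+?)[ \\t]*$");
        Matcher m = p.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String doc = "Name: Example\nLocation: New York\n";
        // Prints the extracted value, or "(none)" if the label is absent.
        System.out.println(extract(doc, "Location").orElse("(none)"));
    }
}
```

The label match here is case-sensitive and tied to the start of a
line; free-form text would need something closer to the OpenNLP or
UIMA route mentioned above.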
Any help on this would be appreciated.
Thanks,
Abhishek
--
View this message in context:
http://www.nabble.com/How-to-tune-Analyzer-for-Text-Extraction-tp24926082p24926082.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search