okay, so i'm very new to lucene, so this may be my mistake: i can get it to
index .txt files, but when i try to index word documents (using poi), the
program starts running and, when it reaches a .doc file, i get the following
errors:
Exception in thread "main"
org.apache.poi.hpsf.IllegalPropertySe
Here goes,
I'm developing an application using lucene which will evaluate the
representativeness of a list of keywords within a collection of documents.
I'm doing this by indexing the documents and then loading the list of
keywords and, using the IndexReader class and DefaultSimilarity, retrieving
teration,
> so every second is skipped... ?
>
> "chris.b" <[EMAIL PROTECTED]> wrote on 10/12/2007 12:58:15:
>
>>
>> Here goes,
>> I'm developing an application using lucene which will evaluate the
>> representativeness of a list of keywords w
I'm not even sure if it can be considered Named Entity Recognition, but what
the hell...
so here's my problem...
I was asked to retrieve the named entities out of a collection of
documents, and I've thought of two ways of doing so (not sure if either of
them works)...
a) index the documents by w
is it possible to add a document to an index and, while doing so, get the
terms in that document? If so, how would one do this? :x
thanks :)
--
View this message in context:
http://www.nabble.com/Question-regarding-adding-documents-tp14656336p14656336.html
Sent from the Lucene - Java Users mail
Following your suggestion (I think), I built a tokenfilter with the following
code for next():
public final Token next() throws IOException {
Token newToken = input.next();
termText = newToken.termText();
Character tempChar = termText.charAt
Wrapping the WhitespaceAnalyzer with the ngram filter creates unigrams and
the ngrams that i indicate, while maintaining the whitespace. :)
The reason i'm doing this is because I only wish to index names with more
than one token.
taking your example (text by John Bear, old.), the NGramAnalyzerWrapper
creates the following tokens:
text
text by
by
by John
John
John Bear,
Bear,
Bear, old.
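For anyone following along, here is a rough plain-Java sketch (no Lucene involved) of the kind of unigram-plus-bigram output listed above. The class and method names (`NGramSketch`, `unigramsAndNGrams`) are made up for illustration and are not the NGramAnalyzerWrapper from this thread. It splits on whitespace only, so punctuation stays attached to the word. Note that this sketch also emits the trailing unigram `old.`, which the list above does not show.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in only; not the NGramAnalyzerWrapper discussed in the thread.
class NGramSketch {

    // Returns, for each word position, the unigram followed by every n-gram
    // (up to maxN words) that starts at that position.
    static List<String> unigramsAndNGrams(String text, int maxN) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out; // nothing to tokenize
        }
        // Split on whitespace only, so "Bear," keeps its comma.
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            StringBuilder gram = new StringBuilder();
            for (int n = 1; n <= maxN && i + n <= words.length; n++) {
                if (n > 1) {
                    gram.append(' '); // words inside an n-gram are joined by one space
                }
                gram.append(words[i + n - 1]);
                out.add(gram.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : unigramsAndNGrams("text by John Bear, old.", 2)) {
            System.out.println(token);
        }
    }
}
```

If only multi-word names should be indexed, the unigrams could be dropped by starting the inner loop at n = 2; whether that matches the filter in this thread is an assumption.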
I have managed to get rid of the error, but now it just doesn't add anything
to the index :s
I'm attaching the NGramAnalyzerWrapper and NG
solved it... i was using token.toString() instead of token.termText();
thanks for the help :)
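In case it helps someone hitting the same thing, here is a small plain-Java sketch of why the two calls can differ. `SimpleToken` is a hypothetical stand-in, not Lucene's Token class, and the debug format its `toString()` returns is an assumption chosen only to show the idea: a debug string usually carries more than the bare term, so indexing it gives text that never matches the original word.

```java
// Hypothetical stand-in for an analysis token; NOT Lucene's Token class.
class TokenTextSketch {

    static final class SimpleToken {
        private final String text;
        private final int start;
        private final int end;

        SimpleToken(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        // The bare term text: this is what should go into the index.
        String termText() {
            return text;
        }

        // A debug representation (format assumed for illustration).
        @Override
        public String toString() {
            return "(" + text + "," + start + "," + end + ")";
        }
    }

    public static void main(String[] args) {
        SimpleToken token = new SimpleToken("John", 8, 12);
        System.out.println(token.termText());
        System.out.println(token.toString());
    }
}
```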
--
View this message in context:
http://www.nabble.com/Basic-Named-Entity-Indexing-tp14291880p14715727.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
--
I'm sure this has been asked a few times before, but i searched and searched
and found no answer (apart from using luke), but I would like to know if
there's a way of retrieving the number of terms in an index.
I tried cycling through a TermEnum, but it doesn't do anything :|
10 matches