Re: Text extraction from ms word doc

2010-01-11 Thread Karl Wettin
Have you tried antiword? http://www.winfield.demon.nl/ karl 11 jan 2010 kl. 21.04 skrev maxSchlein: I was looking for an option for Text extraction from a word doc. Currently I am using POI; however, when there is a table in the doc, for each column POI brings back a . The whites

Text extraction from ms word doc

2010-01-11 Thread maxSchlein
I was looking for an option for Text extraction from a word doc. Currently I am using POI; however, when there is a table in the doc, for each column POI brings back a . The whitespace analyzer is not filtering out this character. So whatever word or phrase that is the last word or phrase wi
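One possible workaround for the stray table character, sketched in plain Java: strip control characters from the extracted text before it reaches the analyzer. This assumes the character POI returns for each column is a non-printing control character (Word's binary format commonly marks table cells with U+0007, but the preview above does not show which character it is). The class name `ExtractedTextCleaner` and the method `clean` are hypothetical names for illustration, not part of POI or Lucene.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of POI or Lucene.
final class ExtractedTextCleaner {
    // Control characters (Unicode category Cc) except tab, line feed and carriage return.
    // ASSUMPTION: the table-cell marker is one of these, e.g. U+0007.
    private static final Pattern CONTROL = Pattern.compile("[\\p{Cc}&&[^\\t\\n\\r]]");

    // Runs of two or more spaces/tabs left behind by the replacement.
    private static final Pattern SPACES = Pattern.compile("[ \\t]{2,}");

    private ExtractedTextCleaner() {
    }

    // Replaces each control character with a space so neighbouring words stay
    // separated, collapses repeated spaces, then trims the ends.
    static String clean(String extracted) {
        String spaced = CONTROL.matcher(extracted).replaceAll(" ");
        return SPACES.matcher(spaced).replaceAll(" ").trim();
    }
}
```

The text returned by the extractor would be passed through `ExtractedTextCleaner.clean(...)` before being added to the index, so a whitespace-based analyzer sees the last word of each cell as an ordinary token. Replacing with a space rather than deleting keeps adjacent cells from being glued into one word.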

Index corruption using Lucene 2.4.1 - thread safety issue?

2010-01-11 Thread Frank Geary
Hi, I'm using Lucene 2.4.1 and am seeing occasional index corruption. It shows up when I call MultiSearcher.search(). MultiSearcher.search() throws the following exception: ArrayIndexOutOfBoundsException. The error is: Array index out of range: ### where ### is a number representing an index

Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

2010-01-11 Thread Michael McCandless
Super! Thanks for bringing closure. Mike On Mon, Jan 11, 2010 at 12:55 PM, Yuliya Palchaninava wrote: > Thanks again. > > Disabling norms, where it was possible without influencing the search quality, > has solved the problem: > - The not optimized version of the index has become smaller. > - T

Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

2010-01-11 Thread Yuliya Palchaninava
Thanks again. Disabling norms, where it was possible without influencing the search quality, has solved the problem: - The non-optimized version of the index has become smaller. - The optimized index is practically the same size as the non-optimized one. Yuliya > -Original Message---

Field creation with TokenStream and stored value

2010-01-11 Thread Benjamin Heilbrunn
Hey out there, in Lucene it's not possible to create a Field based on a TokenStream AND supply a stored value. Is there a reason why a Field constructor in the form of public Field(String name, TokenStream tokenStream, String storedValue) does not exist? I am using trees of TeeSinkTokenFilter

Re: Highlight the whole sentence instead of the partial matching terms

2010-01-11 Thread Sanne Grinovero
If you're searching for the terms "giving" and "and", it will only highlight those terms, not the whole sentence. That's how the highlighter is meant to work: highlight what the user queried. Also, there's no built-in concept of a sentence. regards, Sanne 2010/1/11 Li Leon : > Just figured out, misse

Re: Highlighter doesn't highlight wildcard queries after updating to 2.9.1/3.0.0

2010-01-11 Thread Mohsen Saboorian
Changing MultiTermRewriteMethod fixed all the previous incompatibility issues. After setting this: myQueryParser.setMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE); the highlighter becomes compatible with rewrite, query.rewrite().toString() works as before and scoring works fine for wildc