Re: Filtering a SpanQuery

2008-05-12 Thread Paul Elschot
On Monday 12 May 2008 09:06:36, Eran Sevi wrote: > Thanks Paul, > > I'll give your code sample a try. > I still think that calling getSpans (the first line of code) that > returns millions of results is going to be much slower than calling > getSpans that's going to return only a few thousands of

Re: Question about startOffset and endOffset

2008-05-12 Thread Brendan Grainger
Hi Erick, Thanks for the reply. The use case I have is this: Say you have a synonym expansion like this: ac -> air conditioning And to keep it simple, a document where the first term is ac. When analyzing the document I currently create a token stream that looks something like this for the

Re: Numerical Range Query

2008-05-12 Thread Erick Erickson
Are you using NumberTools both at index and query time? Because this works exactly as I expect import org.apache.lucene.index.IndexWriter; import org.apache.lucene.store.FSDirectory; import org.apache.lucene.analysis.standard.StandardAnalyzer; import org.apache.lucene.document.Document; import

Re: Numerical Range Query

2008-05-12 Thread Dan Hardiker
Erick Erickson wrote: Although I'm a bit puzzled by what you're actually getting back. You might try using Luke to look at your index to see what's there. I've looked through with Luke and it doesn't look like much has changed between using NumberTools and not. NumberTools definitely does some

Re: Numerical Range Query

2008-05-12 Thread Erick Erickson
Yep, Lucene works with strings, not numbers, so the fact that you're not getting what you expect is expected. Although I'm a bit puzzled by what you're actually getting back. You might try using Luke to look at your index to see what's there. See the NumberTools class for some help here... B

Numerical Range Query

2008-05-12 Thread Dan Hardiker
Hi, I've got an application which stores ratings for content in a Lucene index. It works a treat for the most part, apart from the use-case I have for being able to filter out ratings that have less than a given number of rates. It kinda works, but seems to use Alpha ranging rather than Numer

Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

2008-05-12 Thread Karl Wettin
Lukas Vlcek wrote: Hi, I need to find a reliable way to extract content out of Word, Excel and PowerPoint formats prior to indexing and I am not sure if POI is the best way to go. Can anybody share experience with POI and/or other [commercial] Java library for text extraction from MS formats

Re: Question about startOffset and endOffset

2008-05-12 Thread Karl Wettin
Erick Erickson wrote: Offhand, I expect this will affect span queries, phrase queries, and who knows what else? Maybe scoring? I believe that the offsets are just metadata stored with the term vectors, used by the highlighter etc. Phrase and span queries use term position in the stream (p

Re: Question about startOffset and endOffset

2008-05-12 Thread Erick Erickson
Is this a theoretical question or is there a use-case you're trying to support? If the latter, a statement of the problem you're trying to solve would be helpful. If the former, setting all your start offsets to 0 seems wrong. You're essentially saying that all tokens are at the beginning of the d

Question about startOffset and endOffset

2008-05-12 Thread Brendan Grainger
Hi, I have a TokenStream that inserts synonym tokens into the stream when matched. One thing I am wondering about is what is the effect of the startOffset and endOffset. I have something like this: Token synonymToken = new Token(originalToken.startOffset(), originalToken.endOffset(), "SYN

Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

2008-05-12 Thread Nick Burch
On Mon, 12 May 2008, Lukas Vlcek wrote: I need to find a reliable way to extract content out of Word, Excel and PowerPoint formats prior to indexing and I am not sure if POI is the best way to go. Can anybody share experience with POI and/or other [commercial] Java library for text extracti

posting lists of index are sorted?

2008-05-12 Thread Miguel Costa
Hi all, I have two questions related to the Lucene ranking. 1) Does anyone know how the posting lists (term -> doc1 doc2 doc3) from the index are sorted? Is a TFxIDF value, the boost value, or nothing used to sort the documents (doc1 doc2 doc3)? Does Lucene compute the ranking for all the documents

Re: confused about an entry in the FAQ

2008-05-12 Thread Stephane Nicoll
I tried all this and I am confused about the result. I am trying to implement a hybrid query handler where I fetch the IDs from a database criteria and the IDs from a full-text Lucene query and I intersect them to return the result to the user. The database query and the intersection work fine ev

Search and retrieve the line data from the File

2008-05-12 Thread Madan Narra
Hi All, I am very new to Lucene and want to extend my skills with this tool, but I need to complete a quick assignment soon, so I haven't had much time to read the docs/books on the net. So please suggest how I can achieve the task below and the rest I can

[ANNOUNCE] Lucene Java 2.3.2 release available

2008-05-12 Thread Michael Busch
Release 2.3.2 of Lucene Java is now available! This release contains fixes for bugs found in 2.3.1. It does not contain any new features, API or file format changes, which makes it fully compatible with 2.3.0 and 2.3.1. The detailed change log is at:

Re: Filtering a SpanQuery

2008-05-12 Thread Eran Sevi
Thanks Paul, I'll give your code sample a try. I still think that calling getSpans (the first line of code) that returns millions of results is going to be much slower than calling getSpans that's going to return only a few thousand results. Since the filtering is only performed after calling