On Monday 12 May 2008 09:06:36, Eran Sevi wrote:
> Thanks Paul,
>
> I'll give your code sample a try.
> I still think that calling getSpans (the first line of code) that
> returns millions of results is going to be much slower than calling
> getSpans that's going to return only a few thousands of
Hi Erick,
Thanks for the reply. The use case I have is this:
Say you have a synonym expansion like this:
ac -> air conditioning
And to keep it simple, a document where the first term is ac. When
analyzing the document I currently create a token stream that looks
something like this for the
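The stacking described above can be sketched in plain Java. This is an illustrative model, not Lucene's actual TokenStream API: the `Tok` class and `expand` method are made-up names, and the key idea is that injected synonym tokens get a position increment of 0 so they occupy the same slot as the original term.

```java
// Hypothetical sketch (not Lucene's API): modeling how a synonym filter
// stacks "air conditioning" on top of "ac" by giving the first injected
// token a position increment of 0.
import java.util.ArrayList;
import java.util.List;

public class SynonymStack {
    // Minimal stand-in for a token: text, position increment, offsets.
    static final class Tok {
        final String text;
        final int posIncr;      // 0 = same position as the previous token
        final int start, end;   // character offsets into the source text
        Tok(String text, int posIncr, int start, int end) {
            this.text = text; this.posIncr = posIncr;
            this.start = start; this.end = end;
        }
    }

    // Expand "ac" into the original token plus stacked synonym tokens.
    static List<Tok> expand(String word, int start, int end) {
        List<Tok> out = new ArrayList<>();
        out.add(new Tok(word, 1, start, end));
        if (word.equals("ac")) {
            // Synonym tokens reuse the original offsets; "air" shares the
            // position of "ac", and "conditioning" takes the next position.
            out.add(new Tok("air", 0, start, end));
            out.add(new Tok("conditioning", 1, start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : expand("ac", 0, 2)) {
            System.out.println(t.text + " posIncr=" + t.posIncr
                + " offsets=" + t.start + "-" + t.end);
        }
    }
}
```

Note that putting "conditioning" at the next position can collide with the document's real next token; that limitation of multi-word synonym expansion is exactly the kind of issue this thread is discussing.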
Are you using NumberTools both at index and query time? Because
this works exactly as I expect
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import
Erick Erickson wrote:
Although I'm a bit puzzled by what you're actually getting back.
You might try using Luke to look at your index to see what's
there.
I've looked through with Luke and it doesn't look like much has changed
between using NumberTools and not. NumberTools definitely does some
Yep, Lucene works with strings, not numbers, so the fact that you're
not getting what you expect is expected.
Although I'm a bit puzzled by what you're actually getting back.
You might try using Luke to look at your index to see what's
there.
See the NumberTools class for some help here...
B
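The point about strings versus numbers can be shown in a few lines. This is a generic Java sketch of the underlying problem, not NumberTools' actual encoding: unpadded numbers sort lexicographically ("9" after "10"), while zero-padding to a fixed width makes string order agree with numeric order. The padding width here is an illustrative choice.

```java
// Why "alpha ranging" bites: terms are compared as strings, so "9" sorts
// after "10". Zero-padding to a fixed width (roughly the idea behind
// NumberTools) makes lexicographic order match numeric order.
public class PadDemo {
    static String pad(long n, int width) {
        return String.format("%0" + width + "d", n);
    }

    public static void main(String[] args) {
        // String order is wrong for numbers:
        System.out.println("9".compareTo("10") > 0);             // true
        // Padded order is right:
        System.out.println(pad(9, 5).compareTo(pad(10, 5)) < 0); // true
    }
}
```

This is why the same transformation must be applied at both index and query time, as asked above: a padded index queried with unpadded values will miss.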
Hi,
I've got an application which stores ratings for content in a Lucene
index. It works a treat for the most part, apart from the use-case I
have for being able to filter out ratings that have fewer than a given
number of rates. It kinda works, but seems to use alpha ranging rather
than Numer
Lukas Vlcek wrote:
Hi,
I need to find a reliable way to extract content from Word, Excel and
PowerPoint formats prior to indexing, and I am not sure if POI is the best
way to go. Can anybody share experience with POI and/or other [commercial]
Java libraries for text extraction from MS formats
Erick Erickson wrote:
Offhand, I expect this will affect span queries, phrase
queries, and who knows what else? Maybe scoring?
I believe that the offsets are just metadata stored with the term
vectors, used by the highlighter etc. Phrase and span queries use term
position in the stream (p
Is this a theoretical question or is there a use-case you're trying
to support? If the latter, a statement of the problem you're trying
to solve would be helpful.
If the former, setting all your start offsets to 0 seems wrong. You're
essentially saying that all tokens are at the beginning of the d
Hi,
I have a TokenStream that inserts synonym tokens into the stream when
matched. One thing I am wondering about is what is the effect of the
startOffset and endOffset. I have something like this:
Token synonymToken = new Token(originalToken.startOffset(),
originalToken.endOffset(), "SYN
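One practical effect of the offsets, touched on above, is mapping a hit back to the source text. The sketch below is illustrative Java, not Lucene's highlighter: the `highlight` helper is a made-up name, and the point is that because a synonym token inherits the original token's startOffset/endOffset, a match on the synonym still marks the original surface word.

```java
// Sketch of why a synonym token should keep the original token's offsets:
// anything that maps a hit back to the source text (e.g. a highlighter)
// uses startOffset/endOffset to decide which characters to mark.
public class OffsetDemo {
    static String highlight(String text, int start, int end) {
        return text.substring(0, start) + "<b>"
             + text.substring(start, end) + "</b>"
             + text.substring(end);
    }

    public static void main(String[] args) {
        String text = "the ac is broken";
        // A synonym for "ac" inherits offsets 4-6, so a match on the
        // synonym still highlights the original word "ac".
        System.out.println(highlight(text, 4, 6));
    }
}
```

This also shows why setting all start offsets to 0 (the case questioned earlier in the thread) would be wrong: every match would highlight from the beginning of the document.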
On Mon, 12 May 2008, Lukas Vlcek wrote:
I need to find a reliable way to extract content from Word, Excel
and PowerPoint formats prior to indexing and I am not sure if POI is the
best way to go. Can anybody share experience with POI and/or other
[commercial] Java libraries for text extracti
Hi all,
I have two questions related to the Lucene ranking.
1) Does anyone know how the posting lists (term -> doc1 doc2 doc3) in the
index are sorted?
Is a TFxIDF value, the boost value, or neither used to sort the documents
(doc1 doc2 doc3)? Does Lucene compute the ranking for all the documents
I tried all this and I am confused about the result. I am trying to
implement a hybrid query handler where I fetch the IDs from a
database criteria and the IDs from a full text lucene query and I
intersect them to return the result to the user. The database query
and the intersection works fine ev
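The intersection step described above can be sketched generically. This is plain Java, not a Lucene API, and it assumes both ID lists arrive sorted ascending (database results ordered by ID, Lucene hits collected into a sorted array): a two-pointer merge then intersects them in linear time.

```java
// A minimal sketch of the hybrid-query intersection step: given sorted
// ID lists from the database and from the full-text query, a two-pointer
// merge intersects them in O(n + m).
import java.util.ArrayList;
import java.util.List;

public class Intersect {
    static List<Integer> intersect(int[] dbIds, int[] ftIds) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < dbIds.length && j < ftIds.length) {
            if (dbIds[i] < ftIds[j]) i++;          // advance the smaller side
            else if (dbIds[i] > ftIds[j]) j++;
            else { out.add(dbIds[i]); i++; j++; }  // common ID: keep it
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(
            intersect(new int[]{1, 3, 5, 9}, new int[]{2, 3, 9, 11}));
        // prints [3, 9]
    }
}
```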
Hi All,
I am very new to Lucene and want to extend my skills with this tool,
but I need to complete a quick assignment soon, so I haven't had much
time to read the docs/books on the net.
So please suggest how I can achieve the task below, and the rest I can
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Release 2.3.2 of Lucene Java is now available!
This release contains fixes for bugs found in 2.3.1. It does not contain
any new features, API or file format changes, which makes it fully
compatible with 2.3.0 and 2.3.1.
The detailed change log is at:
Thanks Paul,
I'll give your code sample a try.
I still think that calling getSpans (the first line of code) when it
returns millions of results is going to be much slower than a getSpans
call that returns only a few thousand results. Since the filtering is only
performed after calling