RE: SweetSpotSimilarity

2012-02-15 Thread Chris Hostetter
: sloppyFreq(distance). hyperbolicTf() only comes into play if you : override the tf method in your own subclass to call it instead of the : baselineTf which it normally calls. I also didn't get what it was : trying to do. Correct, as documented... http://lucene.apache.org/core/old_versioned

Re: Why read past EOF

2012-02-15 Thread superruiye
My IndexWriter is created only once and cached in memory. I restarted Tomcat this morning, and the index became 94M... But I restarted several times yesterday, and it was still too big... My deletion policy is in the reply above; it only compares the timestamp and does not actually delete commits. -- View this message in cont

RE: SweetSpotSimilarity

2012-02-15 Thread Paul Allan Hill
I'd love to hear what you find out. I have been working with this also. I only changed the sweet spot to a slightly larger range than the one in the original paper (but kept the same steepness) and I tweaked the sloppy freq to not score multiple occurrences of a phrase as strongly as they are i

Re: Indexing 100Gb of readonly numeric data

2012-02-15 Thread Pedro Ferreira
Thanks Erick, Yes, the limitations you pointed out confirm my first feeling on it. Even if it is doable with Solr or Lucene, I would have to go deep inside of it to get the most out of it. About my RDBMS issues... there are 2 reasons: First, I'm interested in this whole cloud craziness. I love to work

Re: Indexing 100Gb of readonly numeric data

2012-02-15 Thread Erick Erickson
Actually, you might well have your index be larger than your source, assuming you're going to be both storing and indexing everything. There's also the "deep paging" issue, see: https://issues.apache.org/jira/browse/SOLR-1726 which comes into play if you expect to return a lot of rows. Solr really

RE: Short circuit AND or subquerying in lucene for performance

2012-02-15 Thread Uwe Schindler
> : Basically for queries such as field1:foo AND field2:*bar, I think it > : would be highly beneficial to restrict evaluation of the second field on > : the result of the first to avoid scanning the index in its entirety due > : to the leading wildcard. > > This is exactly how the BooleanQuery cl

Re: Short circuit AND or subquerying in lucene for performance

2012-02-15 Thread Chris Hostetter
: Basically for queries such as field1:foo AND field2:*bar, I think it : would be highly beneficial to restrict evaluation of the second field on : the result of the first to avoid scanning the index in its entirety due : to the leading wildcard. This is exactly how the BooleanQuery class in Luce

Indexing 100Gb of readonly numeric data

2012-02-15 Thread Pedro Ferreira
Hi guys, I hope I'm sending this to the right place. I have this possible idea in mind (still fuzzy, but enough to describe this), and I was wondering if Lucene or Solr could help in this. I've implemented a Lucene index on custom enterprise data before and have it running on Azure as well, so I

SweetSpotSimilarity

2012-02-15 Thread Peyman Faratin
Hi, I have a newbie question. I am trying to use the SweetSpotSimilarity (SSS) class. http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/contrib-misc/org/apache/lucene/misc/SweetSpotSimilarity.html I understand the scoring behavior of Lucene http://lucene.apache.org/core/old_ve

RE: Empty numeric field

2012-02-15 Thread Uwe Schindler
Hi again, I just have to remind you that sorting on multi-valued fields is not supported by Lucene! This has nothing to do with numeric fields; it just does not work and may throw other exceptions depending on the version you use. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.th

Re: Empty numeric field

2012-02-15 Thread Christian Reuschling
Uwe, thank you very much. This sounds like pretty much the best solution! 2012/2/15 Uwe Schindler : > Hi, > > Thanks for the explanation. I almost expected that it has to do with "stored > fields". It's easy to fix: > >> ah ok, I know what you mean. We have to read out the stored field values >> later.

RE: Empty numeric field

2012-02-15 Thread Uwe Schindler
Hi, Thanks for the explanation. I almost expected that it has to do with "stored fields". It's easy to fix: > ah ok, I know what you mean. We have to read out the stored field values > later. > A field can have multiple (stored) values (several > document.add(fieldable) invocations for one field).

Re: Empty numeric field

2012-02-15 Thread Christian Reuschling
ah ok, I know what you mean. We have to read out the stored field values later. A field can have multiple (stored) values (several document.add(fieldable) invocations for one field). Further, we have the problem that some field values are logically related to each other. Since Lucene has no possibi

RE: Empty numeric field

2012-02-15 Thread Uwe Schindler
Hi, This looks like an XY problem (http://www.perlmonks.org/index.pl?node_id=542341). Maybe you should first explain to us why you need that. In Lucene, fields have no "equal length" or anything like that; numeric fields in particular are tokenized and consist of several tokens separately indexe

Short circuit AND or subquerying in lucene for performance

2012-02-15 Thread Delalande, Thierry
Hi, I've been looking for a short circuit AND operator in Lucene or a way to do subquerying. Basically for queries such as field1:foo AND field2:*bar, I think it would be highly beneficial to restrict evaluation of the second field on the result of the first to avoid scanning the index in its

Re: Why read past EOF

2012-02-15 Thread Michael McCandless
Is your deletion policy actually deleting commits? Mike McCandless http://blog.mikemccandless.com On Wed, Feb 15, 2012 at 5:21 AM, superruiye wrote: > http://lucene.472066.n3.nabble.com/file/n3746464/index.jpg > > The index files are the same size, and the index increased to 7.5G in one day, but > it

Re: Why read past EOF

2012-02-15 Thread superruiye
http://lucene.472066.n3.nabble.com/file/n3746464/index.jpg The index files are the same size, and the index increased to 7.5G in one day, but it should only be 90-100M... -- View this message in context: http://lucene.472066.n3.nabble.com/Why-read-past-EOF-tp3639401p3746464.html Sent from the Lucene - J

Re: effectiveness of compression

2012-02-15 Thread Li Li
For now, Lucene doesn't provide anything like this. Maybe you can diff each version before adding it to the index, so that it just indexes and stores the differences for newer versions. On Wed, Feb 15, 2012 at 4:25 PM, Jamie wrote: > Greetings All. > > I'd like to index data corresponding to different versions

effectiveness of compression

2012-02-15 Thread Jamie
Greetings All. I'd like to index data corresponding to different versions of the same file. These files consist of PDF documents, Word documents, and the like. So as to ensure that no information is lost, I'd like to create a new Lucene document for every version (or change) in a file. Each