Re: Relevance ranking calculation based on filtered document count

2013-07-01 Thread Nigel V Thomas
interesting exploits, particularly in the context of securing the search space by filtering search results. Nigel On 1 July 2013 13:09, Jack Krupansky wrote: > The very definition of a "filter" in Lucene is that it doesn't influence > relevance/scoring in any way, so your question
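Jack's point, that a filter narrows the candidate set without touching the score, can be illustrated with a dependency-free sketch (plain Java, no Lucene; the scoring function is a stand-in for tf-idf and all names are invented for illustration):

```java
import java.util.*;

public class FilterScoringSketch {
    // Score is a function of the document alone (a stand-in for tf-idf);
    // a filter never changes it, it only removes candidates.
    static List<Integer> search(double[] rawScores, Set<Integer> filter) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < rawScores.length; doc++) {
            if (filter == null || filter.contains(doc)) hits.add(doc);
        }
        // Rank by score, highest first; filtered-out docs simply never
        // reach the collector, so surviving docs keep their relative order.
        hits.sort((a, b) -> Double.compare(rawScores[b], rawScores[a]));
        return hits;
    }

    public static void main(String[] args) {
        double[] scores = {0.2, 0.9, 0.5, 0.7};
        System.out.println(search(scores, null)); // [1, 3, 2, 0]
        // The filter drops doc 1; the rest rank exactly as before.
        System.out.println(search(scores, new HashSet<>(Arrays.asList(0, 2, 3)))); // [3, 2, 0]
    }
}
```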

Relevance ranking calculation based on filtered document count

2013-07-01 Thread Nigel V Thomas
values are used to rank documents, the resulting ordering is different from a set whose range is restricted only to the filtered set of documents. Many thanks, Nigel

Re: Indexing file with security problem

2013-06-26 Thread Nigel V Thomas
of any suitable solutions yet. Nigel V Thomas On 26 June 2013 20:42, lukasw wrote: > Hello > > I'll try to briefly describe my problem and task. > My name is Lukas and i am Java developer , my task is to create search > engine for different types of file (only text file types)

Largest Lucene installation?

2010-08-26 Thread Nigel
I'm curious about what the largest Lucene installations are, in terms of: - Greatest number of documents (i.e. X billion docs) - Largest data size (i.e. Y terabytes of indexes) - Most machines (i.e. Z shards or servers) Apart from general curiosity, the obvious follow-up question would be what app

Re: Will doc ids ever change if nothing is deleted?

2010-05-14 Thread Nigel
lication > site: http://www.dbsight.net > demo: http://search.dbsight.com > Lucene Database Search in 3 minutes: > http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes > DBSight customer, a shopping comparison site, (anonymous per request) got > 2

Re: Will doc ids ever change if nothing is deleted?

2010-05-14 Thread Nigel
c), and using those identifiers, > rather than internal Lucene docids, to track documents between the > search stage and the loading stage? > > On Thu, May 13, 2010 at 7:12 PM, Nigel wrote: > > Yes, I realize that storing document IDs persistently (for example) is a > Bad >

Re: Will doc ids ever change if nothing is deleted?

2010-05-13 Thread Nigel
Ds. If you need an invariant document ID, assign it yourself. > > If this is off base, could you supply your use-case? > > Best > Erick > > On Thu, May 13, 2010 at 9:38 PM, Nigel wrote: > > > The FAQ clearly states that document IDs will not be re-assigned unless

Will doc ids ever change if nothing is deleted?

2010-05-13 Thread Nigel
The FAQ clearly states that document IDs will not be re-assigned unless something was deleted. http://wiki.apache.org/lucene-java/LuceneFAQ#When_is_it_possible_for_document_IDs_to_change.3F However, a number of other emails and posts I've read mention that renumbering occurs when segments are merg
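The FAQ's rule can be seen in a toy model: merging concatenates segments in order, so global ids survive a merge, and only a deletion forces the documents after it to shift down. A sketch (invented data structures, no Lucene):

```java
import java.util.*;

public class DocIdRenumberSketch {
    // A toy "index": segments are lists of stored values; a document's
    // global doc id is its position across segments, in order.
    static List<String> merge(List<List<String>> segments, Set<String> deleted) {
        List<String> merged = new ArrayList<>();
        for (List<String> seg : segments)
            for (String doc : seg)
                if (!deleted.contains(doc)) merged.add(doc); // drop deletions
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> segs = Arrays.asList(
            Arrays.asList("a", "b", "c"), Arrays.asList("d", "e"));
        // No deletions: the merge concatenates in order, every doc keeps its id.
        System.out.println(merge(segs, Collections.emptySet())); // [a, b, c, d, e]
        // Delete "b": every doc after it shifts down, i.e. ids are reassigned.
        System.out.println(merge(segs, Collections.singleton("b"))); // [a, c, d, e]
    }
}
```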

Re: Fields with cardinality = 1?

2010-03-08 Thread Nigel
Thanks, Mike -- that makes sense. Yes, the fields would be known in advance so the codec would know to ignore them at index time. Thanks, Chris

Fields with cardinality = 1?

2010-03-07 Thread Nigel
Does Lucene have any special optimization for a field that has the same value for all documents in the index? For example, rather than storing a list of all doc ids for the single term, it could in theory note this special case and not save any ids for that field. (You might well ask what the poi

Scanning docs at index time

2010-02-22 Thread Nigel
I'd like to scan documents as they're being indexed, to find out immediately if any of them match certain queries. The goal is to find out if there are any new hits for these queries as soon as possible, without re-searching the index over and over (which would be inefficient, with higher latency).
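What this thread asks for is prospective search, the problem Lucene's contrib MemoryIndex was built for: evaluate the stored queries against each incoming document as it is indexed, rather than re-running them against the whole index. A dependency-free sketch with simple conjunctive "queries" (all names invented for illustration):

```java
import java.util.*;

public class ProspectiveSearchSketch {
    // Stored "queries", each a set of terms that must all appear in a doc.
    static Map<String, Set<String>> queries = new HashMap<>();

    // Called once per document at index time; returns the queries it matches.
    static List<String> matchAtIndexTime(String doc) {
        Set<String> terms = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Set<String>> q : queries.entrySet())
            if (terms.containsAll(q.getValue())) matched.add(q.getKey());
        Collections.sort(matched);
        return matched;
    }

    public static void main(String[] args) {
        queries.put("lucene-alert", new HashSet<>(Arrays.asList("lucene", "merge")));
        queries.put("gc-alert", new HashSet<>(Arrays.asList("gc", "pause")));
        // Each document is checked exactly once, as it arrives.
        System.out.println(matchAtIndexTime("Lucene segment merge finished")); // [lucene-alert]
    }
}
```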

Re: Index file compatibility and a migration plan to lucene 3

2009-12-10 Thread Nigel
I have a follow-up question to this thread on Field.Store.COMPRESS in 2.9.1 and beyond. I'm getting a bit confused between the changes in 2.9.1 and 3.0 so I want to make sure I know what's going on. We also use old-style compressed fields and are about to upgrade to 2.9.1. Is the following accur
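For context: once Field.Store.COMPRESS goes away, the usual advice is to compress the stored value yourself and store it as a binary field (Lucene 2.9's CompressionTools is roughly a wrapper over java.util.zip). A minimal, Lucene-free sketch of that round trip:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressFieldSketch {
    // Compress the field value before handing it to Lucene as binary data.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) out.write(buf, 0, deflater.deflate(buf));
        return out.toByteArray();
    }

    // Decompress it again after loading the stored field.
    static byte[] decompress(byte[] input) {
        try {
            Inflater inflater = new Inflater();
            inflater.setInput(input);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!inflater.finished()) out.write(buf, 0, inflater.inflate(buf));
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new RuntimeException("corrupt stored field", e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50; i++) sb.append("a long stored field value ");
        String value = sb.toString();
        byte[] packed = compress(value.getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(decompress(packed), StandardCharsets.UTF_8).equals(value)); // true
        System.out.println(packed.length < value.length()); // true
    }
}
```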

Re: Efficiently reopening remotely-distributed indexes in 2.9?

2009-10-09 Thread Nigel
Got it -- thanks, Mark! (Recently I read elsewhere in the archives of this list about the value or lack thereof of segments.gen, so skipping that file was in the back of my mind as well.) Chris On Thu, Oct 8, 2009 at 3:04 PM, Mark Miller wrote: > Nigel wrote: > > Thanks, Mark. T

Re: Efficiently reopening remotely-distributed indexes in 2.9?

2009-10-08 Thread Nigel
> once, so it's not much different than what happens locally. > > Nigel wrote: > > Right now we logically re-open an index by making an updated copy of the > > index in a new directory (using rsync etc.), opening the new copy, and > > closing the old one. We don't u

Re: Efficiently reopening remotely-distributed indexes in 2.9?

2009-10-07 Thread Nigel
)? What does Solr do? Thanks, Chris On Mon, Oct 5, 2009 at 8:39 PM, Jason Rutherglen wrote: > I'm not sure I understand the question. You're trying to reopen > the segments that you're replicated and you're wondering what's > changed in Lucene? > > On Mo

Re: Efficiently reopening remotely-distributed indexes in 2.9?

2009-10-05 Thread Nigel
Anyone have any ideas here? I imagine a lot of other people will have a similar question when trying to take advantage of the reopen improvements in 2.9. Thanks, Chris On Thu, Oct 1, 2009 at 5:15 PM, Nigel wrote: > I have a question about the reopen functionality in Lucene 2.9. A

Efficiently reopening remotely-distributed indexes in 2.9?

2009-10-01 Thread Nigel
I have a question about the reopen functionality in Lucene 2.9. As I understand it, since FieldCaches are now per-segment, it can avoid reloading everything when the index is reopened, and instead just load the new segments. For background, like many people we have a distributed architecture wher
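The per-segment idea can be sketched without Lucene: key the cache by segment, and a reopen then only pays the load cost for segments it has not seen before. (All names are invented; this is a model of the behavior, not the FieldCache implementation.)

```java
import java.util.*;

public class PerSegmentCacheSketch {
    static Map<String, int[]> fieldCache = new HashMap<>();
    static int loads = 0; // counts expensive per-segment loads

    // Loading a segment's field values is expensive; caching per segment
    // means unchanged segments are reused across reopens.
    static int[] getValues(String segment) {
        return fieldCache.computeIfAbsent(segment, s -> { loads++; return new int[8]; });
    }

    static void open(List<String> segments) {
        for (String seg : segments) getValues(seg);
    }

    public static void main(String[] args) {
        open(Arrays.asList("_0", "_1", "_2"));
        System.out.println(loads); // 3
        // Reopen after new docs were flushed into segment _3: only _3 loads.
        open(Arrays.asList("_0", "_1", "_2", "_3"));
        System.out.println(loads); // 4
    }
}
```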

Re: Efficient optimization of large indexes?

2009-08-11 Thread Nigel
Mike, thanks very much for your comments! I won't have time to try these ideas for a little while but when I do I'll definitely post the results. On Fri, Aug 7, 2009 at 12:15 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Thu, Aug 6, 2009 at 5:30 PM, Nigel w

Re: Efficient optimization of large indexes?

2009-08-06 Thread Nigel
On Wed, Aug 5, 2009 at 3:50 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Wed, Aug 5, 2009 at 12:08 PM, Nigel wrote: > > We periodically optimize large indexes (100 - 200gb) by calling > > IndexWriter.optimize(). It takes a heck of a long time, and I'm

Efficient optimization of large indexes?

2009-08-05 Thread Nigel
We periodically optimize large indexes (100 - 200gb) by calling IndexWriter.optimize(). It takes a heck of a long time, and I'm wondering if a more efficient solution might be the following: - Create a new empty index on a different filesystem - Set a merge policy for the new index so it puts eve
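The intuition behind the proposal: with cascading merges every byte is rewritten roughly once per merge level, about log_m(n) times for n flushed segments at merge factor m, whereas copying everything once into a fresh single-segment index rewrites each byte only once. A small cost model (a sketch, not Lucene's actual merge policy):

```java
public class MergeCostSketch {
    // Bytes copied when n equal-size segments cascade-merge with factor m:
    // every byte is rewritten once per merge level, i.e. about log_m(n) times.
    static long cascadeCopies(long n, long m) {
        long copies = 0;
        while (n > 1) {
            n = (n + m - 1) / m; // one merge level: m segments become 1
            copies++;
        }
        return copies;
    }

    public static void main(String[] args) {
        // 1000 flushed segments, merge factor 10: ~3 rewrites per byte,
        // versus a single rewrite for one big copy-into-empty-index pass.
        System.out.println(cascadeCopies(1000, 10)); // 3
        System.out.println(cascadeCopies(1, 10));    // 0
    }
}
```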

Re: Optimizing unordered queries

2009-07-08 Thread Nigel
I created a benchmark test using real queries from our logs. I kept the LRU cache the same for now and varied the index divisor: index divisor = 1: 768 sec. index divisor = 4: 788 sec. (+ 3%) index divisor = 8: 855 sec. (+ 11%) index divisor = 16: 997 sec. (+ 30%) This is exciting news for me, a
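Those numbers make sense under a simple model of the term index divisor: with divisor d only every d-th term stays in RAM, and a lookup binary-searches that sample, then scans at most d terms "on disk". Memory shrinks by a factor of d while lookups slow down only modestly. A dependency-free sketch (not Lucene's TermInfosReader, just the idea):

```java
import java.util.*;

public class IndexDivisorSketch {
    // Returns the position of target among allTerms, consulting an
    // in-memory sample of every divisor-th term first.
    static int lookup(String[] allTerms, int divisor, String target) {
        List<String> sample = new ArrayList<>();
        for (int i = 0; i < allTerms.length; i += divisor) sample.add(allTerms[i]);
        int pos = Collections.binarySearch(sample, target);
        int start = (pos >= 0 ? pos : -pos - 2) * divisor; // nearest sampled term <= target
        for (int i = start; i < Math.min(start + divisor, allTerms.length); i++)
            if (allTerms[i].equals(target)) return i;      // short "disk" scan
        return -1;
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "bee", "cat", "dog", "emu", "fox", "gnu", "hen"};
        // Divisor 4 keeps only {apple, emu} in memory yet still finds "fox"
        // after scanning a handful of terms: RAM traded for a few reads.
        System.out.println(lookup(terms, 4, "fox")); // 5
        System.out.println(lookup(terms, 1, "fox")); // 5
    }
}
```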

Re: Optimizing unordered queries

2009-07-06 Thread Nigel
On Mon, Jul 6, 2009 at 12:37 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Mon, Jun 29, 2009 at 9:33 AM, Nigel wrote: > > > Ah, I was confused by the index divisor being 1 by default: I thought it > > meant that all terms were being loaded. I see now i

Re: Optimizing unordered queries

2009-06-29 Thread Nigel
On Mon, Jun 29, 2009 at 6:28 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Sun, Jun 28, 2009 at 9:08 PM, Nigel wrote: > >> Unfortunately the TermInfos must still be hit to look up the > >> freq/proxOffset in the postings files. > > > > But

Re: Optimizing unordered queries

2009-06-28 Thread Nigel
On Fri, Jun 26, 2009 at 11:06 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Thu, Jun 25, 2009 at 10:11 PM, Nigel wrote: > > > Currently we're (perhaps naively) doing the equivalent of > > query.weight(searcher).scorer(reader).score(collector). Obvi

Re: Optimizing unordered queries

2009-06-28 Thread Nigel
On Fri, Jun 26, 2009 at 10:52 AM, eks dev wrote: > > also see, > http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/BooleanQuery.html#getAllowDocsOutOfOrder() Interesti

Re: Optimizing unordered queries

2009-06-28 Thread Nigel
On Fri, Jun 26, 2009 at 10:51 AM, eks dev wrote: > > You omitNorms(), did you also omitTf()? We did, but had to include TF after all since omitting it also dropped position information, which we needed for phrase queries. I didn't think it was possible to remove just the TFs without the positi
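The coupling described here can be modeled directly: phrase matching needs per-term positions, so postings reduced to bare frequencies (what omitTf did in that era) cannot distinguish "brown fox" from "fox brown". A Lucene-free sketch:

```java
import java.util.*;

public class PositionsSketch {
    // Postings that keep term positions can answer phrase queries;
    // postings that only keep a per-doc frequency cannot.
    static Map<String, List<Integer>> positions(String doc) {
        Map<String, List<Integer>> p = new HashMap<>();
        String[] terms = doc.toLowerCase().split("\\s+");
        for (int i = 0; i < terms.length; i++)
            p.computeIfAbsent(terms[i], t -> new ArrayList<>()).add(i);
        return p;
    }

    static boolean phraseMatch(Map<String, List<Integer>> p, String a, String b) {
        if (!p.containsKey(a) || !p.containsKey(b)) return false;
        for (int pos : p.get(a))
            if (p.get(b).contains(pos + 1)) return true; // b directly follows a
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> p = positions("quick brown fox jumps over the brown dog");
        System.out.println(phraseMatch(p, "brown", "fox"));   // true
        System.out.println(phraseMatch(p, "brown", "jumps")); // false
        // With only frequencies (positions dropped), both checks would have
        // to answer "maybe": term order is unrecoverable.
    }
}
```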

Optimizing unordered queries

2009-06-25 Thread Nigel
I recently posted some questions about performance problems with large indexes. One key thing about our situation is that we don't need sorted results (either by relevance or any other key). I've been looking into our memory usage and tracing through some code, which in combination with the recen

Re: Analyzing performance and memory consumption for boolean queries

2009-06-25 Thread Nigel
On Wed, Jun 24, 2009 at 4:47 PM, Uwe Schindler wrote: > Have you tried out, if GC affects you? A first step would be to turn on GC > logging with -verbosegc -XX:+PrintGCDetails > > If you see some relation between query time and gc messages, you should try > to use a better parallelized GC and ch

Re: Setting swappiness

2009-06-24 Thread Nigel
This is interesting, and counter-intuitive: more queries could actually improve overall performance. The big-index-and-slow-query-rate does describe our situation. I'll try running some tests that run queries at various rates concurrent with occasional big I/O operations that use the disk cache.

Re: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread Nigel
Hi Mike, Yes, we're indexing on a separate server, and rsyncing from index snapshots there to the search servers. Usually rsync has to copy just a few small .cfs files, but every once in a while merging will produce a big one. I'm going to try to limit this by setting maxMergeMB, but of course t

Re: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread Nigel
Hi Uwe, Good points, thank you. The obvious place where GC really has to work hard is when index changes are rsync'd over and we have to open the new index and close the old one. Our slow performance times don't seem to be directly correlated with the index rotation, but maybe it just appears th

Re: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread Nigel
of memory, but it can't be everything, otherwise OS caching would have no effect. Thanks, Chris On Tue, Jun 23, 2009 at 11:16 PM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: > > Nigel, > > Based on the description, I'd suspect unnecessarily(?) large JVM heap

Re: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread Nigel
Hi Ken, Thanks for your reply. I agree that your overall diagnosis (GC problems and/or swapping) sounds likely. To follow up on some of the specific things you mentioned: 2. 250M/4 = 60M docs/index. The old rule of thumb was 10M docs/index as a > reasonable size. You might just need more hardware.

Analyzing performance and memory consumption for boolean queries

2009-06-23 Thread Nigel
Our query performance is surprisingly inconsistent, and I'm trying to figure out why. I've realized that I need to better understand what's going on internally in Lucene when we're searching. I'd be grateful for any answers (including pointers to existing docs, if any). Our situation is this: We