Re: Index size and performance degradation

2011-06-16 Thread Denis Bazhenov
needs to skip large >> amount bytes to locate the exact location. In other words the search summary >> retrieval might slow. >> 3. It is really good for less number of concurrent users going to search at >> a time. >> >> Regards >> Ganesh >>

Re: Index size and performance degradation

2011-06-15 Thread Shai Erera
for less number of concurrent users going to search at > a time. > > Regards > Ganesh > > > > - Original Message - > From: "Ganesh" > To: ; > Sent: Tuesday, June 14, 2011 3:28 PM > Subject: Re: Index size and performance degradation > >

Re: Index size and performance degradation

2011-06-15 Thread Ganesh
less number of concurrent users going to search at a time. Regards Ganesh - Original Message - From: "Ganesh" To: ; Sent: Tuesday, June 14, 2011 3:28 PM Subject: Re: Index size and performance degradation Is it a bad idea to keep multiple shards in a single system? Rega

Re: Index size and performance degradation

2011-06-14 Thread Michael McCandless
> On Tue, Jun 14, 2011 at 5:58 AM, Ganesh wrote: >> Is it a bad idea to keep multiple shards in a single system? >> >> Regards >> Ganesh >> >> - Original Message - >> From: "Toke Eskildsen" >> To: >> Sent: Tuesday, June 14

Re: Index size and performance degradation

2011-06-14 Thread Michael McCandless
al Message - > From: "Toke Eskildsen" > To: > Sent: Tuesday, June 14, 2011 12:58 PM > Subject: Re: Index size and performance degradation > > >> On Sun, 2011-06-12 at 10:10 +0200, Itamar Syn-Hershko wrote: >>> The whole point of my question wa

Re: Index size and performance degradation

2011-06-14 Thread Ganesh
Is it a bad idea to keep multiple shards in a single system? Regards Ganesh - Original Message - From: "Toke Eskildsen" To: Sent: Tuesday, June 14, 2011 12:58 PM Subject: Re: Index size and performance degradation > On Sun, 2011-06-12 at 10:10 +0200, Itamar Syn-Hershko

Re: Index size and performance degradation

2011-06-14 Thread Stefan Trcek
On Sunday 12 June 2011 22:12:01 Michael McCandless wrote: > Anyway, I don't think that's a good tradeoff, in general, for our > users, because very few apps truly require immediate consistency from > Lucene (can anyone give an example where their app depends on > immediate consistency...? For data

Re: Index size and performance degradation

2011-06-14 Thread mark harwood
ations for that design choice. Cheers Mark - Original Message From: Itamar Syn-Hershko To: java-user@lucene.apache.org Sent: Tue, 14 June, 2011 9:03:15 Subject: Re: Index size and performance degradation Thanks. Our product is pretty generic and we can't assume much on the hardware,

Re: Index size and performance degradation

2011-06-14 Thread Itamar Syn-Hershko
Thanks. Our product is pretty generic and we can't assume much on the hardware, as well as on usage. Some users would want low latency, others will prefer throughput. My job is to make as little compromise as possible... As for SSD, that's generally good advice, except they seem to be faili

Re: Index size and performance degradation

2011-06-14 Thread Ganesh
Ganesh - Original Message - From: "Shai Erera" To: Sent: Sunday, June 12, 2011 9:13 AM Subject: Re: Index size and performance degradation >I agree w/ Erick, there is no cutoff point (index size for that matter) > above which you start sharding. > > What

Re: Index size and performance degradation

2011-06-14 Thread Toke Eskildsen
On Sun, 2011-06-12 at 10:10 +0200, Itamar Syn-Hershko wrote: > The whole point of my question was to find out if and how to make > balancing on the SAME machine. Apparently that's not going to help and at > a certain point we will just have to prompt the user to buy more hardware... It really dep

Re: Index size and performance degradation

2011-06-13 Thread Shai Erera
> > but you'll still cache the results - so again this isn't viable when RT > search, or even an NRT, is a requirement > No I don't cache the results. The Filter is an OpenBitSet of all docs that match the filter (e.g. have the specified language field's value) and it is refreshed whenever new seg

Re: Index size and performance degradation

2011-06-13 Thread Jason Rutherglen
> deletions made by readers merely mark it for > deletion, and once a doc has been marked for deletion it is deleted for all > intents and purposes, right? There's the point-in-timeness of a reader to consider. > Does the N in NRT represent only the cost of reopening a searcher? Aptly put, and

Re: Index size and performance degradation

2011-06-13 Thread Itamar Syn-Hershko
Since there should only be one writer, I'm not sure why you'd need transactional storage for that? Deletions made by readers merely mark it for deletion, and once a doc has been marked for deletion it is deleted for all intents and purposes, right? But perhaps I need to refresh my memory on th

Re: Index size and performance degradation

2011-06-13 Thread Jason Rutherglen
> I don't think we'd do the post-filtering solution, but instead maybe > resolve the deletes "live" and store them in a transactional data I think Michael B. aptly described the sequence ID approach for 'live' deletes? On Mon, Jun 13, 2011 at 3:00 PM, Michael McCandless wrote: > Yes, adding dele

Re: Index size and performance degradation

2011-06-13 Thread Michael McCandless
Yes, adding deletes to Twitter's approach will be a challenge! I don't think we'd do the post-filtering solution, but instead maybe resolve the deletes "live" and store them in a transactional data structure of some kind... but even then we will pay a perf hit to look up del docs against it. So, y

Re: Index size and performance degradation

2011-06-13 Thread Itamar Syn-Hershko
Thanks Mike, much appreciated. Wouldn't Twitter's approach fall into the exact same pitfall you described for Zoie (now or in the past) once it handles deletes too? I don't think there is any way of handling deletes other than post-filtering results. But perhaps the IW cache would be smaller tha

Re: Index size and performance degradation

2011-06-13 Thread Itamar Syn-Hershko
On 13/06/2011 06:23, Shai Erera wrote: A Language filter is one -- different users search in different languages and want to view pages in those languages only. If you have a field attached to your documents that identifies the language of the document, you can use it to filter the queries to retur

Re: Index size and performance degradation

2011-06-13 Thread Michael McCandless
Here's a blog post describing some details of Twitter's approach: http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html And here's a talk Michael did last October (Lucene Revolution): http://www.lucidimagination.com/events/revolution2010/video-Realtime-Search-Wit

Re: Index size and performance degradation

2011-06-12 Thread Shai Erera
> > I'm not sure I understood the filters approach you described. Can you give > an example? > A Language filter is one -- different users search in different languages and want to view pages in those languages only. If you have a field attached to your documents that identifies the language of the

Re: Index size and performance degradation

2011-06-12 Thread Itamar Syn-Hershko
Thanks for your detailed answer. We'll have to tackle this and see what's more important to us then. I'd definitely love to hear Zoie has overcome all that... Any pointers to Michael Busch's approach? I take it this has something to do with the core itself or index format, probably using the Flex

Re: Index size and performance degradation

2011-06-12 Thread Michael McCandless
From what I understand of Zoie (and it's been some time since I last looked... so this could be wrong now), the biggest difference vs NRT is that Zoie aims for "immediate consistency", ie index changes are always made visible to the very next query, vs NRT which is "controlled consistency", a blen

Re: Index size and performance degradation

2011-06-12 Thread Itamar Syn-Hershko
Our problem is a bit different. There aren't always common searches so if we cache blindly we could end up having too much RAM allocated for virtually nothing. And we need to allow for real-time search so caching will hardly help. We enforce some client-side caching, but again - the real-time r

Re: Index size and performance degradation

2011-06-12 Thread Itamar Syn-Hershko
Mike, Speaking of NRT, and completely off-topic, I know: Lucene's NRT apparently isn't fast enough if Zoie was needed, and now that Zoie is around are there any plans to make it Lucene's default? Or: why would one still use NRT when Zoie seems to work much better? Itamar. On 12/06/2011 13

Re: Index size and performance degradation

2011-06-12 Thread Shai Erera
> > Shai, what would you call a smart app-level cache? remembering frequent > searches and storing them handy? Remembering frequent searches is good. If you do this, you can warm up the cache whenever a new IndexSearcher is opened (e.g., if you use SearcherManager from LIA2) and besides keeping t
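As a rough illustration of "remembering frequent searches" and re-warming them when a new searcher is opened, here is a minimal sketch in plain Java. It does not use any Lucene API: the class name FrequentSearchCache, the warm method, and the Function standing in for "run this query against the current searcher" are all hypothetical, and the cache is not thread-safe.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache of query results; not part of Lucene. Not thread-safe. */
class FrequentSearchCache<R> {
    private final LinkedHashMap<String, R> entries;

    FrequentSearchCache(final int maxEntries) {
        // accessOrder=true makes iteration order least-recently-used first.
        this.entries = new LinkedHashMap<String, R>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, R> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached result, or runs the search (a stand-in function) and caches it. */
    R get(String query, Function<String, R> search) {
        R cached = entries.get(query);
        if (cached == null) {
            cached = search.apply(query);
            entries.put(query, cached);
        }
        return cached;
    }

    /** Re-runs every remembered query, e.g. after a new searcher has been opened. */
    void warm(Function<String, R> search) {
        List<String> queries = new ArrayList<>(entries.keySet());
        for (String query : queries) {
            entries.put(query, search.apply(query));
        }
    }

    boolean contains(String query) {
        return entries.containsKey(query);
    }
}
```

Bounding the cache by entry count keeps memory use predictable when searches are not repetitive; whether re-warming on every reopen is affordable depends on how often the index changes.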

Re: Index size and performance degradation

2011-06-12 Thread Itamar Syn-Hershko
Andrew, no particular hardware setup I'm afraid. That is a general product which we can't assume anything about the hardware it would run on. Thanks for the tip on multi-core tho. On 12/06/2011 11:45, Andrew Kane wrote: In the literature there is some evidence that sharding of in-memory index

Re: Index size and performance degradation

2011-06-12 Thread Itamar Syn-Hershko
Shai, what would you call a smart app-level cache? remembering frequent searches and storing them handy? or are there more advanced techniques for that? any pointers appreciated... Thanks for all the advice! On 12/06/2011 11:42, Shai Erera wrote: isn't there anything that we can do to avoi

Re: Index size and performance degradation

2011-06-12 Thread Michael McCandless
Remember that memory-mapping is not a panacea: at the end of the day, if there just isn't enough RAM on the machine to keep your full "working set" hot, then the OS will have to hit the disk, regardless of whether the access is through MMap or a "traditional" IO request. That said, on Fedora Linux

Re: Index size and performance degradation

2011-06-12 Thread Andrew Kane
In the literature there is some evidence that sharding of in-memory indexes on multi-core machines might be better. Has anyone tried this lately? http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4228359 Single disk machines (HDD or SSD) would be slower. Multi-disk or RAID type setups

Re: Index size and performance degradation

2011-06-12 Thread Shai Erera
> > isn't there anything that we can do to avoid that? > That was my point :) --> you can optimize your search application, use mmap files, smart caches etc., until it reaches a point where you need to shard. But it's still application dependent, not much of an OS thing. You can count on the OS to

Re: Index size and performance degradation

2011-06-12 Thread Itamar Syn-Hershko
Thanks. The whole point of my question was to find out if and how to make balancing on the SAME machine. Apparently that's not going to help and at a certain point we will just have to prompt the user to buy more hardware... Out of curiosity, isn't there anything that we can do to avoid that

Re: Index size and performance degradation

2011-06-11 Thread Shai Erera
I agree w/ Erick, there is no cutoff point (index size for that matter) above which you start sharding. What you can do is create a scheduled job in your system that runs a select list of queries and monitors their performance. Once it degrades, it shards the index by either splitting it (you can
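The scheduled-job idea above (run a fixed list of probe queries, watch their latency, act once it degrades) could be sketched roughly as follows. This is an assumption-laden illustration in plain Java, not anything from Lucene: the class QueryLatencyMonitor, the baseline and degradation factor, and the Runnable standing in for a real search call are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical monitor comparing probe-query latency against a baseline. */
class QueryLatencyMonitor {
    private final double baselineMillis;
    private final double degradationFactor;
    private final List<Double> samples = new ArrayList<>();

    QueryLatencyMonitor(double baselineMillis, double degradationFactor) {
        this.baselineMillis = baselineMillis;
        this.degradationFactor = degradationFactor;
    }

    /** Records one measured latency in milliseconds. */
    void record(double elapsedMillis) {
        samples.add(elapsedMillis);
    }

    /** Times one probe; the Runnable is a stand-in for executing a real query. */
    void timeProbe(Runnable probeQuery) {
        long start = System.nanoTime();
        probeQuery.run();
        record((System.nanoTime() - start) / 1_000_000.0);
    }

    /** Mean of all recorded latencies, or 0.0 if nothing has been recorded. */
    double meanMillis() {
        if (samples.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double sample : samples) {
            sum += sample;
        }
        return sum / samples.size();
    }

    /** True once the mean latency exceeds baseline * factor. */
    boolean isDegraded() {
        return meanMillis() > baselineMillis * degradationFactor;
    }
}
```

A scheduler (for example a ScheduledExecutorService) would call timeProbe for each probe query and check isDegraded afterwards; what to do on degradation, such as splitting the index, is left to the application.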

Re: Index size and performance degradation

2011-06-11 Thread Erick Erickson
Hmmm, then it's pretty hopeless I think. Problem is that anything you say about running on a machine with 2G available memory on a single processor is completely incomparable to running on a machine with 64G of memory available for Lucene and 16 processors. There's really no such thing as an

Index size and performance degradation

2011-06-11 Thread Itamar Syn-Hershko
Hi all, I know Lucene indexes to be at their optimum up to a certain size - said to be around several GBs. I haven't found a good discussion of this, but it's my understanding that at some point it's better to split an index into parts (a la sharding) than to continue searching on a huge-size