Yes, adding deletes to Twitter's approach will be a challenge! I don't think we'd do the post-filtering solution, but instead maybe resolve the deletes "live" and store them in a transactional data structure of some kind... but even then we'd pay a perf hit to look up deleted docs against it.
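For illustration, here is a rough sketch (against the Lucene 3.x Collector API) of the per-hit lookup cost being discussed: every collected hit is checked against a concurrently maintained set of deleted docs. The class is hypothetical, and keying deletes by global docID is a deliberate simplification -- real docIDs shift as segments merge, which is part of what makes this hard.

    import java.io.IOException;
    import java.util.Set;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // Hypothetical: drops hits whose global docID appears in a set of
    // "live" deletes maintained by the indexing thread.
    public class DeleteFilteringCollector extends Collector {
      private final Collector delegate;
      private final Set<Integer> liveDeletes; // e.g. a ConcurrentSkipListSet
      private int docBase;

      public DeleteFilteringCollector(Collector delegate, Set<Integer> liveDeletes) {
        this.delegate = delegate;
        this.liveDeletes = liveDeletes;
      }

      public void setScorer(Scorer scorer) throws IOException {
        delegate.setScorer(scorer);
      }

      public void setNextReader(IndexReader reader, int docBase) throws IOException {
        this.docBase = docBase;
        delegate.setNextReader(reader, docBase);
      }

      public void collect(int doc) throws IOException {
        // This extra lookup on every single hit is the perf hit in question.
        if (!liveDeletes.contains(docBase + doc)) {
          delegate.collect(doc);
        }
      }

      public boolean acceptsDocsOutOfOrder() {
        return delegate.acceptsDocsOutOfOrder();
      }
    }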
So, yeah, there will presumably be a tradeoff with this approach too. However, turning adds around should be faster (no segment needs to be flushed).

Mike McCandless

http://blog.mikemccandless.com

On Mon, Jun 13, 2011 at 5:06 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:
> Thanks Mike, much appreciated.
>
> Wouldn't Twitter's approach fall into the exact same pitfall you described for Zoie (or that Zoie used to have) once it handles deletes too? I don't think there is any way of handling deletes other than post-filtering results. But perhaps the IW cache would be smaller than Zoie's RAMDirectory(ies)?
>
> I'll give all that a serious dive and report back with results, or if more input is required...
>
> Itamar.
>
> On 13/06/2011 19:01, Michael McCandless wrote:
>
>> Here's a blog post describing some details of Twitter's approach:
>>
>> http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html
>>
>> And here's a talk Michael gave last October (Lucene Revolution):
>>
>> http://www.lucidimagination.com/events/revolution2010/video-Realtime-Search-With-Lucene-presented-by-Michael-Busch-of-Twitter
>>
>> Twitter's case is simpler since they never delete ;) So we'd have to fix that to do it in Lucene... there are also various open issues that begin to explore some of the ideas here.
>>
>> But this ("immediate consistency") would be a deep and complex change, and I don't see many apps that actually require it.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Sun, Jun 12, 2011 at 4:46 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:
>>>
>>> Thanks for your detailed answer. We'll have to tackle this and see what's more important to us, then. I'd definitely love to hear that Zoie has overcome all that...
>>>
>>> Any pointers to Michael Busch's approach? I take it this has something to do with the core itself or the index format, probably using the Flex version?
>>>
>>> Itamar.
>>>
>>> On 12/06/2011 23:12, Michael McCandless wrote:
>>>
>>>> From what I understand of Zoie (and it's been some time since I last looked... so this could be wrong now), the biggest difference vs NRT is that Zoie aims for "immediate consistency", i.e. index changes are always made visible to the very next query, vs NRT's "controlled consistency", a blend between immediate and eventual consistency where your app decides when the changes must become visible.
>>>>
>>>> But in exchange for that, Zoie pays a price: each search has a higher cost per collected hit, since it must post-filter for deleted docs. And since Zoie necessarily adds complexity, there's more risk; e.g. there were some nasty Zoie bugs that took quite some time to track down (under https://issues.apache.org/jira/browse/LUCENE-2729).
>>>>
>>>> Anyway, I don't think that's a good tradeoff, in general, for our users, because very few apps truly require immediate consistency from Lucene (can anyone give an example where their app depends on immediate consistency...?). I think it's better to spend time during reopen so that searches aren't slower.
>>>>
>>>> That said, Lucene has already incorporated one big part of Zoie (caching small segments in RAM) via the new NRTCachingDirectory (in contrib/misc).
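For illustration, a minimal sketch of wiring NRTCachingDirectory in (3.x-era APIs; the path and cache sizes are arbitrary choices, not from the thread):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.NRTCachingDirectory;
    import org.apache.lucene.util.Version;

    public class NRTCachingSketch {
      public static void main(String[] args) throws Exception {
        Directory fsDir = FSDirectory.open(new File("/path/to/index"));
        // Cache newly flushed segments up to 5 MB each, 60 MB total, in RAM.
        Directory dir = new NRTCachingDirectory(fsDir, 5.0, 60.0);
        IndexWriter writer = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_32,
                new StandardAnalyzer(Version.LUCENE_32)));
        // ... add/update documents ...
        // NRT reader: recent small segments are read from RAM, not disk.
        IndexReader reader = IndexReader.open(writer, true);
        // ... search ...
        reader.close();
        writer.close();
      }
    }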
>>>> Also, the upcoming NRTManager (https://issues.apache.org/jira/browse/LUCENE-2955) adds control over the visibility of specific indexing changes to the queries that need to see them.
>>>>
>>>> Finally, even better would be to not have to make any tradeoff whatsoever ;) Twitter's approach (created by Michael Busch) seems to bring immediate consistency with no search performance hit, so if we do anything here it'll likely be similar to what Michael has done (though, those changes are not simple either!).
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Sun, Jun 12, 2011 at 2:25 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:
>>>>>
>>>>> Mike,
>>>>>
>>>>> Speaking of NRT, and completely off-topic, I know: Lucene's NRT apparently isn't fast enough if Zoie was needed, and now that Zoie is around, are there any plans to make it Lucene's default? Or: why would one still use NRT when Zoie seems to work much better?
>>>>>
>>>>> Itamar.
>>>>>
>>>>> On 12/06/2011 13:16, Michael McCandless wrote:
>>>>>
>>>>>> Remember that memory-mapping is not a panacea: at the end of the day, if there just isn't enough RAM on the machine to keep your full "working set" hot, then the OS will have to hit the disk, regardless of whether the access is through MMap or a "traditional" IO request.
>>>>>>
>>>>>> That said, on Fedora Linux anyway, I generally see better performance from MMap than from NIOFSDir; e.g. see the 2nd chart here:
>>>>>>
>>>>>> http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html
>>>>>>
>>>>>> Mike McCandless
>>>>>>
>>>>>> http://blog.mikemccandless.com
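To make the MMap-vs-NIO comparison concrete: FSDirectory.open() picks an implementation based on the platform, but you can choose one explicitly (3.x APIs; the path is hypothetical):

    import java.io.File;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.MMapDirectory;
    import org.apache.lucene.store.NIOFSDirectory;

    public class DirectoryChoiceSketch {
      public static void main(String[] args) throws Exception {
        File path = new File("/path/to/index");
        Directory mmapDir = new MMapDirectory(path);  // memory-mapped reads
        Directory nioDir = new NIOFSDirectory(path);  // positional read() calls
        // Either can be passed to IndexReader.open(...); the OS page cache,
        // not the Directory choice, decides whether a read hits disk.
      }
    }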
>>>>>> On Sun, Jun 12, 2011 at 4:10 AM, Itamar Syn-Hershko <ita...@code972.com> wrote:
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> The whole point of my question was to find out if and how to do the balancing on the SAME machine. Apparently that's not going to help, and at a certain point we will just have to prompt the user to buy more hardware...
>>>>>>>
>>>>>>> Out of curiosity, isn't there anything we can do to avoid that? For instance, using memory-mapped files for the indexes? Anything that would help us overcome OS limitations of that sort...
>>>>>>>
>>>>>>> Also, you mention a scheduled job to check for performance degradation; any idea how serious such a drop should be for sharding to be really beneficial? Or is it application-specific too?
>>>>>>>
>>>>>>> Itamar.
>>>>>>>
>>>>>>> On 12/06/2011 06:43, Shai Erera wrote:
>>>>>>>
>>>>>>>> I agree w/ Erick: there is no cutoff point (index size, for that matter) above which you start sharding.
>>>>>>>>
>>>>>>>> What you can do is create a scheduled job in your system that runs a select list of queries and monitors their performance. Once it degrades, it shards the index, either by splitting it (you can use IndexSplitter under contrib) or by creating a new shard and directing new documents to it.
>>>>>>>>
>>>>>>>> I think I read somewhere, not sure if it was in the Solr or ElasticSearch documentation, about a Balancer object which moves shards around in order to balance the load on the cluster. You can implement something similar which tries to balance the index sizes, creates new shards on-the-fly, and even merges shards if suddenly a whole source is removed from the system, etc.
>>>>>>>>
>>>>>>>> Also, note that the 'largest index size' threshold is really a machine constraint and not Lucene's. So if you decide that 10 GB is your cutoff, it is pointless to create 10x10GB shards on the same machine -- searching them is just like searching a 100GB index w/ 10x10GB segments. Perhaps it's even worse, because you consume more RAM when the indexes are split (e.g., terms index, field infos etc.).
>>>>>>>>
>>>>>>>> Shai
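Shai's "10x10GB shards on one machine is just like one 100GB index" point can be made concrete: searching same-machine shards typically means wrapping them in a MultiReader, which is essentially what a single index with those segments looks like to the searcher anyway (a sketch; the shard paths are hypothetical):

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class SameMachineShardsSketch {
      public static void main(String[] args) throws Exception {
        IndexReader[] shards = new IndexReader[10];
        for (int i = 0; i < shards.length; i++) {
          shards[i] = IndexReader.open(FSDirectory.open(new File("/path/to/shard" + i)));
        }
        // One searcher over all shards: same machine, same RAM, same disks,
        // so roughly the same cost as one big index with these segments.
        IndexSearcher searcher = new IndexSearcher(new MultiReader(shards));
        // ... search, then searcher.close() ...
      }
    }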
>>>>>>>> On Sun, Jun 12, 2011 at 3:10 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> <<<We can't assume anything about the machine running it, so testing won't really tell us much>>>
>>>>>>>>>
>>>>>>>>> Hmmm, then it's pretty hopeless, I think. The problem is that anything you say about running on a machine with 2G of available memory and a single processor is completely incomparable to running on a machine with 64G of memory available for Lucene and 16 processors.
>>>>>>>>>
>>>>>>>>> There's really no such thing as an "optimum" Lucene index size; it always relates to the characteristics of the underlying hardware.
>>>>>>>>>
>>>>>>>>> I think the best you can do is actually test on various configurations; then at least you can say "on configuration X this is the tipping point".
>>>>>>>>>
>>>>>>>>> Sorry there isn't a better answer that I know of, but...
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> Erick
>>>>>>>>>
>>>>>>>>> On Sat, Jun 11, 2011 at 3:37 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I know Lucene indexes are at their optimum up to a certain size - said to be around several GBs. I haven't found a good discussion of this, but it's my understanding that at some point it's better to split an index into parts (a la sharding) than to continue searching one huge index. I assume this has to do with OS and IO configurations. Can anyone point me to more info on this?
>>>>>>>>>>
>>>>>>>>>> We have a product that is using Lucene for various searches, and at the moment each type of search is using its own Lucene index. We plan on refactoring the way it works and combining all the indexes into one - making the whole system more robust and giving it a smaller memory footprint, among other things.
>>>>>>>>>>
>>>>>>>>>> Assuming the above is true, we are interested in knowing how to do this correctly. Initially all our indexes will be combined into one big index, but if at some index size there is severe performance degradation, we would like to handle that correctly by starting a new FSDirectory index to flush into, or by re-indexing and moving large indexes into their own Lucene index.
>>>>>>>>>>
>>>>>>>>>> Are there any guidelines for measuring or estimating this correctly? What should we be aware of while considering all this? We can't assume anything about the machine running it, so testing won't really tell us much...
>>>>>>>>>>
>>>>>>>>>> Thanks in advance for any input on this,
>>>>>>>>>>
>>>>>>>>>> Itamar.
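The scheduled job Shai suggests might look roughly like this -- a hypothetical sketch only: SearchLatencyMonitor, the query list, and the threshold are made-up names, and the threshold itself is application-specific, as the thread concludes:

    import java.io.IOException;
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class SearchLatencyMonitor {
      private final IndexSearcher searcher;
      private final List<Query> benchmarkQueries; // representative queries
      private final long thresholdMs;             // app-specific cutoff

      public SearchLatencyMonitor(IndexSearcher searcher,
                                  List<Query> benchmarkQueries,
                                  long thresholdMs) {
        this.searcher = searcher;
        this.benchmarkQueries = benchmarkQueries;
        this.thresholdMs = thresholdMs;
      }

      public void start() {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
            new Runnable() {
              public void run() {
                for (Query q : benchmarkQueries) {
                  long startNs = System.nanoTime();
                  try {
                    searcher.search(q, 10);
                  } catch (IOException e) {
                    return; // log and bail in real code
                  }
                  long elapsedMs = (System.nanoTime() - startNs) / 1000000L;
                  if (elapsedMs > thresholdMs) {
                    // Degraded: consider splitting the index (contrib's
                    // IndexSplitter) or directing new docs to a new shard.
                  }
                }
              }
            }, 1, 1, TimeUnit.HOURS);
      }
    }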