Yes, adding deletes to Twitter's approach will be a challenge! I don't think we'd do the post-filtering solution, but instead maybe resolve the deletes "live" and store them in a transactional data structure of some kind... but even then we'd pay a perf hit to look up deleted docs against it.
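For illustration, here is a rough sketch (against the Lucene 3.x Collector API) of the per-hit lookup cost being discussed: every collected hit is checked against a concurrently maintained set of deleted docs. The class is hypothetical, and keying deletes by global docID is a deliberate simplification -- real docIDs shift as segments merge, which is part of what makes this hard.

    import java.io.IOException;
    import java.util.Set;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // Hypothetical: drops hits whose global docID appears in a set of
    // "live" deletes maintained by the indexing thread.
    public class DeleteFilteringCollector extends Collector {
      private final Collector delegate;
      private final Set<Integer> liveDeletes; // e.g. a ConcurrentSkipListSet
      private int docBase;

      public DeleteFilteringCollector(Collector delegate, Set<Integer> liveDeletes) {
        this.delegate = delegate;
        this.liveDeletes = liveDeletes;
      }

      public void setScorer(Scorer scorer) throws IOException {
        delegate.setScorer(scorer);
      }

      public void setNextReader(IndexReader reader, int docBase) throws IOException {
        this.docBase = docBase;
        delegate.setNextReader(reader, docBase);
      }

      public void collect(int doc) throws IOException {
        // This extra lookup on every single hit is the perf hit in question.
        if (!liveDeletes.contains(docBase + doc)) {
          delegate.collect(doc);
        }
      }

      public boolean acceptsDocsOutOfOrder() {
        return delegate.acceptsDocsOutOfOrder();
      }
    }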
So, yeah, there will presumably be a tradeoff with this approach too. However, turning adds around should be faster (no segment needs to be flushed).

Mike McCandless

http://blog.mikemccandless.com

On Mon, Jun 13, 2011 at 5:06 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:
> Thanks Mike, much appreciated.
>
> Wouldn't Twitter's approach fall into the exact same pitfall you described for Zoie (or that Zoie used to have) once it handles deletes too? I don't think there is any way of handling deletes other than post-filtering results. But perhaps the IW cache would be smaller than Zoie's RAMDirectory(ies)?
>
> I'll give all that a serious dive and report back with results, or if more input is required...
>
> Itamar.
>
> On 13/06/2011 19:01, Michael McCandless wrote:
>
>> Here's a blog post describing some details of Twitter's approach:
>>
>> http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html
>>
>> And here's a talk Michael gave last October (Lucene Revolution):
>>
>> http://www.lucidimagination.com/events/revolution2010/video-Realtime-Search-With-Lucene-presented-by-Michael-Busch-of-Twitter
>>
>> Twitter's case is simpler since they never delete ;) So we'd have to fix that to do it in Lucene... there are also various open issues that begin to explore some of the ideas here.
>>
>> But this ("immediate consistency") would be a deep and complex change, and I don't see many apps that actually require it.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Sun, Jun 12, 2011 at 4:46 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:
>>>
>>> Thanks for your detailed answer. We'll have to tackle this and see what's more important to us, then. I'd definitely love to hear that Zoie has overcome all that...
>>>
>>> Any pointers to Michael Busch's approach? I take it this has something to do with the core itself or the index format, probably using the Flex version?
>>>
>>> Itamar.
>>>
>>> On 12/06/2011 23:12, Michael McCandless wrote:
>>>
>>>> From what I understand of Zoie (and it's been some time since I last looked... so this could be wrong now), the biggest difference vs NRT is that Zoie aims for "immediate consistency", i.e. index changes are always made visible to the very next query, vs NRT's "controlled consistency", a blend between immediate and eventual consistency where your app decides when the changes must become visible.
>>>>
>>>> But in exchange for that, Zoie pays a price: each search has a higher cost per collected hit, since it must post-filter for deleted docs. And since Zoie necessarily adds complexity, there's more risk; e.g. there were some nasty Zoie bugs that took quite some time to track down (under https://issues.apache.org/jira/browse/LUCENE-2729).
>>>>
>>>> Anyway, I don't think that's a good tradeoff, in general, for our users, because very few apps truly require immediate consistency from Lucene (can anyone give an example where their app depends on immediate consistency...?). I think it's better to spend time during reopen so that searches aren't slower.
>>>>
>>>> That said, Lucene has already incorporated one big part of Zoie (caching small segments in RAM) via the new NRTCachingDirectory (in contrib/misc).
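For illustration, a minimal sketch of wiring NRTCachingDirectory in (3.x-era APIs; the path and cache sizes are arbitrary choices, not from the thread):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.NRTCachingDirectory;
    import org.apache.lucene.util.Version;

    public class NRTCachingSketch {
      public static void main(String[] args) throws Exception {
        Directory fsDir = FSDirectory.open(new File("/path/to/index"));
        // Cache newly flushed segments up to 5 MB each, 60 MB total, in RAM.
        Directory dir = new NRTCachingDirectory(fsDir, 5.0, 60.0);
        IndexWriter writer = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_32,
                new StandardAnalyzer(Version.LUCENE_32)));
        // ... add/update documents ...
        // NRT reader: recent small segments are read from RAM, not disk.
        IndexReader reader = IndexReader.open(writer, true);
        // ... search ...
        reader.close();
        writer.close();
      }
    }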
>>>> Also, the upcoming NRTManager (https://issues.apache.org/jira/browse/LUCENE-2955) adds control over the visibility of specific indexing changes to the queries that need to see them.
>>>>
>>>> Finally, even better would be to not have to make any tradeoff whatsoever ;) Twitter's approach (created by Michael Busch) seems to bring immediate consistency with no search performance hit, so if we do anything here it'll likely be similar to what Michael has done (though, those changes are not simple either!).
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Sun, Jun 12, 2011 at 2:25 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:
>>>>>
>>>>> Mike,
>>>>>
>>>>> Speaking of NRT, and completely off-topic, I know: Lucene's NRT apparently isn't fast enough if Zoie was needed, and now that Zoie is around, are there any plans to make it Lucene's default? Or: why would one still use NRT when Zoie seems to work much better?
>>>>>
>>>>> Itamar.
>>>>>
>>>>> On 12/06/2011 13:16, Michael McCandless wrote:
>>>>>
>>>>>> Remember that memory-mapping is not a panacea: at the end of the day, if there just isn't enough RAM on the machine to keep your full "working set" hot, then the OS will have to hit the disk, regardless of whether the access is through MMap or a "traditional" IO request.
>>>>>>
>>>>>> That said, on Fedora Linux anyway, I generally see better performance from MMap than from NIOFSDir; e.g. see the 2nd chart here:
>>>>>>
>>>>>> http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html
>>>>>>
>>>>>> Mike McCandless
>>>>>>
>>>>>> http://blog.mikemccandless.com
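To make the MMap-vs-NIO comparison concrete: FSDirectory.open() picks an implementation based on the platform, but you can choose one explicitly (3.x APIs; the path is hypothetical):

    import java.io.File;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.MMapDirectory;
    import org.apache.lucene.store.NIOFSDirectory;

    public class DirectoryChoiceSketch {
      public static void main(String[] args) throws Exception {
        File path = new File("/path/to/index");
        Directory mmapDir = new MMapDirectory(path);  // memory-mapped reads
        Directory nioDir = new NIOFSDirectory(path);  // positional read() calls
        // Either can be passed to IndexReader.open(...); the OS page cache,
        // not the Directory choice, decides whether a read hits disk.
      }
    }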
>>>>>> On Sun, Jun 12, 2011 at 4:10 AM, Itamar Syn-Hershko <ita...@code972.com> wrote:
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> The whole point of my question was to find out if and how to do the balancing on the SAME machine. Apparently that's not going to help, and at a certain point we will just have to prompt the user to buy more hardware...
>>>>>>>
>>>>>>> Out of curiosity, isn't there anything we can do to avoid that? For instance, using memory-mapped files for the indexes? Anything that would help us overcome OS limitations of that sort...
>>>>>>>
>>>>>>> Also, you mention a scheduled job to check for performance degradation; any idea how serious such a drop should be for sharding to be really beneficial? Or is it application-specific too?
>>>>>>>
>>>>>>> Itamar.
>>>>>>>
>>>>>>> On 12/06/2011 06:43, Shai Erera wrote:
>>>>>>>
>>>>>>>> I agree w/ Erick: there is no cutoff point (index size, for that matter) above which you start sharding.
>>>>>>>>
>>>>>>>> What you can do is create a scheduled job in your system that runs a select list of queries and monitors their performance. Once it degrades, it shards the index, either by splitting it (you can use IndexSplitter under contrib) or by creating a new shard and directing new documents to it.
>>>>>>>>
>>>>>>>> I think I read somewhere, not sure if it was in the Solr or ElasticSearch documentation, about a Balancer object which moves shards around in order to balance the load on the cluster. You can implement something similar which tries to balance the index sizes, creates new shards on-the-fly, and even merges shards if suddenly a whole source is removed from the system, etc.
>>>>>>>>
>>>>>>>> Also, note that the 'largest index size' threshold is really a machine constraint and not Lucene's. So if you decide that 10 GB is your cutoff, it is pointless to create 10x10GB shards on the same machine -- searching them is just like searching a 100GB index w/ 10x10GB segments. Perhaps it's even worse, because you consume more RAM when the indexes are split (e.g., terms index, field infos etc.).
>>>>>>>>
>>>>>>>> Shai
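Shai's "10x10GB shards on one machine is just like one 100GB index" point can be made concrete: searching same-machine shards typically means wrapping them in a MultiReader, which is essentially what a single index with those segments looks like to the searcher anyway (a sketch; the shard paths are hypothetical):

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class SameMachineShardsSketch {
      public static void main(String[] args) throws Exception {
        IndexReader[] shards = new IndexReader[10];
        for (int i = 0; i < shards.length; i++) {
          shards[i] = IndexReader.open(FSDirectory.open(new File("/path/to/shard" + i)));
        }
        // One searcher over all shards: same machine, same RAM, same disks,
        // so roughly the same cost as one big index with these segments.
        IndexSearcher searcher = new IndexSearcher(new MultiReader(shards));
        // ... search, then searcher.close() ...
      }
    }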
>>>>>>>> On Sun, Jun 12, 2011 at 3:10 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> <<<We can't assume anything about the machine running it, so testing won't really tell us much>>>
>>>>>>>>>
>>>>>>>>> Hmmm, then it's pretty hopeless, I think. The problem is that anything you say about running on a machine with 2G of available memory and a single processor is completely incomparable to running on a machine with 64G of memory available for Lucene and 16 processors.
>>>>>>>>>
>>>>>>>>> There's really no such thing as an "optimum" Lucene index size; it always relates to the characteristics of the underlying hardware.
>>>>>>>>>
>>>>>>>>> I think the best you can do is actually test on various configurations; then at least you can say "on configuration X this is the tipping point".
>>>>>>>>>
>>>>>>>>> Sorry there isn't a better answer that I know of, but...
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> Erick
>>>>>>>>>
>>>>>>>>> On Sat, Jun 11, 2011 at 3:37 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I know Lucene indexes are at their optimum up to a certain size - said to be around several GBs. I haven't found a good discussion of this, but it's my understanding that at some point it's better to split an index into parts (a la sharding) than to continue searching one huge index. I assume this has to do with OS and IO configurations. Can anyone point me to more info on this?
>>>>>>>>>>
>>>>>>>>>> We have a product that is using Lucene for various searches, and at the moment each type of search is using its own Lucene index. We plan on refactoring the way it works and combining all the indexes into one - making the whole system more robust and giving it a smaller memory footprint, among other things.
>>>>>>>>>>
>>>>>>>>>> Assuming the above is true, we are interested in knowing how to do this correctly. Initially all our indexes will be combined into one big index, but if at some index size there is severe performance degradation, we would like to handle that correctly by starting a new FSDirectory index to flush into, or by re-indexing and moving large indexes into their own Lucene index.
>>>>>>>>>>
>>>>>>>>>> Are there any guidelines for measuring or estimating this correctly? What should we be aware of while considering all this? We can't assume anything about the machine running it, so testing won't really tell us much...
>>>>>>>>>>
>>>>>>>>>> Thanks in advance for any input on this,
>>>>>>>>>>
>>>>>>>>>> Itamar.
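The scheduled job Shai suggests might look roughly like this -- a hypothetical sketch only: SearchLatencyMonitor, the query list, and the threshold are made-up names, and the threshold itself is application-specific, as the thread concludes:

    import java.io.IOException;
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class SearchLatencyMonitor {
      private final IndexSearcher searcher;
      private final List<Query> benchmarkQueries; // representative queries
      private final long thresholdMs;             // app-specific cutoff

      public SearchLatencyMonitor(IndexSearcher searcher,
                                  List<Query> benchmarkQueries,
                                  long thresholdMs) {
        this.searcher = searcher;
        this.benchmarkQueries = benchmarkQueries;
        this.thresholdMs = thresholdMs;
      }

      public void start() {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
            new Runnable() {
              public void run() {
                for (Query q : benchmarkQueries) {
                  long startNs = System.nanoTime();
                  try {
                    searcher.search(q, 10);
                  } catch (IOException e) {
                    return; // log and bail in real code
                  }
                  long elapsedMs = (System.nanoTime() - startNs) / 1000000L;
                  if (elapsedMs > thresholdMs) {
                    // Degraded: consider splitting the index (contrib's
                    // IndexSplitter) or directing new docs to a new shard.
                  }
                }
              }
            }, 1, 1, TimeUnit.HOURS);
      }
    }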