Re: MapReduce, Timeouts and Range Batch Size

Jonathan Ellis Mon, 26 Apr 2010 07:16:31 -0700

OPP will be marginally faster.  Maybe 10%?  I don't think anyone has
benchmarked it.


On Fri, Apr 23, 2010 at 10:30 AM, Joost Ouwerkerk <jo...@openplaces.org> wrote:
> In that case I should probably wait for 0.7.  Is there any fundamental
> performance difference in get_range_slices between Random and
> Order-Preserving partitioners?  If so, by what factor?
> joost.
>
> On Fri, Apr 23, 2010 at 10:47 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> You could look into it, but it's not going to be an easy backport
>> since SSTableReader and SSTableScanner got split into two classes in
>> trunk.
>>
>> On Fri, Apr 23, 2010 at 9:39 AM, Joost Ouwerkerk <jo...@openplaces.org>
>> wrote:
>> > Awesome.  In the meantime, I hacked something similar myself.  The
>> > performance difference does not appear to be material.  I think the real
>> > killer is the get_range_slices call.  Relative to that, the cost of
>> > getting the connection appears to be more or less trivial.  What can I
>> > do to alleviate that cost?  CASSANDRA-821 looks interesting -- can I
>> > apply that to 0.6.1?
>> > joost.
>> > On Fri, Apr 23, 2010 at 9:39 AM, Jonathan Ellis <jbel...@gmail.com>
>> > wrote:
>> >>
>> >> Great!  Created https://issues.apache.org/jira/browse/CASSANDRA-1017
>> >> to track this.
>> >>
>> >> On Fri, Apr 23, 2010 at 4:12 AM, Johan Oskarsson <jo...@oskarsson.nu>
>> >> wrote:
>> >> > I have written some code to avoid Thrift reconnection; it just keeps
>> >> > the connection open between get_range_slices calls.
>> >> > I can extract that and put it up but not until early next week.
>> >> >
>> >> > /Johan
>> >> >
>> >> > On 23 apr 2010, at 05.09, Jonathan Ellis wrote:
>> >> >
>> >> >> That would be an easy win, sure.
>> >> >>
>> >> >> On Thu, Apr 22, 2010 at 9:27 PM, Joost Ouwerkerk
>> >> >> <jo...@openplaces.org>
>> >> >> wrote:
>> >> >>> I was getting client timeouts in
>> >> >>> ColumnFamilyRecordReader.maybeInit() when MapReducing.  So I've
>> >> >>> reduced the Range Batch Size to 256 (from 4096) and this seems to
>> >> >>> have fixed my problem, although it has slowed things down a bit --
>> >> >>> presumably because there are 16x more calls to get_range_slices.
>> >> >>> While I was in that code I noticed that a new client was being
>> >> >>> created for each batch get.  By decreasing the batch size, I've
>> >> >>> increased this overhead.  I'm thinking of re-writing
>> >> >>> ColumnFamilyRecordReader to do some connection pooling.  Anyone
>> >> >>> have any thoughts on that?
>> >> >>> joost.
>> >> >>>
>> >> >
>> >> >
>> >
>> >
>
>
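
For readers following the thread, here is a rough sketch of the two points raised in the original message: going from a batch size of 4096 to 256 means 16x as many batch calls for the same rows, and keeping one connection open across those calls avoids paying the connect cost once per batch. Everything in it (RangeBatchSketch, RangeClient, getRangeSlice, CountingFactory) is a hypothetical stand-in written for illustration. It does not use the Cassandra or Thrift APIs and is not the code from CASSANDRA-1017.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch: none of these names are Cassandra or Thrift APIs. */
public class RangeBatchSketch {

    /** Stand-in for a client connection that serves one batch of rows. */
    interface RangeClient extends AutoCloseable {
        List<Integer> getRangeSlice(int startRow, int count);

        @Override
        void close();
    }

    /** Batch calls needed to read totalRows rows: ceil(totalRows / batchSize). */
    static int callsNeeded(int totalRows, int batchSize) {
        return (totalRows + batchSize - 1) / batchSize;
    }

    /** Reads all rows in batches, opening one client and reusing it for every batch. */
    static List<Integer> readAllReusingClient(Supplier<RangeClient> factory,
                                              int totalRows, int batchSize) {
        List<Integer> rows = new ArrayList<>();
        try (RangeClient client = factory.get()) {
            for (int start = 0; start < totalRows; start += batchSize) {
                int count = Math.min(batchSize, totalRows - start);
                rows.addAll(client.getRangeSlice(start, count));
            }
        }
        return rows;
    }

    /** In-memory fake that counts connections opened and batch calls made. */
    static class CountingFactory implements Supplier<RangeClient> {
        int opened = 0;
        int calls = 0;

        @Override
        public RangeClient get() {
            opened++;
            return new RangeClient() {
                @Override
                public List<Integer> getRangeSlice(int startRow, int count) {
                    calls++;
                    List<Integer> batch = new ArrayList<>();
                    for (int i = 0; i < count; i++) {
                        batch.add(startRow + i);
                    }
                    return batch;
                }

                @Override
                public void close() {
                    // Nothing to release in the fake client.
                }
            };
        }
    }

    public static void main(String[] args) {
        CountingFactory factory = new CountingFactory();
        List<Integer> rows = readAllReusingClient(factory, 4096, 256);
        // Prints: 4096 rows, 16 calls, 1 connection(s)
        System.out.println(rows.size() + " rows, " + factory.calls
                + " calls, " + factory.opened + " connection(s)");
    }
}
```

In this toy setup the per-call cost is zero, so it only shows the counts. As noted earlier in the thread, on a real cluster the get_range_slices call itself seemed to dominate, so connection reuse may matter less than the call count.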
