Re: MapReduce, Timeouts and Range Batch Size

Joost Ouwerkerk Fri, 23 Apr 2010 08:31:05 -0700

In that case I should probably wait for 0.7.  Is there any fundamental
performance difference in get_range_slices between the Random and
Order-Preserving partitioners?  If so, by what factor?
joost.


On Fri, Apr 23, 2010 at 10:47 AM, Jonathan Ellis <jbel...@gmail.com> wrote:

> You could look into it, but it's not going to be an easy backport
> since SSTableReader and SSTableScanner got split into two classes in
> trunk.
>
> On Fri, Apr 23, 2010 at 9:39 AM, Joost Ouwerkerk <jo...@openplaces.org>
> wrote:
> > Awesome.  In the meantime, I hacked something similar myself.  The
> > performance difference does not appear to be material.  I think the real
> > killer is the get_range_slices call.  Relative to that, the cost of
> > getting the connection appears to be more or less trivial.  What can I do
> > to alleviate that cost?  CASSANDRA-821 looks interesting -- can I apply
> > that to 0.6.1?
> > joost.
> > On Fri, Apr 23, 2010 at 9:39 AM, Jonathan Ellis <jbel...@gmail.com>
> > wrote:
> >>
> >> Great!  Created https://issues.apache.org/jira/browse/CASSANDRA-1017
> >> to track this.
> >>
> >> On Fri, Apr 23, 2010 at 4:12 AM, Johan Oskarsson <jo...@oskarsson.nu>
> >> wrote:
> >> > I have written some code to avoid thrift reconnection, it just keeps
> >> > the connection open between get_range_slices calls.
> >> > I can extract that and put it up but not until early next week.
> >> >
> >> > /Johan
> >> >
> >> > On 23 apr 2010, at 05.09, Jonathan Ellis wrote:
> >> >
> >> >> That would be an easy win, sure.
> >> >>
> >> >> On Thu, Apr 22, 2010 at 9:27 PM, Joost Ouwerkerk <jo...@openplaces.org>
> >> >> wrote:
> >> >>> I was getting client timeouts in ColumnFamilyRecordReader.maybeInit()
> >> >>> when MapReducing.  So I've reduced the Range Batch Size to 256 (from
> >> >>> 4096) and this seems to have fixed my problem, although it has slowed
> >> >>> things down a bit -- presumably because there are 16x more calls to
> >> >>> get_range_slices.  While I was in that code I noticed that a new
> >> >>> client was being created for each batch get.  By decreasing the batch
> >> >>> size, I've increased this overhead.  I'm thinking of re-writing
> >> >>> ColumnFamilyRecordReader to do some connection pooling.  Anyone have
> >> >>> any thoughts on that?
> >> >>> joost.
> >> >>>
> >> >
> >> >
> >
> >
>
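
The connection reuse discussed in the thread above can be sketched roughly
as follows.  This is only an illustration under stated assumptions, not the
code from CASSANDRA-1017 or from ColumnFamilyRecordReader: RangeClient and
ReusingRangeReader are hypothetical names, and the real Thrift client is
replaced by a small stand-in interface.  The idea is to open the client once
and keep it across batches, so that a smaller batch size (256 instead of
4096, i.e. 16x more calls) no longer means 16x more connections.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a Thrift client; not the real Cassandra API.
interface RangeClient extends AutoCloseable {
    List<String> getRangeSlices(String startKey, int batchSize);

    @Override
    void close();
}

// Hypothetical reader that opens one connection lazily and reuses it
// for every batch instead of reconnecting per batch.
final class ReusingRangeReader implements AutoCloseable {
    private final Supplier<RangeClient> connector;
    private final int batchSize;
    private RangeClient client;        // null until the first batch is read
    private int connectionsOpened = 0; // kept only to make the reuse visible

    ReusingRangeReader(Supplier<RangeClient> connector, int batchSize) {
        this.connector = connector;
        this.batchSize = batchSize;
    }

    // Number of batch calls needed to cover totalRows (ceiling division).
    static int callsNeeded(int totalRows, int batchSize) {
        return (totalRows + batchSize - 1) / batchSize;
    }

    List<String> nextBatch(String startKey) {
        if (client == null) {
            // Connect only once; later batches reuse the same client.
            client = connector.get();
            connectionsOpened++;
        }
        return client.getRangeSlices(startKey, batchSize);
    }

    int connectionsOpened() {
        return connectionsOpened;
    }

    @Override
    public void close() {
        if (client != null) {
            client.close();
            client = null;
        }
    }
}
```

With this shape, reading 4096 rows at a batch size of 256 still takes
callsNeeded(4096, 256) = 16 calls, but only one connection is opened.  As
noted in the thread, the get_range_slices call itself may remain the
dominant cost.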
