Hadoop doesn't make any assumptions about how input source data is
distributed.  It can't 'know' that the data for the first 30 splits emitted
by the InputFormat are all stored on the same cassandra node.

The new JIRA issue with the patch is CASSANDRA-1096
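
For anyone following along, the change is essentially a one-line
Collections.shuffle() before getSplits() returns, as described in the quoted
message below.  A rough sketch of the idea, written here as a wrapper class
(the ShuffledCassandraInputFormat name is just for illustration; the actual
patch adds the shuffle inside getSplits() itself):

    import java.io.IOException;
    import java.util.Collections;
    import java.util.List;

    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;

    // Hypothetical wrapper: randomize the order of the splits the underlying
    // input format produces before handing them to Hadoop.
    public class ShuffledCassandraInputFormat extends CassandraInputFormat
    {
        @Override
        public List<InputSplit> getSplits(JobContext context)
                throws IOException, InterruptedException
        {
            // Splits come back ordered by token range, so the first N splits
            // tend to live on the same one or two nodes.  Shuffling spreads
            // the first wave of map tasks across the whole ring.
            List<InputSplit> splits = super.getSplits(context);
            Collections.shuffle(splits);
            return splits;
        }
    }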

Meanwhile, I'm still getting TimedOutException errors when mapping this
30-million-row table, even when retrieving no data at all.  It looks like it
is related to disk activity on "hot" nodes (when the same cassandra node has
to handle many concurrent requests for adjacent range slices).  Using the 0.7
trunk branch doesn't appear to alleviate it.  CPU load is at about 25% when
this happens.  Is there some kind of synchronization that might prevent the
same file from being scanned by multiple threads?

joost.

On Sat, May 15, 2010 at 10:55 AM, Jonathan Ellis <jbel...@gmail.com> wrote:

> Oh, very interesting.  I assumed Hadoop would be smart enough to
> load-balance the jobs it sends out.  Guess not.
>
> Can you submit a patch?
>
> On Wed, May 12, 2010 at 12:32 PM, Joost Ouwerkerk <jo...@openplaces.org>
> wrote:
> > I've been trying to improve the time it takes to map 30 million rows using a
> > hadoop / cassandra cluster with 30 nodes.  I discovered that since
> > CassandraInputFormat returns an ordered list of splits, when there are many
> > splits (e.g. hundreds or more) the load on cassandra is horribly unbalanced.
> > e.g. if I have 30 tasks processing 600 splits, then the first 30 splits are
> > all located on the same one or two nodes.
> > I added Collections.shuffle(splits) before returning the splits in
> > getSplits().  As a result, the load is much better distributed, throughput
> > was increased (about 3X in my case) and TimedOutExceptions were all but
> > eliminated.
> > Joost.
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>