Hi, I had a similar problem with Cassandra 0.8.x; it showed up when Cassandra was configured with rpc_address: 0.0.0.0 and the Hadoop job was started from outside the Cassandra cluster. As far as I remember, with that setting the nodes advertise 0.0.0.0 as their rpc endpoint in describe_ring, so a client outside the cluster cannot reach the nodes that own each split. With version 1.0.x the problem is gone.
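A quick way to spot that symptom (a minimal sketch; has_wildcard_rpc is a hypothetical helper, and ring is assumed to be the list returned by client.describe_ring(keyspace)):

    def has_wildcard_rpc(ring):
        # with rpc_address: 0.0.0.0 the nodes advertise 0.0.0.0 as their
        # rpc endpoint, which is unreachable from outside the cluster
        return any('0.0.0.0' in (tr.rpc_endpoints or []) for tr in ring)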
You can debug the splits with Thrift. This is a copy-and-pasted part of my splits-testing Python utility (client is an already-connected Thrift client, and argv[1] is the keyspace name):

    print "describe_ring"
    res = client.describe_ring(argv[1])
    for t in res:
        print "%s - %s [%s] [%s]" % (t.start_token, t.end_token,
                                     ",".join(t.endpoints), ",".join(t.rpc_endpoints))
    for r in res:
        res2 = client.describe_splits('PageData', r.start_token, r.end_token, 24*1024)

It asks Cassandra for the list of nodes with their key ranges, then asks each node for splits. You should adjust the 24*1024 split size to your data. A self-contained version of this fragment is sketched after the quoted message below.

Regards,
Patrik

On Tue, May 1, 2012 at 5:58 PM, Filippo Diotalevi <fili...@ntoklo.com> wrote:
> Hi,
> I'm having problems in my Cassandra/Hadoop (1.0.8 + cdh3u3) cluster related
> to how Cassandra splits the data to be processed by Hadoop.
>
> I'm currently testing a map-reduce job, starting from a CF of roughly 1500
> rows, with
>
> cassandra.input.split.size 10
> cassandra.range.batch.size 1
>
> but what I consistently see is that, while most of the tasks have 1-20 rows
> assigned each, one of them is assigned 400+ rows, which gives me all sorts
> of problems in terms of timeouts and memory consumption (not to mention
> seeing the mapper progress bar going to 4000% and more).
>
> Do you have any suggestion to solve/troubleshoot this issue?
>
> --
> Filippo Diotalevi
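P.S. For completeness, a self-contained version of the fragment above, as a minimal sketch assuming the Thrift-generated Python bindings for Cassandra 1.0.x; the host, port, and column family below are placeholders to adjust:

    import sys
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra  # Thrift-generated bindings

    host, port = 'localhost', 9160      # placeholder rpc host/port
    keyspace = sys.argv[1]              # keyspace name, as in the fragment above
    column_family = 'PageData'          # placeholder column family
    split_size = 24 * 1024              # target keys per split; tune for your data

    socket = TSocket.TSocket(host, port)
    transport = TTransport.TFramedTransport(socket)
    client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()
    client.set_keyspace(keyspace)       # describe_splits uses the current keyspace

    for r in client.describe_ring(keyspace):
        print "%s - %s [%s] [%s]" % (r.start_token, r.end_token,
                                     ",".join(r.endpoints), ",".join(r.rpc_endpoints))
        splits = client.describe_splits(column_family, r.start_token,
                                        r.end_token, split_size)
        print "  %d split boundaries" % len(splits)

    transport.close()

Note that the reported boundaries are based on each node's sampled key-count estimates, so on a small column family the resulting splits can be noticeably uneven.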