Thanks Eric for the appreciation :)
Default split size is 64K rows. ColumnFamilyInputFormat first collects all
tokens and creates a split for each. If you have 256 vnodes per node,
it creates 256 splits per node even if you have no data at all. The
current split size only takes effect when a single vnode's range holds
more rows than the split size.
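To make the arithmetic concrete, here is a toy model of the split
calculation (illustrative only, not the real ColumnFamilyInputFormat
code; the row counts are made up):

// One split per token range, so with num_tokens: 256 you get
// nodes * 256 splits before any data exists. The 64K-row split
// size only subdivides ranges that exceed it.
public class SplitCount {
    static final long SPLIT_SIZE = 64 * 1024; // default split size, in rows

    static long countSplits(int nodes, int vnodesPerNode, long rowsPerRange) {
        long ranges = (long) nodes * vnodesPerNode;
        long splitsPerRange = Math.max(1, (rowsPerRange + SPLIT_SIZE - 1) / SPLIT_SIZE);
        return ranges * splitsPerRange;
    }

    public static void main(String[] args) {
        System.out.println(countSplits(4, 256, 0));       // 1024: empty cluster, still 4 * 256 splits
        System.out.println(countSplits(4, 256, 100000));  // 2048: split size finally kicks in
    }
}

(The 64K default is the cassandra.input.split.size setting, i.e. what
ConfigHelper.setInputSplitSize() overrides; as above, it cannot get you
below one split per token range.)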
Split size does not have to equal block size.
http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
An abstract InputFormat that returns CombineFileSplit's in
InputFormat.getSplits(JobConf, int) method. Splits are constructed
from the files under the input paths.
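For plain HDFS files the usage is a small subclass; a minimal sketch
(MyLineReader is a hypothetical RecordReader wrapper you would supply
yourself):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Packs co-located blocks into one split, so the framework runs one
// map task per combined split instead of one per block.
public class CombinedTextInputFormat extends CombineFileInputFormat<LongWritable, Text> {
    public CombinedTextInputFormat() {
        setMaxSplitSize(128L * 1024 * 1024); // cap each combined split at 128 MB
    }

    @SuppressWarnings({"unchecked", "rawtypes"})
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf conf, Reporter reporter) throws IOException {
        // CombineFileRecordReader runs MyLineReader (hypothetical) over
        // each chunk of the combined split in turn.
        return new CombineFileRecordReader<LongWritable, Text>(
                conf, (CombineFileSplit) split, reporter, (Class) MyLineReader.class);
    }
}

There is no equivalent for ColumnFamilyInputFormat, which is the gap
being discussed here.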
Wouldn't you have more than 256 splits anyway, given a normal amount of data?
(Default split size is 64k rows.)
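With any real amount of data the vnode floor stops mattering quickly:
1 billion rows / 64K rows per split is roughly 15,000 splits, versus
the 1,024 you would get from 4 nodes * 256 vnodes on an empty cluster.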
On Fri, Feb 15, 2013 at 7:01 PM, Edward Capriolo wrote:
> Seems like the Hadoop InputFormat should combine the splits that are
> on the same node into the same map task, like Hadoop's
> CombineFileInputFormat can.
On Sat, Feb 16, 2013 at 9:13 AM, Edward Capriolo wrote:
No one had ever tried vnodes with Hadoop until the OP did, or they
would have noticed this. No one extensively used it with secondary
indexes either, judging from the last ticket I mentioned.

My mistake, they are not a default.

I do think vnodes are awesome; it's great that c* has the longer
release cycle.
On Fri, Feb 15, 2013 at 7:01 PM, Edward Capriolo wrote:
Seems like the Hadoop InputFormat should combine the splits that are
on the same node into the same map task, like Hadoop's
CombineFileInputFormat can. I am not sure who recommends vnodes as the
default, because this is now the second problem (that I know of) of
this class where vnodes add extra overhead.
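Something along these lines, for instance (just a sketch: MultiRangeSplit
is invented and would need a matching RecordReader that scans its ranges
in turn; only InputSplit.getLocations() is real Hadoop API):

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.mapreduce.InputSplit;

// Collapse co-located splits so one map task covers all of a node's
// token ranges: roughly one split per node instead of nodes * 256.
public class SplitCombiner {
    public static List<InputSplit> combineByNode(List<InputSplit> raw)
            throws IOException, InterruptedException {
        Map<String, List<InputSplit>> byNode = new HashMap<String, List<InputSplit>>();
        for (InputSplit split : raw) {
            String node = split.getLocations()[0]; // group by first replica
            List<InputSplit> group = byNode.get(node);
            if (group == null) {
                group = new ArrayList<InputSplit>();
                byNode.put(node, group);
            }
            group.add(split);
        }
        List<InputSplit> combined = new ArrayList<InputSplit>();
        for (List<InputSplit> group : byNode.values())
            combined.add(new MultiRangeSplit(group)); // hypothetical wrapper split
        return combined;
    }
}

You would feed it the list that comes out of
ColumnFamilyInputFormat.getSplits().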
Hi All,
I have just started to use virtual nodes. I set the number of vnodes to 256
(num_tokens: 256 in cassandra.yaml), as recommended.
The problem I have is that when I run a MapReduce job, it creates nodes * 256
splits, and therefore nodes * 256 mappers. This hurts performance, since
the range queries have a lot of overhead.
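(With a 10-node cluster, for example, that is 10 * 256 = 2,560 mappers,
each issuing its own range query.)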
Any suggestions?