Paulo,
If you have large data sizes, the vnodes-with-Hadoop issue is moot: you will 
get that many splits with or without vnodes.  The problem comes when you don't 
have much data, because Hadoop creates at least one split per token range, so 
with the default num_tokens of 256 you get 256 times as many tasks as your job 
actually needs, and all those extra splits slow everything down to a crawl.

So for large data sets there is no issue.  For small data sets you can still 
run jobs; they will just be slower than they would be without vnodes.
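
In case it helps, here is a rough sketch of the grouping idea from the earlier 
thread you mention: fold each node's vnode ranges into a small, fixed number of 
combined splits, so a small job gets a handful of map tasks per host instead of 
one per token range.  The class and field names below are made up for 
illustration; this is not the actual Cassandra/DSE Hadoop integration code, 
just the packing logic.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: combine each host's vnode token ranges into a few
// splits so small jobs don't spawn one map task per token range.
public class CombineVnodeSplits {

    static class TokenRange {               // one vnode range, primary replica only
        final long start, end;
        final String host;
        TokenRange(long start, long end, String host) {
            this.start = start; this.end = end; this.host = host;
        }
    }

    static class CombinedSplit {            // several ranges, all local to one host
        final String host;
        final List<TokenRange> ranges;
        CombinedSplit(String host, List<TokenRange> ranges) {
            this.host = host; this.ranges = ranges;
        }
    }

    // Group ranges by host, then pack them into at most maxSplitsPerHost splits.
    static List<CombinedSplit> combine(List<TokenRange> ranges, int maxSplitsPerHost) {
        Map<String, List<TokenRange>> byHost = new HashMap<String, List<TokenRange>>();
        for (TokenRange r : ranges) {
            List<TokenRange> list = byHost.get(r.host);
            if (list == null) { list = new ArrayList<TokenRange>(); byHost.put(r.host, list); }
            list.add(r);
        }
        List<CombinedSplit> splits = new ArrayList<CombinedSplit>();
        for (Map.Entry<String, List<TokenRange>> e : byHost.entrySet()) {
            List<TokenRange> hostRanges = e.getValue();
            // ceiling division: ranges per combined split for this host
            int perSplit = (hostRanges.size() + maxSplitsPerHost - 1) / maxSplitsPerHost;
            for (int i = 0; i < hostRanges.size(); i += perSplit) {
                int end = Math.min(i + perSplit, hostRanges.size());
                splits.add(new CombinedSplit(e.getKey(),
                        new ArrayList<TokenRange>(hostRanges.subList(i, end))));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // 3 hosts x 256 vnodes = 768 raw ranges -> at most 3 x 4 = 12 combined splits.
        List<TokenRange> ranges = new ArrayList<TokenRange>();
        for (int h = 1; h <= 3; h++) {
            for (int v = 0; v < 256; v++) {
                long start = (long) v * 1000;
                ranges.add(new TokenRange(start, start + 999, "10.0.0." + h));
            }
        }
        List<CombinedSplit> splits = combine(ranges, 4);
        System.out.println(ranges.size() + " ranges -> " + splits.size() + " splits");
    }
}

With num_tokens=256 on a three-node cluster that takes you from 768 potential 
map tasks down to about a dozen, which is the kind of reduction the combining 
InputFormat suggestion was after.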

-Jeremiah

On Oct 17, 2013, at 3:49 PM, Paulo Motta <pauloricard...@gmail.com> wrote:

> Hello,
> 
> According to the DSE 3.1 documentation [1], "DataStax recommends using virtual 
> nodes only on data centers running purely Cassandra workloads. You should 
> disable virtual nodes on data centers running either Hadoop or Solr workloads 
> by setting num_tokens to 1."
> 
> A thread on this mailing list earlier this year [2] suggested a workaround 
> for the problem of having a minimum of one map task per token range 
> (infeasible with vnodes): implement a custom Hadoop InputFormat whose splits 
> combine many token ranges from a single node, reducing the overhead of 
> creating too many tasks per node. 
> 
> Is there a JIRA ticket for this issue yet, or any work in progress to 
> support vnodes for Hadoop workloads? Or does the recommendation remain to 
> avoid vnodes for analytics workloads (Hadoop, Solr)?
> 
> Thanks, 
> 
> -- 
> Paulo
> 
> [1] 
> http://www.datastax.com/docs/datastax_enterprise3.1/deploy/configuring_replication
> [2] 
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201302.mbox/%3CCAJV_UYdqYmfStn5OetWrozQqbi+-yP3X-Ew9xtW=QY=2zgy...@mail.gmtokenail.com%3E
