Paulo,

If your data sizes are large, the vnodes-with-Hadoop issue is moot: you will get that many splits with or without vnodes. The problem comes when you don't have a lot of data, because all the extra splits slow everything down to a crawl; with num_tokens at 256 there are 256 times as many tasks created as your job actually needs.
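To put rough numbers on it (these are made-up figures, purely to illustrate the scaling, not measurements from any particular cluster):

    // Illustrative only: assumes one split per token range and a data set
    // small enough that the split size never subdivides a range further.
    public class VnodeSplitMath {
        public static void main(String[] args) {
            int nodes = 6;
            int numTokens = 256;       // num_tokens in cassandra.yaml (vnodes on)
            long totalRows = 100_000;  // a "small" data set

            int splitsWithVnodes = nodes * numTokens;  // ~1536 map tasks
            int splitsWithoutVnodes = nodes;           // ~6 map tasks

            System.out.printf("rows per task with vnodes:    %d%n", totalRows / splitsWithVnodes);
            System.out.printf("rows per task without vnodes: %d%n", totalRows / splitsWithoutVnodes);
        }
    }

Each task carries scheduling and startup overhead, so hundreds of tasks that each scan only a handful of rows is where the crawl comes from.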
So for large data sets, there is no issue. For small data sets, you can still run jobs; they will just be slower than if you didn't have vnodes.

-Jeremiah

On Oct 17, 2013, at 3:49 PM, Paulo Motta <pauloricard...@gmail.com> wrote:

> Hello,
>
> According to the DSE 3.1 documentation [1], "DataStax recommends using virtual
> nodes only on data centers running purely Cassandra workloads. You should
> disable virtual nodes on data centers running either Hadoop or Solr workloads
> by setting num_tokens to 1."
>
> There was a thread on this mailing list earlier this year [2] where a
> workaround was suggested for the problem of having a minimum of one map task
> per token (unfeasible with vnodes). The suggestion was to implement a new
> Hadoop InputSplitFormat that could combine many tokens from a single node,
> thus reducing the overhead of having too many tasks per node.
>
> Is there a JIRA ticket for this issue yet, or is something being worked on to
> support vnodes for Hadoop workloads, or does the recommendation remain to
> avoid vnodes for analytics workloads (Hadoop, Solr)?
>
> Thanks,
>
> --
> Paulo
>
> [1] http://www.datastax.com/docs/datastax_enterprise3.1/deploy/configuring_replication
> [2] http://mail-archives.apache.org/mod_mbox/cassandra-user/201302.mbox/%3CCAJV_UYdqYmfStn5OetWrozQqbi+-yP3X-Ew9xtW=QY=2zgy...@mail.gmtokenail.com%3E
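For what it's worth, the split-combining workaround quoted above ([2]) boils down to grouping the splits the stock input format produces by the node that owns them. A very rough sketch of that grouping step (my own illustration, not code from DSE or from that thread; a real implementation would also need a record reader that iterates over each group of underlying splits):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.hadoop.mapreduce.InputSplit;

    public class SplitsByNode {
        // Group per-token-range splits by their first preferred location so a
        // wrapper InputFormat could hand each mapper all of one node's ranges.
        public static Map<String, List<InputSplit>> groupByNode(List<InputSplit> splits)
                throws IOException, InterruptedException {
            Map<String, List<InputSplit>> byNode = new HashMap<String, List<InputSplit>>();
            for (InputSplit split : splits) {
                String[] locations = split.getLocations();
                String node = locations.length > 0 ? locations[0] : "unknown";
                if (!byNode.containsKey(node)) {
                    byNode.put(node, new ArrayList<InputSplit>());
                }
                byNode.get(node).add(split);
            }
            return byNode;  // one entry (ideally one map task) per node, not per token range
        }
    }

With 256 vnodes per node, grouping like this collapses thousands of tiny splits back down to roughly one per node, which is essentially the task count you get today by setting num_tokens to 1.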