Cool, we are still sticking with 1.3.1, will upgrade to 1.5 ASAP. Thanks for the tips, Tathagata!
On Wed, Sep 23, 2015 at 10:40 AM Tathagata Das <t...@databricks.com> wrote:

> A lot of these imbalances were solved in Spark 1.5. Could you give that a
> spin?
>
> https://issues.apache.org/jira/browse/SPARK-8882
>
> On Tue, Sep 22, 2015 at 12:17 AM, SLiZn Liu <sliznmail...@gmail.com> wrote:
>
>> Hi Spark users,
>>
>> In our Spark Streaming app with Kafka integration on Mesos, we initialized
>> 3 receivers to receive 3 Kafka partitions, but we observed an imbalance in
>> the record receiving rates: with spark.streaming.receiver.maxRate set to
>> 120, one receiver sometimes receives very close to the limit while the
>> other two receive only roughly fifty records per second.
>>
>> This may be caused by a previous receiver failure, where one receiver's
>> receiving rate dropped to 0. We restarted the Spark Streaming app, and the
>> imbalance began. We suspect that the partition consumed by the failing
>> receiver got jammed, and that the other two receivers cannot take up its
>> data.
>>
>> The 3-node cluster tends to run slowly, and nearly all tasks are
>> registered at the node with the previous receiver failure (I used union to
>> combine the 3 receivers' DStreams, so I expected the combined DStream to
>> be well distributed across all nodes). Batches cannot be guaranteed to
>> finish within a single batch interval, stages pile up, and the digested
>> log shows the following:
>>
>> ...
>> 5728.399: [GC (Allocation Failure) [PSYoungGen: 6954678K->17088K(6961152K)]
>> 7074614K->138108K(20942336K), 0.0203877 secs] [Times: user=0.20 sys=0.00,
>> real=0.02 secs]
>>
>> ...
>> 5/09/22 13:33:35 ERROR TaskSchedulerImpl: Ignoring update with state
>> FINISHED for TID 77219 because its task set is gone (this is likely the
>> result of receiving duplicate task finished status updates)
>>
>> ...
>>
>> These two types of log lines were printed during the execution of some
>> (but not all) stages.
>>
>> My configurations:
>> # of cores on each node: 64
>> # of nodes: 3
>> batch interval is set to 10 seconds
>>
>> spark.streaming.receiver.maxRate   120
>> spark.streaming.blockInterval      160  // set so that 10 seconds divided by this
>>                                         // interval approximately equals the number
>>                                         // of cores per node (64), to max out all
>>                                         // the nodes: 10s * 1000 / 64
>> spark.storage.memoryFraction       0.1  // this one doesn't seem to work, since the
>>                                         // young gen / old gen ratio is nearly 0.3
>>                                         // instead of 0.1
>>
>> Does anyone have an idea? Thanks for your patience.
>>
>> BR,
>> Todd Leo
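[Editor's note: a quick back-of-the-envelope on the rates reported in the thread above. This is plain arithmetic, not Spark API code; the per-receiver figures (one near the 120 records/sec cap, two at roughly 50) are taken from the report, and the totals are only illustrative.]

```python
# Figures from the thread: maxRate caps each receiver, batch interval is 10s.
max_rate = 120          # spark.streaming.receiver.maxRate, records/sec per receiver
batch_seconds = 10
num_receivers = 3       # one receiver per Kafka partition

# Observed: one receiver near the cap, the other two at ~50 records/sec.
observed_rates = [120, 50, 50]
observed_total = sum(observed_rates)      # 220 records/sec across the app
capacity_total = max_rate * num_receivers # 360 records/sec if balanced at the cap

# Records per batch, observed vs. the configured ceiling:
observed_per_batch = observed_total * batch_seconds   # 2200 records/batch
capacity_per_batch = capacity_total * batch_seconds   # 3600 records/batch
print(observed_per_batch, capacity_per_batch)
```

On these numbers the app ingests well under its configured ceiling, which is consistent with the suspicion that one partition is jammed rather than the cap being the bottleneck.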
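[Editor's note: the blockInterval reasoning in the configuration above can be checked with simple arithmetic. A sketch under the thread's stated assumptions (10s batches, 64 cores per node, 3 receivers); whether ~62 blocks per receiver actually maps onto tasks depends on the job's stages, so treat the totals as rough.]

```python
# Each receiver cuts its input into blocks every blockInterval; each block
# becomes one partition (and hence roughly one task) of the batch's RDD.
batch_ms = 10 * 1000    # 10-second batch interval
cores_per_node = 64
num_receivers = 3

# Interval that would yield exactly one block per core per receiver:
ideal_interval_ms = batch_ms / cores_per_node   # 156.25 -> rounded to 160 in the thread
print(ideal_interval_ms)

# With the configured 160 ms interval:
block_interval_ms = 160
blocks_per_receiver = batch_ms / block_interval_ms        # 62.5 blocks/batch/receiver
total_blocks = blocks_per_receiver * num_receivers        # 187.5 across the union
print(blocks_per_receiver, total_blocks)
```

So the unioned DStream yields on the order of 187 partitions per batch against 192 total cores (3 nodes x 64), which matches the stated goal of one block per core, provided the blocks are actually spread across nodes.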