Hi,

I have a very, very simple streaming job. When I deploy it on the exact
same cluster, with the exact same parameters, I see a big (40%) performance
difference between the "client" and "cluster" deployment modes. This seems a
bit surprising. Is this expected?

The streaming job is:

    val msgStream = kafkaStream
      .map { case (k, v) => v }                  // keep only the message value (bytes)
      .map(DatatypeConverter.printBase64Binary)  // Base64-encode each payload

    // save: write each micro-batch as LZO-compressed text to S3
    msgStream.foreachRDD { rdd =>
      rdd.saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
    }

I tried this several times: the job deployed in "client" mode can only
write at about 60% of the throughput of the job deployed in "cluster" mode,
and this happens consistently. I'm logging at INFO level, but my application
code doesn't log anything, so the output is only Spark's own logs. The volume
of logs I see in "client" mode doesn't seem excessive.

The setup is:
spark-ec2 [...] \
  --copy-aws-credentials \
  --instance-type=m3.2xlarge \
  -s 2 launch test_cluster

All deployments were done from the master machine.
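
For reference, the two runs were submitted roughly as in the sketch below;
the master URL, main class, and jar name are placeholders (not my actual
values), and the only intended difference between the two invocations is
--deploy-mode:

# "client" mode: the driver runs inside this spark-submit process on the master machine
spark-submit \
  --master spark://<master>:7077 \
  --deploy-mode client \
  --class com.example.StreamingJob \
  streaming-job.jar

# "cluster" mode: the driver is launched on one of the worker nodes instead
spark-submit \
  --master spark://<master>:7077 \
  --deploy-mode cluster \
  --class com.example.StreamingJob \
  streaming-job.jar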
