-dev, +user

A decent guess: does your 'save' function collect data back to the driver? And are you running this from a machine that's not in your Spark cluster? If so, then in client mode you're shipping every batch back to a machine farther from the data than in cluster mode, and that could explain the bottleneck.
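To illustrate the difference I mean (this is a guess at what your 'save' might do, not a quote of it; the function names and RDD element type are assumptions):

    import org.apache.spark.rdd.RDD
    import com.hadoop.compression.lzo.LzoCodec

    // Driver-side write: collect() pulls every partition over the
    // network to wherever the driver process runs, then writes from
    // that single machine. The driver's location matters a lot here.
    def saveViaDriver(rdd: RDD[String]): Unit = {
      val local = rdd.collect() // ships the whole batch to the driver
      // ... write 'local' out from the driver machine
    }

    // Distributed write: each executor writes its own partitions to
    // S3 directly; the driver only coordinates, so where it runs
    // barely matters for throughput.
    def saveDistributed(rdd: RDD[String]): Unit =
      rdd.saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])

For reference, the deploy mode just controls where the driver runs. With spark-submit:

    spark-submit --deploy-mode client  ...  # driver = the machine you submit from
    spark-submit --deploy-mode cluster ...  # driver = a worker node inside the cluster

If 'save' (or anything else, like a large collect or take) funnels data through the driver, that network distance would show up directly as throughput.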
On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji <eshi...@gmail.com> wrote:
> Hi,
>
> I have a very, very simple streaming job. When I deploy this on the exact
> same cluster, with the exact same parameters, I see a big (40%) performance
> difference between "client" and "cluster" deployment mode. This seems a bit
> surprising... Is this expected?
>
> The streaming job is:
>
> val msgStream = kafkaStream
>   .map { case (k, v) => v }
>   .map(DatatypeConverter.printBase64Binary)
>   .foreachRDD(save)
>
> def save(rdd: RDD[String]): Unit =
>   rdd.saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
>
> I tried several times, but the job deployed in "client" mode can only
> write at 60% of the throughput of the job deployed in "cluster" mode, and
> this happens consistently. I'm logging at INFO level, but my application
> code doesn't log anything, so it's only Spark logs. The logs I see in
> "client" mode don't seem like a crazy amount.
>
> The setup is:
>
> spark-ec2 [...] \
>   --copy-aws-credentials \
>   --instance-type=m3.2xlarge \
>   -s 2 launch test_cluster
>
> And all the deployment was done from the master machine.