Can you also try increasing the Akka frame size?

.set("spark.akka.frameSize","50") // Set it to a higher number
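For example, a minimal sketch of where that setting would slot into the conf quoted below (the class name and the value of 50 are just placeholders; spark.akka.frameSize is given in MB, and the Spark 1.2 default is 10):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Sketch only: class name and the frame-size value are placeholders.
    public class FrameSizeSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("SparkSync Application")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .set("spark.akka.frameSize", "50")   // raise if serialized tasks/results exceed the default 10 MB frame
                .set("spark.akka.timeout", "600");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... build the RDD and save it to S3 ...
            sc.stop();
        }
    }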
Thanks
Best Regards

On Sat, Jan 24, 2015 at 3:58 AM, Darin McBeath <ddmcbe...@yahoo.com.invalid> wrote:

> Thanks for the ideas Sven.
>
> I'm using a stand-alone cluster (Spark 1.2).
>
> FWIW, I was able to get this running (just now). This is the first time it's worked in probably my last 10 attempts.
>
> In addition to limiting the executors to only 50% of the cluster, I added/changed the settings below. Maybe I just got lucky (although I think not). It would be good if someone could weigh in and agree that these changes are sensible. I'm also hoping the support for placement groups (targeted for 1.3 in the ec2 scripts) will help the situation. All in all, it takes about 45 minutes to write a 1 TB file back to S3 (as 1024 partitions).
>
> SparkConf conf = new SparkConf()
>     .setAppName("SparkSync Application")
>     .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>     .set("spark.rdd.compress","true")
>     .set("spark.core.connection.ack.wait.timeout","600")
>     .set("spark.akka.timeout","600")                          // Increased from 300
>     .set("spark.akka.threads","16")                           // Added; raises the default from 4 to 16
>     .set("spark.task.maxFailures","64")                       // Didn't really matter as I had no failures in this run
>     .set("spark.storage.blockManagerSlaveTimeoutMs","300000");
>
> ________________________________
> From: Sven Krasser <kras...@gmail.com>
> To: Darin McBeath <ddmcbe...@yahoo.com>
> Cc: User <user@spark.apache.org>
> Sent: Friday, January 23, 2015 5:12 PM
> Subject: Re: Problems saving a large RDD (1 TB) to S3 as a sequence file
>
> Hey Darin,
>
> Are you running this over EMR or as a standalone cluster? I've had occasional success in similar cases by digging through all executor logs and trying to find exceptions that are not caused by the application shutdown (but the logs remain my main pain point with Spark).
>
> That aside, another explanation could be S3 throttling you due to volume (and hence causing write requests to fail). You can try to split your file into multiple pieces and store those as S3 objects with different prefixes to make sure they end up in different partitions in S3. See here for details:
> http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
> If that works, that'll narrow the cause down.
>
> Best,
> -Sven
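As an illustration of the prefix-splitting idea above, here is a minimal sketch using the Java pair-RDD API already in use in this thread. The class name, bucket, slice count, and the hash-based split are placeholders of my own, not something from the thread:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.spark.api.java.JavaPairRDD;

    // Sketch: write the RDD as several sequence files, each under its own S3 key
    // prefix, so the resulting objects spread across different S3 partitions.
    public class PrefixedS3Writer {
        public static void write(JavaPairRDD<Text, Text> rdd, int slices) {
            rdd.cache(); // in practice, persist (or slice by partition ranges) to avoid recomputing the RDD per pass
            for (int i = 0; i < slices; i++) {
                final int slice = i;
                rdd.filter(t -> (t._1().hashCode() & Integer.MAX_VALUE) % slices == slice)
                   .saveAsHadoopFile("s3n://my-bucket/" + slice + "-output",
                           Text.class, Text.class, SequenceFileOutputFormat.class);
            }
        }
    }

Each pass writes one slice of the data under its own prefix, so the S3 keys no longer share a single common prefix.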
> On Fri, Jan 23, 2015 at 12:04 PM, Darin McBeath <ddmcbe...@yahoo.com.invalid> wrote:
>
> > I've tried various ideas, but I'm really just shooting in the dark.
> >
> > I have an 8-node cluster of r3.8xlarge machines. The RDD (with 1024 partitions) I'm trying to save off to S3 is approximately 1 TB in size (with the partitions pretty evenly distributed in size).
> >
> > I just tried a test to dial back the number of executors on my cluster from using the entire cluster (256 cores) down to 128. Things seemed to get a bit farther (maybe) before the wheels started spinning off again. But the job always fails when all I'm trying to do is save the 1 TB file to S3.
> >
> > I see the following in my master log file.
> >
> > 15/01/23 19:01:54 WARN master.Master: Removing worker-20150123172316 because we got no heartbeat in 60 seconds
> > 15/01/23 19:01:54 INFO master.Master: Removing worker worker-20150123172316 on
> > 15/01/23 19:01:54 INFO master.Master: Telling app of lost executor: 3
> >
> > For the stage that eventually fails, I see the following summary information.
> >
> > Summary Metrics for 729 Completed Tasks (Min / 25th percentile / Median / 75th percentile / Max)
> > Duration: 2.5 min / 4.8 min / 5.5 min / 6.3 min / 9.2 min
> > GC Time: 0 ms / 0.3 s / 0.4 s / 0.5 s / 5 s
> > Shuffle Read (Remote): 309.3 MB / 321.7 MB / 325.4 MB / 329.6 MB / 350.6 MB
> >
> > So the max GC was only 5 s for 729 completed tasks. This sounds reasonable. People tend to point to GC as the reason one loses executors, but that does not appear to be the case here.
> >
> > Here is a typical snapshot of some completed tasks. You can see that they tend to complete in approximately 6 minutes, so it takes about 6 minutes to write one partition to S3 (a partition being roughly 1 GB).
> >
> > (columns: Index, Task ID, Attempt, Status, Locality Level, Executor ID / Host, Launch Time, Duration, GC Time, Shuffle Read)
> > 65 23619 0 SUCCESS ANY 5 / 2015/01/23 18:30:32 5.8 min 0.9 s 344.6 MB
> > 59 23613 0 SUCCESS ANY 7 / 2015/01/23 18:30:32 6.0 min 0.4 s 324.1 MB
> > 68 23622 0 SUCCESS ANY 1 / 2015/01/23 18:30:32 5.7 min 0.5 s 329.9 MB
> > 62 23616 0 SUCCESS ANY 6 / 2015/01/23 18:30:32 5.8 min 0.7 s 326.4 MB
> > 61 23615 0 SUCCESS ANY 3 / 2015/01/23 18:30:32 5.5 min 1 s 335.7 MB
> > 64 23618 0 SUCCESS ANY 2 / 2015/01/23 18:30:32 5.6 min 2 s 328.1 MB
> >
> > Then towards the end, when things start heading south, I see the following. These tasks never complete, but you can see that they have taken more than 47 minutes (so far) before the job finally fails. Not really sure why.
> >
> > 671 24225 0 RUNNING ANY 1 / 2015/01/23 18:59:14 47 min
> > 672 24226 0 RUNNING ANY 1 / 2015/01/23 18:59:14 47 min
> > 673 24227 0 RUNNING ANY 1 / 2015/01/23 18:59:14 47 min
> > 674 24228 0 RUNNING ANY 1 / 2015/01/23 18:59:14 47 min
> > 675 24229 0 RUNNING ANY 1 / 2015/01/23 18:59:14 47 min
> > 676 24230 0 RUNNING ANY 1 / 2015/01/23 18:59:14 47 min
> > 677 24231 0 RUNNING ANY 1 / 2015/01/23 18:59:14 47 min
> > 678 24232 0 RUNNING ANY 1 / 2015/01/23 18:59:14 47 min
> > 679 24233 0 RUNNING ANY 1 / 2015/01/23 18:59:14 47 min
> > 680 24234 0 RUNNING ANY 1 / 2015/01/23 18:59:17 47 min
> > 681 24235 0 RUNNING ANY 1 / 2015/01/23 18:59:18 47 min
> > 682 24236 0 RUNNING ANY 1 / 2015/01/23 18:59:18 47 min
> > 683 24237 0 RUNNING ANY 5 / 2015/01/23 18:59:20 47 min
> > 684 24238 0 RUNNING ANY 5 / 2015/01/23 18:59:20 47 min
> > 685 24239 0 RUNNING ANY 5 / 2015/01/23 18:59:20 47 min
> > 686 24240 0 RUNNING ANY 5 / 2015/01/23 18:59:20 47 min
> > 687 24241 0 RUNNING ANY 5 / 2015/01/23 18:59:20 47 min
> > 688 24242 0 RUNNING ANY 5 / 2015/01/23 18:59:20 47 min
> > 689 24243 0 RUNNING ANY 5 / 2015/01/23 18:59:20 47 min
> > 690 24244 0 RUNNING ANY 5 / 2015/01/23 18:59:20 47 min
> > 691 24245 0 RUNNING ANY 5 / 2015/01/23 18:59:21 47 min
> >
> > What's odd is that even on the same machine (see below) some tasks are still completing (in less than 5 minutes) while others seem to be hung after 46 minutes. Keep in mind all I'm doing is saving the file to S3, so one would think the amount of work per task/partition would be fairly equal.
> >
> > 694 24248 0 SUCCESS ANY 0 / 2015/01/23 18:59:32 4.5 min 0.3 s 326.5 MB
> > 695 24249 0 SUCCESS ANY 0 / 2015/01/23 18:59:32 4.5 min 0.3 s 330.8 MB
> > 696 24250 0 RUNNING ANY 0 / 2015/01/23 18:59:32 46 min
> > 697 24251 0 RUNNING ANY 0 / 2015/01/23 18:59:32 46 min
> > 698 24252 0 SUCCESS ANY 0 / 2015/01/23 18:59:32 4.5 min 0.3 s 325.8 MB
> > 699 24253 0 SUCCESS ANY 0 / 2015/01/23 18:59:32 4.5 min 0.3 s 325.2 MB
> > 700 24254 0 SUCCESS ANY 0 / 2015/01/23 18:59:32 4.5 min 0.3 s 323.4 MB
> >
> > If anyone has some suggestions please let me know.
> > I've tried playing around with various configuration options, but I've found nothing yet that will fix the underlying issue.
> >
> > Thanks.
> >
> > Darin.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
>
> --
> http://sites.google.com/site/krasser/?utm_source=sig
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org