What makes you think that each executor is reading the whole file? If that were the case, the count returned to the driver would be the actual line count multiplied by the number of executors. Is that what you see when you compare it with the actual number of lines in the input file? If the count returned matches the actual count, you probably don't have an extra-read problem.
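To make that check concrete, here is a minimal sketch in plain Python. The function name and the numbers in the usage lines are hypothetical; you would plug in the value Spark returned, the line count you get from an independent tool, and your executor count.

```python
def looks_like_duplicate_reads(driver_count: int, actual_lines: int, num_executors: int) -> bool:
    """Return True if the driver's count is inflated by exactly the executor count.

    This only encodes the reasoning above: if every executor processed the
    whole file, the total would be actual_lines * num_executors.
    """
    if driver_count == actual_lines:
        # Counts agree, so each line was counted exactly once.
        return False
    return driver_count == actual_lines * num_executors


# Hypothetical numbers, for illustration only.
print(looks_like_duplicate_reads(1000, 1000, 4))  # counts agree
print(looks_like_duplicate_reads(4000, 1000, 4))  # inflated by the executor count
```

A count that matches neither case would point to some other problem and is outside what this check covers.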
I also see this in your logs, which indicates a read that starts at an offset and covers one split size (64MB) of data:

14/11/20 15:39:45 [Executor task launch worker-1 ] INFO HadoopRDD: Input split: s3n://mybucket/myfile:335544320+67108864

On Nov 22, 2014 7:23 AM, "Nitay Joffe" <[email protected]> wrote:

> Err I meant #1 :)
>
> - Nitay
> Founder & CTO
>
>
> On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe <[email protected]> wrote:
>
>> Anyone have any thoughts on this? Trying to understand especially #2 if
>> it's a legit bug or something I'm doing wrong.
>>
>> - Nitay
>> Founder & CTO
>>
>>
>> On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe <[email protected]> wrote:
>>
>>> I have a simple S3 job to read a text file and do a line count.
>>> Specifically I'm doing *sc.textFile("s3n://mybucket/myfile").count*. The
>>> file is about 1.2GB. My setup is standalone spark cluster with 4 workers
>>> each with 2 cores / 16GB ram. I'm using branch-1.2 code built against
>>> hadoop 2.4 (though I'm not actually using HDFS, just straight S3 => Spark).
>>>
>>> The whole count is taking on the order of a couple of minutes, which
>>> seems extremely slow.
>>> I've been looking into it and so far have noticed two things, hoping the
>>> community has seen this before and knows what to do...
>>>
>>> 1) Every executor seems to make an S3 call to read the *entire file* before
>>> making another call to read just its split. Here's a paste I've cleaned up
>>> to show just one task: http://goo.gl/XCfyZA. I've verified this happens
>>> in every task. It is taking a long time (40-50 seconds), I don't see why it
>>> is doing this?
>>> 2) I've tried a few numPartitions parameters. When I make the parameter
>>> anything below 21 it seems to get ignored. Under the hood FileInputFormat
>>> is doing something that always ends up with at least 21 partitions of ~64MB
>>> or so. I've also tried 40, 60, and 100 partitions and have seen that the
>>> performance only gets worse as I increase it beyond 21. I would like to try
>>> 8 just to see, but again I don't see how to force it to go below 21.
>>>
>>> Thanks for the help,
>>> - Nitay
>>> Founder & CTO
>>>
>>>
>>
>
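On question #2 in the quoted message, a rough sketch of why a small numPartitions value appears to be ignored. It mirrors my understanding of the split-size rule in Hadoop's old-API FileInputFormat (split size = max(minSize, min(goalSize, blockSize)), with a 1.1 slop factor on the last chunk); treat that formula as an assumption to verify against your Hadoop version. The 64MB block size is taken from the split length in the log line, and the file size below is an assumed 1.2 GiB, since the thread only says "about 1.2GB". The function names are my own.

```python
SPLIT_SLOP = 1.1  # a final chunk up to 10% over the split size is not split again


def compute_split_size(goal_size: int, min_size: int, block_size: int) -> int:
    # The block size caps the split size unless min_size is raised above it.
    return max(min_size, min(goal_size, block_size))


def count_splits(total_size: int, requested_splits: int,
                 block_size: int = 64 * 1024 * 1024, min_size: int = 1) -> int:
    """Estimate how many input splits a single file of total_size bytes yields."""
    goal_size = total_size // max(requested_splits, 1)
    split_size = compute_split_size(goal_size, min_size, block_size)
    remaining = total_size
    splits = 0
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining != 0:
        splits += 1  # whatever is left becomes the last split
    return splits


# Assumed file size; the exact byte count is not given in the thread.
file_size = int(1.2 * 1024 ** 3)

for requested in (1, 8, 40, 100):
    print(requested, "->", count_splits(file_size, requested))

# Raising the minimum split size is the only knob in this formula that
# lowers the split count below the block-size floor.
print(8, "->", count_splits(file_size, 8, min_size=file_size // 8))
```

Under these assumptions, any request below the block-size floor gives the same number of splits, because goalSize exceeds blockSize and the split size is capped at 64MB. Requests above the floor are honored, since goalSize then falls below the block size. If the formula holds for your build, getting down to 8 partitions would mean raising the minimum split size (I believe the Hadoop 2.x property is mapreduce.input.fileinputformat.split.minsize, but please check) or the block size reported for s3n.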
