What makes you think that each executor is reading the whole file? If that
is the case then the count value returned to the driver will be actual X
NumOfExecutors. Is that the case when compared with actual lines in the
input file? If the count returned is same as actual then you probably don't
have an extra read problem.

I also see this in your logs which indicates that it is a read that starts
from an offset and reading one split size (64MB) worth of data:

14/11/20 15:39:45 [Executor task launch worker-1 ] INFO HadoopRDD: Input
split: s3n://mybucket/myfile:335544320+67108864
On Nov 22, 2014 7:23 AM, "Nitay Joffe" <[email protected]> wrote:

> Err I meant #1 :)
>
> - Nitay
> Founder & CTO
>
>
> On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe <[email protected]> wrote:
>
>> Anyone have any thoughts on this? Trying to understand especially #2 if
>> it's a legit bug or something I'm doing wrong.
>>
>> - Nitay
>> Founder & CTO
>>
>>
>> On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe <[email protected]> wrote:
>>
>>> I have a simple S3 job to read a text file and do a line count.
>>> Specifically I'm doing *sc.textFile("s3n://mybucket/myfile").count*.The
>>> file is about 1.2GB. My setup is standalone spark cluster with 4 workers
>>> each with 2 cores / 16GB ram. I'm using branch-1.2 code built against
>>> hadoop 2.4 (though I'm not actually using HDFS, just straight S3 => Spark).
>>>
>>> The whole count is taking on the order of a couple of minutes, which
>>> seems extremely slow.
>>> I've been looking into it and so far have noticed two things, hoping the
>>> community has seen this before and knows what to do...
>>>
>>> 1) Every executor seems to make an S3 call to read the *entire file* before
>>> making another call to read just it's split. Here's a paste I've cleaned up
>>> to show just one task: http://goo.gl/XCfyZA. I've verified this happens
>>> in every task. It is taking a long time (40-50 seconds), I don't see why it
>>> is doing this?
>>> 2) I've tried a few numPartitions parameters. When I make the parameter
>>> anything below 21 it seems to get ignored. Under the hood FileInputFormat
>>> is doing something that always ends up with at least 21 partitions of ~64MB
>>> or so. I've also tried 40, 60, and 100 partitions and have seen that the
>>> performance only gets worse as I increase it beyond 21. I would like to try
>>> 8 just to see, but again I don't see how to force it to go below 21.
>>>
>>> Thanks for the help,
>>> - Nitay
>>> Founder & CTO
>>>
>>>
>>
>

Reply via email to