I'm not a professional user of Amazon services, but I have a guess about
your performance issues. According to [1], there are two different
filesystems over S3:

 - native, which behaves just like regular files (URI scheme: s3n)
 - block-based, which looks more like HDFS (URI scheme: s3)

Since you use "s3n" in your URL, each Spark worker seems to treat the file
as an unsplittable piece of data and downloads all of it (though it
probably applies functions to specific regions only). If I understand it
right, using "s3" instead will allow Spark workers to see the data as a
sequence of blocks and download each block separately.

In any case, using S3 implies a loss of data locality, so data will be
transferred to the workers instead of code being transferred to the data.
Given a data size of 1.2 GB, also consider storing the data in Hadoop's
HDFS instead of S3 (as far as I remember, Amazon allows using both at the
same time).
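
As for the numPartitions question further down the thread, here is a rough
sketch in plain Python of how, as far as I recall, Hadoop's old-API
FileInputFormat picks split sizes. The function names are mine, the 64 MB
block size is an assumption based on the ~64MB partitions you report, and
the 1280 MB file size is a made-up round number, so don't expect it to
reproduce your exact count of 21.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than the rest


def compute_split_size(total_size, requested_splits, block_size, min_size=1):
    """Split size = max(min_size, min(goal_size, block_size))."""
    goal_size = total_size // max(requested_splits, 1)
    return max(min_size, min(goal_size, block_size))


def count_splits(total_size, requested_splits, block_size, min_size=1):
    """Count the splits produced for a single file of total_size bytes."""
    split_size = compute_split_size(
        total_size, requested_splits, block_size, min_size
    )
    remaining = total_size
    splits = 0
    # Carve off full-size splits while more than 1.1 splits' worth remains.
    while remaining / split_size > SPLIT_SLOP:
        remaining -= split_size
        splits += 1
    # Whatever is left over becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    total = 1280 * MB  # hypothetical file size, not the real one
    block = 64 * MB    # assumed block size reported by the filesystem
    for requested in (8, 21, 40, 100):
        print(requested, "->", count_splits(total, requested, block))
```

If this is right, the requested partition count only acts as a lower
bound: asking for fewer splits than (file size / block size) makes the goal
size larger than the block size, and the block size wins. Raising the
minimum split size (mapreduce.input.fileinputformat.split.minsize, if I
recall the Hadoop 2 property name correctly) should be the knob for getting
fewer, larger splits.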

Please, let us know if it works.


[1]: https://wiki.apache.org/hadoop/AmazonS3

On Sat, Nov 22, 2014 at 6:21 PM, Nitay Joffe <[email protected]> wrote:

> Err I meant #1 :)
>
> - Nitay
> Founder & CTO
>
>
> On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe <[email protected]> wrote:
>
>> Anyone have any thoughts on this? Trying to understand especially #2 if
>> it's a legit bug or something I'm doing wrong.
>>
>> - Nitay
>> Founder & CTO
>>
>>
>> On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe <[email protected]> wrote:
>>
>>> I have a simple S3 job to read a text file and do a line count.
>>> Specifically I'm doing *sc.textFile("s3n://mybucket/myfile").count*. The
>>> file is about 1.2GB. My setup is standalone spark cluster with 4 workers
>>> each with 2 cores / 16GB ram. I'm using branch-1.2 code built against
>>> hadoop 2.4 (though I'm not actually using HDFS, just straight S3 => Spark).
>>>
>>> The whole count is taking on the order of a couple of minutes, which
>>> seems extremely slow.
>>> I've been looking into it and so far have noticed two things, hoping the
>>> community has seen this before and knows what to do...
>>>
>>> 1) Every executor seems to make an S3 call to read the *entire file* before
>>> making another call to read just its split. Here's a paste I've cleaned up
>>> to show just one task: http://goo.gl/XCfyZA. I've verified this happens
>>> in every task. It is taking a long time (40-50 seconds), I don't see why it
>>> is doing this?
>>> 2) I've tried a few numPartitions parameters. When I make the parameter
>>> anything below 21 it seems to get ignored. Under the hood FileInputFormat
>>> is doing something that always ends up with at least 21 partitions of ~64MB
>>> or so. I've also tried 40, 60, and 100 partitions and have seen that the
>>> performance only gets worse as I increase it beyond 21. I would like to try
>>> 8 just to see, but again I don't see how to force it to go below 21.
>>>
>>> Thanks for the help,
>>> - Nitay
>>> Founder & CTO
>>>
>>>
>>
>
