Hi Michael,

Thanks for your answer.
My path is exactly as you mentioned: s3://my-bucket/<year>/<month>/<date>/*.avro
I'm definitely not using wildcards anywhere other than the date part, so I
don't think that's the issue.
The weird thing is that, on the same data set, roughly 1 in every 20 jobs
has one task that gets stuck while reading.

Some other stats:

The folder contains 48 files.
Reading the data produces 7315 partitions.
The largest single file is 14G.
The total size of the folder is around 270G.
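
For reference, this is roughly how the data is read (a minimal sketch
assuming the databricks spark-avro package; the app name, bucket and date
below are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  // Sketch of the read: one date-scoped glob, wildcard only on the file name.
  val conf = new SparkConf().setAppName("avro-read-sketch")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val events = sqlContext.read
    .format("com.databricks.spark.avro")
    .load("s3://my-bucket/2015/11/12/*.avro")

  // ~48 files, ~270G total, largest file 14G; ends up as ~7315 partitions.
  println(events.rdd.partitions.length)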

2015-11-12 10:58 GMT+01:00 Michael Cutler <mich...@cutler.io>:

> Reading files directly from Amazon S3 can be frustrating, especially if
> you're dealing with a large number of input files. Could you please
> elaborate more on your use case?  Does the S3 bucket in question already
> contain a large number of files?
>
> The implementation of the * wildcard operator in S3 input paths requires
> an AWS S3 API call to list everything based on the common-prefix; so if
> your input is something like:
>
>   s3://my-bucket/<year>/<month>/<date>/*.json
>
> then the prefix "<year>/<month>/<date>/" will be passed to the API, and the
> listing should be fairly efficient.
>
> However, if you're doing something more adventurous like:
>
>   s3://my-bucket/*/*/*/*.json
>
> there is no common prefix to give the API here, so it will literally list
> every object in the bucket and then filter client-side to find anything
> that matches "*.json". These types of requests are prone to timeouts and
> other intermittent issues, as well as taking a ridiculous amount of time
> before the job can start.
>
>
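
For anyone following along, the difference Michael describes can be seen
directly with the S3 listing API. Here is a minimal sketch using the AWS SDK
for Java (v1) from Scala; the bucket name and date prefix are placeholders:

  import com.amazonaws.services.s3.AmazonS3Client
  import com.amazonaws.services.s3.model.ListObjectsRequest

  val s3 = new AmazonS3Client() // credentials from the default provider chain

  // Date-scoped glob: only the "2015/11/12/" prefix is listed, which is cheap.
  val scoped = s3.listObjects(new ListObjectsRequest()
    .withBucketName("my-bucket")
    .withPrefix("2015/11/12/"))
  println(s"scoped listing: ${scoped.getObjectSummaries.size} keys (first page)")

  // Wildcards before the date leave no usable prefix, so every key in the
  // bucket is listed and filtered client-side: the slow, timeout-prone case.
  val unscoped = s3.listObjects(new ListObjectsRequest().withBucketName("my-bucket"))
  println(s"full-bucket listing: ${unscoped.getObjectSummaries.size} keys (first page)")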


-- 
Alessandro Chacón
Aecc_ORG
