Hi Michael,

Thanks for your answer. My path is exactly as you mention: s3://my-bucket/<year>/<month>/<date>/*.avro. I'm definitely not using wildcards in any part other than the file name, so I don't think that is the issue. The weird thing is that, on the same data set, in roughly 1 of every 20 jobs one of the tasks randomly gets stuck while reading.
Some other stats:
- The number of files in the folder is 48.
- The number of partitions used when reading the data is 7315.
- The maximum size of a single file is 14 GB.
- The total size of the folder is around 270 GB.

2015-11-12 10:58 GMT+01:00 Michael Cutler <mich...@cutler.io>:

> Reading files directly from Amazon S3 can be frustrating, especially if
> you're dealing with a large number of input files. Could you please
> elaborate more on your use case? Does the S3 bucket in question already
> contain a large number of files?
>
> The implementation of the * wildcard operator in S3 input paths requires
> an AWS S3 API call to list everything based on the common prefix; so if
> your input is something like:
>
> s3://my-bucket/<year>/<month>/<date>/*.json
>
> then the prefix "<year>/<month>/<date>/" will be passed to the API and
> should be fairly efficient.
>
> However, if you're doing something more adventurous like:
>
> s3://my-bucket/*/*/*/*.json
>
> there is no common prefix to give the API here; it will literally list
> every object in the bucket and then filter client-side to find anything
> that matches "*.json". These types of requests are prone to timeouts and
> other intermittent issues, as well as taking a ridiculous amount of time
> before the job can start.

--
Alessandro Chacón
Aecc_ORG
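To illustrate the prefix behaviour Michael describes: the listing prefix sent to S3 is the literal portion of the key pattern before the first wildcard, and everything after it has to be filtered client-side. Below is a minimal, self-contained Python sketch of that split (the helper names are hypothetical, not from any S3 client library; note that fnmatch's * also crosses /, a simplification compared with Hadoop glob semantics):

```python
import fnmatch


def common_prefix(pattern: str) -> str:
    """Return the literal key prefix of an S3 glob pattern, i.e. the part
    before the first wildcard, cut back to the last '/'. This is what could
    be passed as the Prefix parameter of a ListObjects call."""
    for i, ch in enumerate(pattern):
        if ch in "*?[":
            literal = pattern[:i]
            # Keep only whole path components before the wildcard.
            return literal.rpartition("/")[0] + "/" if "/" in literal else ""
    return pattern  # no wildcard at all: the pattern is its own prefix


def match_keys(keys, pattern):
    """Simulate server-side prefix listing plus client-side glob filtering."""
    prefix = common_prefix(pattern)
    listed = [k for k in keys if k.startswith(prefix)]  # what S3 would return
    return [k for k in listed if fnmatch.fnmatch(k, pattern)]
```

With a pattern like "2015/11/12/*.json" the prefix "2015/11/12/" restricts the listing to one day's keys, whereas "*/*/*/*.json" yields an empty prefix, so every object in the bucket is listed and filtered locally, which matches the slow behaviour described above.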