Thanks, Vadim! That helps and makes sense. I don't think we have so many keys 
that we need to worry about it, but if we do, I'll take an approach similar to 
the one you suggested.

Thanks again,
Subhash 


> On Mar 8, 2018, at 11:56 AM, Vadim Semenov <va...@datadoghq.com> wrote:
> 
> You need to put the randomness at the beginning of the key; if it goes 
> anywhere else, you're not guaranteed good performance. For example, a key like 
> 4f9c/table/part-00000.parquet spreads load better than 
> table/part-00000-4f9c.parquet, since S3 partitions on the key prefix.
> 
> The way we achieved this is by writing to HDFS first and then running a 
> custom DistCp, implemented in Spark, that copies the Parquet files to S3 under 
> random keys and saves the list of resulting keys to S3 as well. When we want 
> to use those Parquet files, we just load the listing file, take the keys from 
> it, and pass them into the loader.
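> 
> A minimal sketch of that idea in Scala (simplified and hypothetical: the 
> bucket name, the paths, and the df variable below are placeholders, not our 
> exact implementation):
> 
>     import java.util.UUID
>     import org.apache.hadoop.conf.Configuration
>     import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
>     import org.apache.spark.sql.SparkSession
> 
>     val spark = SparkSession.builder().getOrCreate()
>     import spark.implicits._
> 
>     val hdfsDir  = "hdfs:///tmp/my_table"   // placeholder staging dir
>     val s3Bucket = "s3a://my-bucket"        // placeholder bucket
> 
>     // 1. Write to HDFS first (fast renames, no S3 request-rate limits).
>     //    df is whatever DataFrame/Dataset you built upstream (placeholder).
>     df.write.parquet(hdfsDir)
> 
>     // 2. "Custom DistCp": copy each part file to S3 under a random key
>     //    prefix, doing the copies in parallel on the executors.
>     val partFiles = FileSystem
>       .get(spark.sparkContext.hadoopConfiguration)
>       .listStatus(new Path(hdfsDir))
>       .map(_.getPath.toString)
>       .filter(_.endsWith(".parquet"))
> 
>     val s3Keys = spark.sparkContext
>       .parallelize(partFiles, partFiles.length)
>       .map { src =>
>         val conf   = new Configuration()
>         val prefix = UUID.randomUUID().toString.take(8) // randomness at the front of the key
>         val dst    = s"$s3Bucket/$prefix/${new Path(src).getName}"
>         FileUtil.copy(
>           new Path(src).getFileSystem(conf), new Path(src),
>           new Path(dst).getFileSystem(conf), new Path(dst),
>           false, conf)
>         dst
>       }
>       .collect()
> 
>     // 3. Save the listing of resulting keys next to the data.
>     s3Keys.toSeq.toDS().coalesce(1).write.text(s"$s3Bucket/_listings/my_table")
> 
>     // 4. To read: load the listing file and pass the keys to the loader.
>     val paths  = spark.read.textFile(s"$s3Bucket/_listings/my_table").collect()
>     val loaded = spark.read.parquet(paths: _*)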
> 
> You only need to do this when you have way too many files; if the number of 
> keys you operate on is reasonably small (say, in the thousands), you won't see 
> any benefit.
> 
> Also, S3 buckets have internal optimizations, and over time a bucket adjusts 
> to its workload, i.e. additional underlying partitions get added, splits 
> happen, etc.
> If you want good performance from the start, then yes, you would need to use 
> randomization.
> Alternatively, you can contact AWS and tell them about the key naming scheme 
> you're going to use (but it must be set in stone), and they can try to 
> pre-optimize the bucket for you.
> 
>> On Thu, Mar 8, 2018 at 11:42 AM, Subhash Sriram <subhash.sri...@gmail.com> 
>> wrote:
>> Hey Spark user community,
>> 
>> I am writing Parquet files from Spark to S3 using S3A. I was reading the 
>> article below about improving S3 bucket performance, specifically how 
>> introducing randomness into your key names helps spread the data across 
>> different partitions.
>> 
>> https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/
>> 
>> Is there a straightforward way to accomplish this randomness in Spark via 
>> the Dataset API? The only approach I could think of would be to split the 
>> large Dataset into multiple smaller ones (based on row boundaries) and then 
>> write each one under a random key name.
>> 
>> Is there an easier way that I am missing?
>> 
>> Thanks in advance!
>> Subhash
>> 
>> 
> 
