Thanks, Vadim! That helps and makes sense. I don't think we have so many keys that we need to worry about it, but if we ever do, I would go with an approach similar to the one you suggested.
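Just so I have it written down, here is a rough sketch of how I understood your approach: stage the Parquet output on HDFS, copy the part files to S3 under keys that begin with a random prefix, and save the listing of resulting keys so readers never have to list the bucket. The bucket name, paths, and prefix length below are placeholders I made up, not your actual implementation:

import java.util.UUID

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.SparkSession

object RandomKeyCopy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("random-key-copy").getOrCreate()
    import spark.implicits._

    // 1. Write the Parquet output to HDFS first; no S3 key-naming concerns here.
    val hdfsDir = "hdfs:///tmp/staging/my_table"        // placeholder path
    val df = spark.range(0, 1000).toDF("id")            // stand-in for the real dataset
    df.write.mode("overwrite").parquet(hdfsDir)

    // 2. List the part files and assign each one an S3 key that *starts* with a
    //    random prefix, since randomness later in the key does not help.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    val hdfs = FileSystem.get(new java.net.URI(hdfsDir), hadoopConf)
    val partFiles = hdfs.listStatus(new Path(hdfsDir))
      .map(_.getPath.toString)
      .filter(_.endsWith(".parquet"))

    val bucket = "my-bucket"                            // placeholder bucket
    val targetKeys = partFiles.map { src =>
      val prefix = UUID.randomUUID().toString.take(8)
      (src, s"s3a://$bucket/$prefix/my_table/${new Path(src).getName}")
    }

    // 3. Copy the files in parallel; each Spark task does a plain filesystem copy.
    spark.sparkContext
      .parallelize(targetKeys.toSeq, math.max(targetKeys.length, 1))
      .foreach { case (src, dst) =>
        val conf = new Configuration()                  // built on the executor
        val srcPath = new Path(src)
        val dstPath = new Path(dst)
        FileUtil.copy(srcPath.getFileSystem(conf), srcPath,
                      dstPath.getFileSystem(conf), dstPath,
                      false /* deleteSource */, conf)
      }

    // 4. Save the listing of resulting keys next to the data.
    spark.createDataset(targetKeys.map(_._2).toSeq)
      .coalesce(1)
      .write.mode("overwrite").text(s"s3a://$bucket/listings/my_table")

    // 5. Later, load the listing and pass the keys straight to the Parquet reader.
    val keys = spark.read.textFile(s"s3a://$bucket/listings/my_table").collect()
    val restored = spark.read.parquet(keys: _*)
    restored.show(5)
  }
}

The part I had missed is that the random piece has to come right after the bucket name, and that the listing file is what lets you find the data again afterwards.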
Thanks again,
Subhash

Sent from my iPhone

> On Mar 8, 2018, at 11:56 AM, Vadim Semenov <va...@datadoghq.com> wrote:
>
> You need to put the randomness at the beginning of the key; if you put it
> anywhere other than the beginning, good performance isn't guaranteed.
>
> The way we achieved this is by writing to HDFS first and then having a custom
> DistCp, implemented in Spark, that copies the Parquet files using random keys
> and saves the list of resulting keys to S3. When we want to use those Parquet
> files, we just load the listing file, take the keys from it, and pass them
> into the loader.
>
> You only need to do this when you have way too many files; if the number of
> keys you operate on is reasonably small (say, in the thousands), you won't
> see any benefit.
>
> Also, S3 buckets have internal optimizations, and over time a bucket adjusts
> to its workload, i.e. additional underlying partitions get added, splits
> happen, etc. If you want good performance from the start, then yes, you need
> to use randomization. Alternatively, you can contact AWS and tell them about
> the naming scheme you are going to use (but it must be set in stone), and
> they can try to pre-optimize the bucket for you.
>
>> On Thu, Mar 8, 2018 at 11:42 AM, Subhash Sriram <subhash.sri...@gmail.com> wrote:
>>
>> Hey Spark user community,
>>
>> I am writing Parquet files from Spark to S3 using S3A. I was reading this
>> article about improving S3 bucket performance, specifically how introducing
>> randomness into your key names can help get data written to different
>> partitions:
>>
>> https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/
>>
>> Is there a straightforward way to accomplish this randomness in Spark via
>> the Dataset API? The only thing I could think of would be to split the large
>> set into multiple sets (based on row boundaries) and then write each one
>> with a random key name.
>>
>> Is there an easier way that I am missing?
>>
>> Thanks in advance!
>> Subhash
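P.S. For reference, the split-and-write idea from my original question would look roughly like this with the Dataset API. randomSplit is a random split rather than a split on exact row boundaries, and the bucket, split count, and key format are placeholders I made up:

import java.util.UUID

import org.apache.spark.sql.{DataFrame, SparkSession}

object RandomPrefixWrite {
  // Split the dataset into roughly equal pieces and write each piece under a key
  // whose first path element after the bucket is random.
  def writeWithRandomPrefixes(df: DataFrame, bucket: String, numSplits: Int): Seq[String] = {
    val splits = df.randomSplit(Array.fill(numSplits)(1.0))
    splits.map { part =>
      val prefix = UUID.randomUUID().toString.take(8)
      val target = s"s3a://$bucket/$prefix/my_table"
      part.write.mode("overwrite").parquet(target)
      target
    }.toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("random-prefix-write").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("hdfs:///tmp/staging/my_table")   // placeholder input
    val written = writeWithRandomPrefixes(df, "my-bucket", numSplits = 8)

    // Keep the list of written locations so readers can load everything back
    // without listing the bucket.
    written.toDS()
      .coalesce(1)
      .write.mode("overwrite").text("s3a://my-bucket/listings/my_table")
  }
}

Given what you said about small key counts, though, we will probably just write directly and skip the extra step.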