Hi Frank,

We have thousands of small files; each file is between 6 KB and maybe 100 KB.
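With files that small, per-file overhead tends to dominate whether they sit in S3 or HDFS. As a minimal sketch of one common mitigation (assuming Scala Spark and hypothetical HDFS paths), one can read them with wholeTextFiles and coalesce before writing, so downstream jobs see a few large files instead of thousands of tiny ones:

import org.apache.spark.{SparkConf, SparkContext}

object CompactSmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("compact-small-files"))

    // wholeTextFiles yields (path, content) pairs and packs many small
    // files into far fewer partitions than textFile would create.
    val files = sc.wholeTextFiles("hdfs:///data/incoming/*")

    // Coalesce before writing so the output is a handful of larger files
    // instead of thousands of 6-100 KB ones.
    files.values
      .coalesce(16)
      .saveAsTextFile("hdfs:///data/compacted")

    sc.stop()
  }
}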
Conductor looks interesting.

Andy

From: Frank Austin Nothaft <fnoth...@berkeley.edu>
Date: Tuesday, March 15, 2016 at 11:59 AM
To: Andrew Davidson <a...@santacruzintegration.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: newbie HDFS S3 best practices

> Hard to say with #1 without knowing your application's characteristics; for
> #2, we use conductor <https://github.com/BD2KGenomics/conductor> with IAM
> roles and .boto/.aws/credentials files.
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
>
>> On Mar 15, 2016, at 11:45 AM, Andy Davidson <a...@santacruzintegration.com> wrote:
>>
>> We use the spark-ec2 script to create AWS clusters as needed (we do not use AWS EMR).
>>
>> 1. Will we get better performance if we copy data to HDFS before we run, instead of reading directly from S3?
>> 2. What is a good way to move results from HDFS to S3?
>>
>> It seems like there are many ways to bulk-copy to S3. Many of them require that we explicitly embed AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@ in the URI. This seems like a bad idea.
>>
>> What would you recommend?
>>
>> Thanks,
>>
>> Andy
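On question #2 and the embedded-credentials concern: if the EC2 instances carry an IAM role, or the keys live in a ~/.aws/credentials file as Frank suggests, Spark can write straight to S3 with no keys in the URI. A minimal sketch, assuming a Hadoop build that ships the s3a connector (older spark-ec2 AMIs may only have s3n, whose property names differ) and a hypothetical bucket name:

import org.apache.spark.{SparkConf, SparkContext}

object WriteResultsToS3 {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("write-results-to-s3"))

    // Only needed when the instances do NOT have an IAM role. Prefer a role
    // or the credentials file, so keys never appear in code or URIs.
    // sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    // sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    val results = sc.textFile("hdfs:///data/results")

    // Credentials come from the IAM role or Hadoop config, not the URI.
    results.saveAsTextFile("s3a://my-results-bucket/run-output/")

    sc.stop()
  }
}

With credentials configured the same way, hadoop distcp hdfs:///data/results s3a://my-results-bucket/results is another common way to bulk-copy an existing HDFS directory to S3 outside of a Spark job.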