On 10 Aug 2015, at 20:17, Akshat Aranya <aara...@gmail.com> wrote:
Hi Jerry, Akhil,
Thanks for your help. With s3n, the entire file is downloaded even while
just creating the RDD with sqlContext.read.parquet(). It seems like even
just opening and closing the InputStream causes the entire data to get
fetched.
As it turned out, I was able to use s3a and avoid the full download.
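For anyone hitting the same issue: switching from s3n to s3a generally means changing the URL scheme and setting the s3a properties for the hadoop-aws module. A minimal sketch (property names are from Hadoop 2.6+ hadoop-aws; the values here are placeholders, and depending on your distribution `fs.s3a.impl` may already be set):

```properties
# core-site.xml / Hadoop configuration properties for s3a (hadoop-aws)
fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3a.access.key=YOUR_ACCESS_KEY
fs.s3a.secret.key=YOUR_SECRET_KEY
```

and then read with an s3a:// path, e.g. sqlContext.read.parquet("s3a://bucket/path/"), instead of the s3n:// URL.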
Hi Akshat,
I found an open-source library that implements an S3 InputFormat for Hadoop,
and I used Spark's newAPIHadoopRDD to load data via that S3 InputFormat.
The library is https://github.com/ATLANTBH/emr-s3-io. It is a
little old; I looked inside it and made some changes, and then it worked.
Hi Akshat,
Is there a particular reason you don't use s3a? From my experience, s3a performs
much better than the rest. I believe the inefficiency is from the
implementation of the s3 interface.
Best Regards,
Jerry
Sent from my iPhone
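The behavioral difference described above can be sketched without S3 at all. The class below is a hypothetical stand-in for an S3 object store, contrasting an s3n-style "read the whole object from the front" access pattern with an s3a-style ranged read, and counting how many bytes each one actually transfers to pull an 8-byte footer:

```python
class CountingStore:
    """Toy stand-in for an S3 object; counts bytes handed out per request."""

    def __init__(self, data):
        self.data = data
        self.bytes_transferred = 0

    def read_all(self):
        # s3n-style access: one GET for the whole object, front to back.
        self.bytes_transferred += len(self.data)
        return self.data

    def read_range(self, start, end):
        # s3a-style access: an HTTP Range request transfers only the slice.
        chunk = self.data[start:end]
        self.bytes_transferred += len(chunk)
        return chunk


# A 1 MB "file" whose last 8 bytes play the role of a Parquet footer.
data = b"x" * (1024 * 1024 - 8) + b"FOOTER!!"

# Reading just the footer the streaming way drains the entire object...
full = CountingStore(data)
footer_via_stream = full.read_all()[-8:]
print(full.bytes_transferred)   # 1048576 bytes moved for 8 useful ones

# ...while a ranged read fetches only what was asked for.
ranged = CountingStore(data)
footer_via_range = ranged.read_range(len(data) - 8, len(data))
print(ranged.bytes_transferred)  # 8
```

This is only an illustration of the access pattern, not of either filesystem's actual implementation, but it shows why a format that needs footer metadata (like Parquet) suffers badly on a connector without efficient seek.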
> On 9 Aug, 2015, at 5:48 am, Akhil Das wrote:
>
Depends on which operation you are doing. If you do a .count() on a
Parquet file, it might not download the entire file, I think, but if you do a
.count() on a normal text file it might pull the entire file.
Thanks
Best Regards
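The distinction above comes from the file layouts: Parquet keeps row counts in its footer metadata, so a count() can be answered from the footer alone, whereas a plain text file must be scanned end to end to count record delimiters. A toy illustration (the "footer" format here is invented for the example, not real Parquet):

```python
import struct

def make_columnar(rows):
    # Hypothetical columnar file: payload followed by an 8-byte row-count footer.
    payload = b"".join(rows)
    return payload + struct.pack(">Q", len(rows))

def count_columnar(blob):
    # Only the last 8 bytes are consulted; the payload is never touched.
    return struct.unpack(">Q", blob[-8:])[0]

def count_text(blob):
    # A text file forces a full scan to count newline delimiters.
    return blob.count(b"\n")


rows = [b"row-%d" % i for i in range(1000)]
columnar = make_columnar(rows)
text = b"\n".join(rows) + b"\n"

print(count_columnar(columnar))  # 1000, answered from 8 footer bytes
print(count_text(text))          # 1000, answered by scanning every byte
```

On a seek-friendly connector like s3a the columnar count turns into one small ranged GET; on a stream-only connector both cases cost a full download, which matches the behavior reported earlier in the thread.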
On Sat, Aug 8, 2015 at 3:12 AM, Akshat Aranya wrote:
> Hi,
>
> I'v