We're seeing a discrepancy in query execution time on S3 with Spark 3.0.0.
Thanks and Regards,
Abhishek
From: Gourav Sengupta
Sent: Wednesday, August 26, 2020 5:49 PM
To: Rao, Abhishek (Nokia - IN/Bangalore)
Cc: user
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
From: Gourav Sengupta
Sent: Wednesday, August 26, 2020 2:35 PM
To: Rao, Abhishek (Nokia - IN/Bangalore)
Cc: user@spark.apache.org
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
Hi Gourav,
Yes. We’re using s3a.
Thanks and Regards,
Abhishek
From: Gourav Sengupta
Sent: Wednesday, August 26, 2020 1:18 PM
To: Rao, Abhishek (Nokia - IN/Bangalore)
Cc: user@spark.apache.org
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
Hi,
Are you using s3a, which does not use EMRFS? In that case, these results do not make sense to me.
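(For anyone following along: a minimal sketch of how s3a is typically wired up in `spark-defaults.conf`. The property names are from the Hadoop S3A connector; the credentials provider shown is just one common choice, an assumption on my part, not something stated in this thread.)

```properties
# Route s3a:// URIs through the Hadoop S3A connector (hadoop-aws and the AWS
# SDK must be on the classpath; on EMR, the s3:// scheme via EMRFS is separate).
spark.hadoop.fs.s3a.impl                      org.apache.hadoop.fs.s3a.S3AFileSystem
# One common credentials option -- pick whatever fits the deployment.
spark.hadoop.fs.s3a.aws.credentials.provider  com.amazonaws.auth.DefaultAWSCredentialsProviderChain
```

Queries would then address the data as e.g. `spark.read.parquet("s3a://<bucket>/tpcds/...")` (bucket path hypothetical).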
Regards,
Gourav Sengupta
On Mon, Aug 24, 2020 at 12:52 PM Rao, Abhishek (Nokia - IN/Bangalore) <abhishek@nokia.com> wrote:
> Hi All,
>
> We're doing some performance comparisons between S3 and HDFS …
… GB of data, whereas in case of HDFS it is only 4.5 GB.
Any idea why this difference is there?
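(One possible explanation worth checking — an assumption, not a confirmed diagnosis of the gap reported above: with the S3A connector's default sequential input policy, the backward seeks that Parquet/ORC readers issue can make the connector drain or re-open its HTTP streams, so the bytes-read counters on S3 can come out much higher than on HDFS for the same query. Hadoop exposes a read-policy switch for exactly this access pattern.)

```properties
# Hadoop S3A tuning for random-access (columnar) reads -- a sketch to test,
# not something established in this thread.
spark.hadoop.fs.s3a.experimental.input.fadvise  random
# A smaller readahead bounds how much extra data each seek can pull in.
spark.hadoop.fs.s3a.readahead.range             256K
```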
Thanks and Regards,
Abhishek
From: Luca Canali
Sent: Monday, August 24, 2020 7:18 PM
To: Rao, Abhishek (Nokia - IN/Bangalore)
Cc: user@spark.apache.org
Subject: RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
Hi Abhishek,
Just a few ideas/comments on the topic:
When benchmarking/testing I find it useful to collect a more complete view of
resource usage and Spark metrics, beyond just measuring query elapsed time.
Something like this:
https://github.com/cerndb/spark-dashboard
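(The per-executor I/O counters behind such a dashboard are also available directly from Spark's monitoring REST API, under `/api/v1/applications/<app-id>/executors`. A minimal sketch that aggregates an already-fetched executor list follows; the field names are from Spark's `ExecutorSummary`, but the sample record is invented for illustration.)

```python
import json

def summarize_executors(executors):
    """Sum a few I/O and time metrics across executor summary records."""
    totals = {"input_bytes": 0, "shuffle_read": 0, "task_time_ms": 0, "gc_time_ms": 0}
    for e in executors:
        totals["input_bytes"] += e.get("totalInputBytes", 0)
        totals["shuffle_read"] += e.get("totalShuffleRead", 0)
        totals["task_time_ms"] += e.get("totalDuration", 0)
        totals["gc_time_ms"] += e.get("totalGCTime", 0)
    return totals

# Invented sample of what the REST endpoint returns, for illustration only.
sample = json.loads("""
[
  {"id": "1", "totalInputBytes": 4500000000, "totalShuffleRead": 1000,
   "totalDuration": 120000, "totalGCTime": 3000},
  {"id": "driver", "totalInputBytes": 0, "totalShuffleRead": 0,
   "totalDuration": 0, "totalGCTime": 0}
]
""")
print(summarize_executors(sample))
```

Comparing the `input_bytes` total between the S3 run and the HDFS run would show the read amplification directly, per query.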
I'd rather not use dynamic allocation when benchmarking.