Hi, are all users using the same Dataproc cluster?
Regards,
Gourav

On Mon, Mar 7, 2022 at 9:28 AM Saurabh Gulati <saurabh.gul...@fedex.com> wrote:

> Thanks for the response, Gourav.
>
> Queries range from simple to large joins. We expose the data to our
> analytics users so that they can develop their models, and they use
> Superset as the SQL interface for testing.
>
> Hive metastore will *not* do a full scan *if* we specify the partitioning
> column. But that's something users might/do forget, so we were thinking of
> enforcing a way to make sure people *do* specify the partitioning column
> in their queries.
>
> The only way we see for now is to parse the query in Superset to check
> whether the partition column is being used, but we are not sure of a way
> that will work for all types of queries.
>
> For example, we can parse the SQL and check that count(where) ==
> count(partition_column), but this may not work for complex queries.
>
> Regards
> Saurabh
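As an aside, here is a minimal sketch of the parsing check Saurabh describes above. It assumes sqlglot as the SQL parser (the thread does not name one) and a known set of partition columns; resolving which columns belong to which partitioned table, aliases, and CTEs is left out, which is exactly the "complex queries" gap he mentions.

    # Sketch of the partition-filter check, using sqlglot (an assumed choice;
    # the thread does not name a parser). pip install sqlglot
    import sqlglot
    from sqlglot import exp

    def references_partition_column(sql: str, partition_cols: set) -> bool:
        """Return True only if every SELECT in the query (including
        subqueries) has a WHERE clause that touches a partition column."""
        wanted = {c.lower() for c in partition_cols}
        for statement in sqlglot.parse(sql, read="spark"):
            if statement is None:
                continue
            for select in statement.find_all(exp.Select):
                where = select.args.get("where")
                if where is None:
                    return False  # no WHERE clause at all
                cols = {c.name.lower() for c in where.find_all(exp.Column)}
                if not cols & wanted:
                    return False  # WHERE exists, but no partition column in it
        return True

    # Example usage with a hypothetical partition column "event_date":
    print(references_partition_column(
        "SELECT * FROM sales WHERE event_date = '2022-03-07'", {"event_date"}))  # True
    print(references_partition_column(
        "SELECT * FROM sales", {"event_date"}))  # False

Requiring a partition predicate in every SELECT, subqueries included, errs on the strict side; loosening it per table would need metastore lookups of each table's partitioning.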
> ------------------------------
> *From:* Gourav Sengupta <gourav.sengu...@gmail.com>
> *Sent:* 05 March 2022 11:06
> *To:* Saurabh Gulati <saurabh.gul...@fedex.com.invalid>
> *Cc:* Mich Talebzadeh <mich.talebza...@gmail.com>; Kidong Lee <mykid...@gmail.com>; user@spark.apache.org <user@spark.apache.org>
> *Subject:* Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL
>
> Hi,
>
> I completely agree with Saurabh: the use of BQ with Spark does not make
> sense at all if you are trying to cut down your costs. I think that costs
> do matter to a few people, at the end of the day.
>
> Saurabh, is there any chance you can see what actual queries are hitting
> the thrift server? Using a Hive metastore is something I have been doing
> in AWS EMR for the last 5 years, and it certainly does not cause full
> table scans.
>
> Hi Sean,
> for some reason I am not able to receive any emails from the Spark user
> group. My account should be a very old one; is there any chance you can
> kindly look into it and let me know if there is something blocking me? I
> will be sincerely obliged.
>
> Regards,
> Gourav Sengupta
>
>
> On Tue, Feb 22, 2022 at 3:58 PM Saurabh Gulati
> <saurabh.gul...@fedex.com.invalid> wrote:
>
> Hey Mich,
> We use Spark 3.2 now. We are using BQ but migrating away because:
>
> - It's not reflective of our current lake structure, with all the
> deltas, history tables, model outputs, etc.
> - It's pretty expensive to load everything into BQ, and essentially it
> would be a copy of all the data in GCS. External tables in BQ didn't work
> for us. Currently we store only the latest snapshots in BQ, which breaks
> the idempotency of models that need to time-travel and run in the past.
> - We might move to a different cloud provider in the future, so we want
> to be cloud-agnostic.
>
> So we need an execution engine that has the same overview of the data as
> we have in GCS. We tried Presto, but performance was similar and Presto
> didn't support auto-scaling.
>
> TIA
> Saurabh
> ------------------------------
> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
> *Sent:* 22 February 2022 16:49
> *To:* Kidong Lee <mykid...@gmail.com>; Saurabh Gulati <saurabh.gul...@fedex.com>
> *Cc:* user@spark.apache.org <user@spark.apache.org>
> *Subject:* Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL
>
> Ok, interesting.
>
> I am surprised that you are using Hive rather than BigQuery. My
> assumption is that your Spark is version 3.1.1 with standard GKE on
> auto-scaler. What benefits are you getting from using Hive here?
> As you have your Hive tables on GCS buckets, you could easily load your
> Hive tables into BigQuery and run Spark against BigQuery?
>
> HTH
>
> On Tue, 22 Feb 2022 at 15:34, Saurabh Gulati <saurabh.gul...@fedex.com>
> wrote:
>
> Thanks, Sean, for your response.
>
> @Mich Talebzadeh <mich.talebza...@gmail.com> We run all workloads on GKE
> as Docker containers. So to answer your questions: Hive is running in a
> container as a K8s service, the Spark thrift server in another container
> as a service, and Superset in a third container.
>
> We use a Spark-on-GKE setup to run the thrift server, which spawns workers
> depending on the load. For buckets we use GCS.
>
> TIA
> Saurabh
> ------------------------------
> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
> *Sent:* 22 February 2022 16:05
> *To:* Saurabh Gulati <saurabh.gul...@fedex.com.invalid>
> *Cc:* user@spark.apache.org <user@spark.apache.org>
> *Subject:* [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL
>
> Is your Hive on-prem, with external tables in cloud storage?
>
> Where is your Spark running from, and what cloud buckets are you using?
>
> HTH
>
> On Tue, 22 Feb 2022 at 12:36, Saurabh Gulati
> <saurabh.gul...@fedex.com.invalid> wrote:
>
> Hello,
> We are trying to set up Spark as the execution engine for exposing the
> data stored in our lake. We have a Hive metastore running along with the
> Spark thrift server, and we use Superset as the UI.
>
> We save all tables as external tables in the Hive metastore, with storage
> on cloud buckets.
>
> We see that right now, when users run a query in Superset SQL Lab, it
> scans the whole table. What we want is to limit the data scanned by
> setting something like hive.mapred.mode=strict in Spark, so that the user
> gets an exception if they don't specify a partition column.
>
> We tried setting spark.hadoop.hive.mapred.mode=strict in
> spark-defaults.conf on the thrift server, but it still scans the whole
> table. We also tried setting hive.mapred.mode=strict in hive-defaults.conf
> for the metastore container.
>
> We use Spark 3.2 with hive-metastore version 3.1.2.
>
> Is there a way in Spark settings to make this happen?
>
> TIA
> Saurabh
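Since spark.hadoop.hive.mapred.mode=strict apparently isn't honoured by the Spark thrift server, one place to enforce the check is before the query ever leaves Superset. Below is a sketch for superset_config.py, assuming Superset's SQL_QUERY_MUTATOR hook (a real hook, though its exact signature has varied across Superset versions) and the hypothetical references_partition_column helper from the earlier sketch.

    # superset_config.py -- sketch only. SQL_QUERY_MUTATOR is a real Superset
    # hook, but check its signature in your release; PARTITION_COLUMNS and the
    # helper module below are assumptions for illustration.
    from my_helpers import references_partition_column  # hypothetical module
                                                        # holding the earlier sketch

    PARTITION_COLUMNS = {"event_date"}  # hypothetical partition column(s)

    def SQL_QUERY_MUTATOR(sql: str, **kwargs) -> str:
        # Reject queries without a partition filter before they ever reach
        # the Spark thrift server.
        if not references_partition_column(sql, PARTITION_COLUMNS):
            raise ValueError(
                "Query rejected: please filter on a partition column ("
                + ", ".join(sorted(PARTITION_COLUMNS)) + ")."
            )
        return sql

Raising from the mutator should surface the message as a query error in SQL Lab, so users see why the query was rejected instead of getting a silent full scan.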