Hi, are all users using the same Dataproc cluster?
Regards,
Gourav

On Mon, Mar 7, 2022 at 9:28 AM Saurabh Gulati <saurabh.gul...@fedex.com> wrote:

> Thanks for the response, Gourav.
>
> Queries range from simple to large joins. We expose the data to our
> analytics users so that they can develop their models, and they use
> Superset as the SQL interface for testing.
>
> Hive metastore will *not* do a full scan *if* we specify the partitioning
> column. But that's something users might/do forget, so we were thinking of
> enforcing a way to make sure people *do* specify the partitioning column
> in their queries.
>
> The only way we see for now is to parse the query in Superset to check
> whether the partition column is being used, but we are not sure of a way
> that will work for all types of queries.
>
> For example, we can parse the SQL and check that count(where) ==
> count(partition_column), but this may not work for complex queries.
>
> Regards
> Saurabh
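As an aside, here is a minimal sketch of the parsing check Saurabh describes above. It assumes sqlglot as the SQL parser (the thread does not name one) and a known set of partition columns; resolving which columns belong to which partitioned table, aliases, and CTEs is left out, which is exactly the "complex queries" gap he mentions.

    # Sketch of the partition-filter check, using sqlglot (an assumed choice;
    # the thread does not name a parser). pip install sqlglot
    import sqlglot
    from sqlglot import exp

    def references_partition_column(sql: str, partition_cols: set) -> bool:
        """Return True only if every SELECT in the query (including
        subqueries) has a WHERE clause that touches a partition column."""
        wanted = {c.lower() for c in partition_cols}
        for statement in sqlglot.parse(sql, read="spark"):
            if statement is None:
                continue
            for select in statement.find_all(exp.Select):
                where = select.args.get("where")
                if where is None:
                    return False  # no WHERE clause at all
                cols = {c.name.lower() for c in where.find_all(exp.Column)}
                if not cols & wanted:
                    return False  # WHERE exists, but no partition column in it
        return True

    # Example usage with a hypothetical partition column "event_date":
    print(references_partition_column(
        "SELECT * FROM sales WHERE event_date = '2022-03-07'", {"event_date"}))  # True
    print(references_partition_column(
        "SELECT * FROM sales", {"event_date"}))  # False

Requiring a partition predicate in every SELECT, subqueries included, errs on the strict side; loosening it per table would need metastore lookups of each table's partitioning.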
> ------------------------------
> *From:* Gourav Sengupta <gourav.sengu...@gmail.com>
> *Sent:* 05 March 2022 11:06
> *To:* Saurabh Gulati <saurabh.gul...@fedex.com.invalid>
> *Cc:* Mich Talebzadeh <mich.talebza...@gmail.com>; Kidong Lee <mykid...@gmail.com>; user@spark.apache.org <user@spark.apache.org>
> *Subject:* Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL
>
> Hi,
>
> I completely agree with Saurabh: the use of BQ with Spark does not make
> sense at all if you are trying to cut down your costs. I think that costs
> do matter to a few people, at the end of the day.
>
> Saurabh, is there any chance you can see what actual queries are hitting
> the thrift server? Using a Hive metastore is something I have been doing
> in AWS EMR for the last 5 years, and it certainly does not cause full
> table scans.
>
> Hi Sean,
> for some reason I am not able to receive any emails from the Spark user
> group. My account should be a very old one; is there any chance you can
> kindly look into it and let me know if there is something blocking me? I
> will be sincerely obliged.
>
> Regards,
> Gourav Sengupta
>
>
> On Tue, Feb 22, 2022 at 3:58 PM Saurabh Gulati
> <saurabh.gul...@fedex.com.invalid> wrote:
>
> Hey Mich,
> We use Spark 3.2 now. We are using BQ but migrating away because:
>
> - It's not reflective of our current lake structure, with all the
> deltas, history tables, model outputs, etc.
> - It's pretty expensive to load everything into BQ, and essentially it
> would be a copy of all the data in GCS. External tables in BQ didn't work
> for us. Currently we store only the latest snapshots in BQ, which breaks
> the idempotency of models that need to time-travel and run in the past.
> - We might move to a different cloud provider in the future, so we want
> to be cloud-agnostic.
>
> So we need an execution engine that has the same overview of the data as
> we have in GCS. We tried Presto, but performance was similar and Presto
> didn't support auto-scaling.
>
> TIA
> Saurabh
> ------------------------------
> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
> *Sent:* 22 February 2022 16:49
> *To:* Kidong Lee <mykid...@gmail.com>; Saurabh Gulati <saurabh.gul...@fedex.com>
> *Cc:* user@spark.apache.org <user@spark.apache.org>
> *Subject:* Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL
>
> Ok, interesting.
>
> I am surprised that you are using Hive rather than BigQuery. My
> assumption is that your Spark is version 3.1.1 with standard GKE on
> auto-scaler. What benefits are you getting from using Hive here?
> As you have your Hive tables on GCS buckets, you could easily load your
> Hive tables into BigQuery and run Spark against BigQuery?
>
> HTH
>
> On Tue, 22 Feb 2022 at 15:34, Saurabh Gulati <saurabh.gul...@fedex.com>
> wrote:
>
> Thanks, Sean, for your response.
>
> @Mich Talebzadeh <mich.talebza...@gmail.com> We run all workloads on GKE
> as Docker containers. So to answer your questions: Hive is running in a
> container as a K8s service, the Spark thrift server in another container
> as a service, and Superset in a third container.
>
> We use a Spark-on-GKE setup to run the thrift server, which spawns workers
> depending on the load. For buckets we use GCS.
>
> TIA
> Saurabh
> ------------------------------
> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
> *Sent:* 22 February 2022 16:05
> *To:* Saurabh Gulati <saurabh.gul...@fedex.com.invalid>
> *Cc:* user@spark.apache.org <user@spark.apache.org>
> *Subject:* [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL
>
> Is your Hive on-prem, with external tables in cloud storage?
>
> Where is your Spark running from, and what cloud buckets are you using?
>
> HTH
>
> On Tue, 22 Feb 2022 at 12:36, Saurabh Gulati
> <saurabh.gul...@fedex.com.invalid> wrote:
>
> Hello,
> We are trying to set up Spark as the execution engine for exposing the
> data stored in our lake. We have a Hive metastore running along with the
> Spark thrift server, and we use Superset as the UI.
>
> We save all tables as external tables in the Hive metastore, with storage
> on cloud buckets.
>
> We see that right now, when users run a query in Superset SQL Lab, it
> scans the whole table. What we want is to limit the data scanned by
> setting something like hive.mapred.mode=strict in Spark, so that the user
> gets an exception if they don't specify a partition column.
>
> We tried setting spark.hadoop.hive.mapred.mode=strict in
> spark-defaults.conf on the thrift server, but it still scans the whole
> table. We also tried setting hive.mapred.mode=strict in hive-defaults.conf
> for the metastore container.
>
> We use Spark 3.2 with hive-metastore version 3.1.2.
>
> Is there a way in Spark settings to make this happen?
>
> TIA
> Saurabh
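Since spark.hadoop.hive.mapred.mode=strict apparently isn't honoured by the Spark thrift server, one place to enforce the check is before the query ever leaves Superset. Below is a sketch for superset_config.py, assuming Superset's SQL_QUERY_MUTATOR hook (a real hook, though its exact signature has varied across Superset versions) and the hypothetical references_partition_column helper from the earlier sketch.

    # superset_config.py -- sketch only. SQL_QUERY_MUTATOR is a real Superset
    # hook, but check its signature in your release; PARTITION_COLUMNS and the
    # helper module below are assumptions for illustration.
    from my_helpers import references_partition_column  # hypothetical module
                                                        # holding the earlier sketch

    PARTITION_COLUMNS = {"event_date"}  # hypothetical partition column(s)

    def SQL_QUERY_MUTATOR(sql: str, **kwargs) -> str:
        # Reject queries without a partition filter before they ever reach
        # the Spark thrift server.
        if not references_partition_column(sql, PARTITION_COLUMNS):
            raise ValueError(
                "Query rejected: please filter on a partition column ("
                + ", ".join(sorted(PARTITION_COLUMNS)) + ")."
            )
        return sql

Raising from the mutator should surface the message as a query error in SQL Lab, so users see why the query was rejected instead of getting a silent full scan.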