States get dropped in Structured Streaming

2020-10-22 Thread Eric Beabes
We're using Stateful Structured Streaming in Spark 2.4. We are noticing that when the load on the system is heavy and lots of messages are coming in, some of the states disappear with no error message. Any suggestions on how we can debug this? Any tips for fixing it? Thanks in advance.
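One place worth checking is timeout-based state eviction: in `mapGroupsWithState`/`flatMapGroupsWithState`, state whose timeout falls behind the watermark is removed silently, with no error, which can look like states "disappearing" when a heavy burst of messages advances the watermark sharply. A minimal sketch of where that happens (the `Event`/`Counter` types and the one-hour timeout are illustrative, not from the thread):

```scala
import java.sql.Timestamp
import org.apache.spark.sql.streaming.GroupState

// Hypothetical event and state types for illustration.
case class Event(key: String, ts: Timestamp)
case class Counter(count: Long)

def updateState(key: String, events: Iterator[Event],
                state: GroupState[Counter]): Counter = {
  if (state.hasTimedOut) {
    // Reached when the watermark passes the timeout timestamp.
    // Spark drops the state silently after this call returns,
    // so log here if you need to see which keys are evicted.
    val last = state.get
    state.remove()
    last
  } else {
    val c = Counter(state.getOption.map(_.count).getOrElse(0L) + events.size)
    state.update(c)
    // Timeout is measured against the watermark; a large watermark
    // jump under heavy load can expire many keys in one batch.
    state.setTimeoutTimestamp(state.getCurrentWatermarkMs() + 3600000L)
    c
  }
}
```

If you are not using timeouts at all, the state store checkpoint directory is the other thing to audit (e.g. a checkpoint on non-durable storage being lost between restarts).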

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Mich Talebzadeh
Thanks for the feedback Sean. Kind regards, Mich. LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *Disclaimer:* Use it at your own risk. Any and all responsibili…

Re: Spark hive build and connectivity

2020-10-22 Thread Ravi Shankar
Thanks! I have a very similar setup. I have built Spark with -Phive, which includes the hive-2.3.7 jars, spark-hive* jars, and some hadoop-common* jars. At runtime, I set SPARK_DIST_CLASSPATH=${hadoop classpath} and set spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars to $HIVE_HOME/l…
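For readers following along, a setup along these lines typically looks like the following (the jar path and application names are illustrative, not from the thread):

```shell
# Prefer the Hadoop jars already on the cluster over the ones
# bundled with Spark.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

# Point Spark's Hive client at the metastore version you run,
# and at the jars implementing that client.
spark-submit \
  --conf spark.sql.hive.metastore.version=2.3.7 \
  --conf "spark.sql.hive.metastore.jars=/opt/hive/lib/*" \
  --class com.example.MyJob \
  my-job.jar
```

spark.sql.hive.metastore.jars takes a standard classpath string; `builtin` (the default) uses the jars Spark was built with, while `maven` downloads the matching client at runtime.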

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Sean Owen
I don't find this trolling; I agree with the observation that 'the skills you have' are a valid and important determiner of what tools you pick. I disagree that you just have to pick the optimal tool for everything. Sounds good until that comes in contact with the real world. For Spark, Python vs S…

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Gourav Sengupta
Hi Mich, this is turning into a troll now; can you please stop this? No one uses Scala where Python should be used, and no one uses Python where Scala should be used; it all depends on requirements. Everyone understands polyglot programming and how to use relevant technologies best to their adva…

Re: Spark hive build and connectivity

2020-10-22 Thread Kimahriman
I have always been a little confused about the different Hive-version integration as well. To expand on this question: we have a Hive 3.1.1 metastore that we can successfully interact with using the -Phive profile with Hive 2.3.7. We do not use the Hive 3.1.1 jars anywhere in our Spark applications…

Re: Spark hive build and connectivity

2020-10-22 Thread Mich Talebzadeh
Hi, to access Hive tables Spark uses the native API as below (default), where you have set up a symlink in $SPARK_HOME/conf: hive-site.xml -> /data6/hduser/hive-3.0.0/conf/hive-site.xml. Then: val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc); HiveContext.sql("use ilayer"); val account_table = HiveContext…
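Note that HiveContext has been deprecated since Spark 2.0; the equivalent in Spark 2.x is a SparkSession with Hive support enabled. A sketch of the same access pattern (the `ilayer` database is from the thread; the table name and app name are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport() makes Spark pick up hive-site.xml from
// $SPARK_HOME/conf and resolve tables through the Hive metastore.
val spark = SparkSession.builder()
  .appName("HiveAccess")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("USE ilayer")                    // database named in the thread
val accountTable = spark.table("accounts") // table name is illustrative
accountTable.show(10)
```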

Re: Spark hive build and connectivity

2020-10-22 Thread Artemis User
By default Spark will build with Hive 2.3.7, according to the Spark build doc. If you want to replace it with a different Hive jar, you need to change the Maven pom.xml file. -- ND
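For reference, the standard Hive-enabled build described in the Spark build documentation is:

```shell
# Build Spark with Hive and the Thrift JDBC/ODBC server; the -Phive
# profile bundles the default Hive 2.3.7 client jars. Changing that
# version means editing the pinned dependency in pom.xml.
./build/mvn -Phive -Phive-thriftserver -DskipTests clean package
```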

Re: Spark hive build and connectivity

2020-10-22 Thread Ravi Shankar
Hello Mich, I am just trying to access Hive tables from my Hive 3.2.1 cluster using Spark. Basically I just want my Spark jobs to be able to access these Hive tables, and I want to understand how Spark jobs interact with Hive to access them. - I see that whenever I build Spark with Hive suppo…

Re: Spark hive build and connectivity

2020-10-22 Thread Mich Talebzadeh
Hi Ravi, What exactly are you trying to do? Do you want to enhance Spark SQL, or do you want to run Hive on the Spark engine? HTH

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Mich Talebzadeh
Today I had a discussion with a lead developer on a client site regarding Scala versus PySpark with Spark. They were not doing data science and reluctantly agreed that PySpark was used for ETL. In mitigation he mentioned that in his team he is the only one that is an expert on Scala (his words) and…

Spark hive build and connectivity

2020-10-22 Thread Ravi Shankar
Hello all, I am trying to understand how the Spark SQL integration with Hive works. Whenever I build Spark with the -Phive -Phive-thriftserver options, I see that it is packaged with hive-2.3.7*.jars and spark-hive*.jars, and the documentation claims that Spark can talk to different versions of Hive.