Re: Ordering pushdown for Spark Datasources

2021-04-06 Thread Mich Talebzadeh
Lucene. I came across it years ago. Does Lucene support JDBC connection at all? How about Solr? HTH

jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Mich Talebzadeh
Hi, Any chance of someone testing the latest spark-sql-kafka-0-10_2.12-3.1.1.jar for Spark? It throws java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z However, the previous version spark-sql-kafka-0-10_2.12-3.0.1.jar works

Re: Spark Structured Streaming with PySpark throwing error in execution

2021-04-06 Thread Mich Talebzadeh
Hi all, Following the upgrade to 3.1.1, I see a couple of issues. Spark Structured Streaming (SSS) does not seem to work with the newer spark-sql-kafka-0-10_2.12-3.1.1.jar for Spark. It throws java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;

unsubscribe

2021-04-06 Thread Latha Appanna

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Gabor Somogyi
Since you've not shared many details, I presume you've updated only the spark-sql-kafka jar. KafkaTokenUtil is in the token provider jar. As a general note, if I'm right, please update Spark as a whole on all nodes and not just individual jars. BR, G

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Gabor Somogyi
I've just had a deeper look at the possible issue and here are my findings:
* In 3.0.1 KafkaTokenUtil.needTokenUpdate has 3 params
* In 3.1.1 KafkaTokenUtil.needTokenUpdate has 2 params
* I've decompiled spark-token-provider-kafka-0-10_2.12-3.1.1.jar and KafkaTokenUtil.needTokenUpdate has 2 params
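
A minimal sketch of how a mismatch like the one described above could be spotted from jar file names alone. The helper name `find_version_mismatches` and the sample jar list are hypothetical, chosen only to illustrate a 3.1.1 spark-sql-kafka jar sitting next to a 3.0.1 token provider jar. It checks only Spark's own `spark-*_<scala>-<version>.jar` artifacts and would not detect classes bundled inside a third-party fat jar.

```python
import re
from typing import Iterable, List, Tuple

# Matches Spark's own artifacts, e.g. spark-sql-kafka-0-10_2.12-3.1.1.jar
SPARK_JAR = re.compile(
    r"^(?P<name>spark-[a-z0-9-]+)_(?P<scala>\d+\.\d+)-(?P<version>\d+\.\d+\.\d+)\.jar$"
)


def find_version_mismatches(
    jar_names: Iterable[str], expected_version: str
) -> List[Tuple[str, str]]:
    """Return (jar, version) pairs for Spark jars not at expected_version."""
    mismatches = []
    for jar in jar_names:
        match = SPARK_JAR.match(jar)
        # Non-Spark jars (no match) are ignored.
        if match and match.group("version") != expected_version:
            mismatches.append((jar, match.group("version")))
    return mismatches


# Hypothetical classpath listing, for illustration only.
jars = [
    "spark-sql-kafka-0-10_2.12-3.1.1.jar",
    "spark-token-provider-kafka-0-10_2.12-3.0.1.jar",
    "commons-pool2-2.9.0.jar",
]
mismatches = find_version_mismatches(jars, "3.1.1")
for jar, version in mismatches:
    print(f"{jar} is at {version}, expected 3.1.1")
```

In practice the list of names could come from something like `os.listdir` on the `$SPARK_HOME/jars` directory of each node.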

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Mich Talebzadeh
Thanks Gabor. All nodes are running Spark /spark-3.1.1-bin-hadoop3.2. So $SPARK_HOME/jars contains all the required jars on all nodes, including the jar file commons-pool2-2.9.0.jar. They are installed identically on all nodes. I have looked at the Spark environment for classpath. Still I

Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-06 Thread Ranju Jain
Hi All, I have enabled dynamic allocation while running Spark on Kubernetes. But new executors are requested if pending tasks are backlogged for more than the duration configured in the property "spark.dynamicAllocation.schedulerBacklogTimeout". My Use Case is: There are number of parallel jobs
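
For orientation, a small sketch of the settings involved, expressed as `--conf` flags for spark-submit. The property names are standard Spark configuration keys; the `5s` values and the helper `to_conf_flags` are illustrative assumptions, not recommendations. Shuffle tracking is included because Kubernetes deployments typically have no external shuffle service.

```python
from typing import Dict, List

# Illustrative values only; tune the timeouts for the actual workload.
dynamic_allocation_conf: Dict[str, str] = {
    "spark.dynamicAllocation.enabled": "true",
    # Lets dynamic allocation work without an external shuffle service.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    # How long tasks must be backlogged before the first executor request.
    "spark.dynamicAllocation.schedulerBacklogTimeout": "5s",
    # Same idea, applied to subsequent executor requests.
    "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": "5s",
}


def to_conf_flags(conf: Dict[str, str]) -> List[str]:
    """Turn a config mapping into spark-submit --conf arguments."""
    flags: List[str] = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


flags = to_conf_flags(dynamic_allocation_conf)
print(" ".join(flags))
```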

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Sean Owen
You may be compiling your app against 3.0.1 JARs but submitting to 3.1.1. You do not in general modify the Spark libs. You need to package libs like this with your app at the correct version.

Re: Tuning spark job to make count faster.

2021-04-06 Thread Sean Owen
Hard to say without a lot more info, but 76.5K tasks is very large. How big are the tasks / how long do they take? If very short, you should repartition down. Do you end up with 800 executors? If so, why 2 per machine? That generally is a loss at this scale of worker. I'm confused because you have 4

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Mich Talebzadeh
OK thanks for that. I am using spark-submit with PySpark as follows:

spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.9, Java HotSpot(TM) 64-Bit Se

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Gabor Somogyi
> Anyway I unzipped the tarball for Spark-3.1.1 and there is no spark-sql-kafka-0-10_2.12-3.0.1.jar even

Please see how Structured Streaming app with Kafka needs to be deployed here: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying I don't see the --packag
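
As a rough illustration of the deployment approach on the linked page, the sketch below assembles a spark-submit command that pulls the Kafka connector through `--packages` instead of copying jars into the cluster. The Maven coordinate follows the `org.apache.spark:spark-sql-kafka-0-10_<scala>:<spark version>` pattern; the script name `streaming_app.py` and both helper functions are hypothetical.

```python
from typing import List


def kafka_package(spark_version: str, scala_version: str = "2.12") -> str:
    """Maven coordinate of the Kafka source matching the Spark version."""
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


def spark_submit_args(
    app: str, spark_version: str, master: str = "local[4]"
) -> List[str]:
    """Build the argument list for a spark-submit run with --packages."""
    return [
        "spark-submit",
        "--master", master,
        "--packages", kafka_package(spark_version),
        app,
    ]


# streaming_app.py is a placeholder name for the PySpark application.
args = spark_submit_args("streaming_app.py", "3.1.1")
print(" ".join(args))
```

Keeping the connector version tied to the Spark version in one place makes it harder for the two to drift apart after an upgrade.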

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Mich Talebzadeh
Hi G, Thanks for the heads-up. In a thread on 3rd of March I reported that 3.1.1 works in yarn mode: "Spark 3.1.1 Preliminary results (mainly to do with Spark Structured Streaming)" (mail-archive.com). From that mail: The needed ja

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Sean Owen
Gabor's point is that these are not libraries you typically install in your cluster itself. You package them with your app.

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Mich Talebzadeh
Fine. Just to clarify please. With SBT assembly and Scala I would create an uber jar file and use that one with spark-submit. As I understand (and stand corrected) with PySpark one can only run spark-submit in client mode by directly using a py file? So hence spark-submit --master local[4] --p

Spark performance over S3

2021-04-06 Thread Tzahi File
Hi All, We have a Spark cluster on AWS EC2 that has 60 x i3.4xlarge. The Spark job running on that cluster reads from an S3 bucket and writes to that bucket. The bucket and the EC2 instances run in the same region. As part of our efforts to reduce the runtime of our Spark jobs we found there's serious la

Re: Spark performance over S3

2021-04-06 Thread Gourav Sengupta
Hi Tzahi, that is a huge cost. So that I can understand the question before answering it: 1. what is the SPARK version that you are using? 2. what is the SQL code that you are using to read and write? There are several other questions that are pertinent, but the above will be a great starting poi

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Mich Talebzadeh
OK we found out the root cause of this issue. We were writing to Redis from Spark and downloaded a recently compiled version of the Redis jar built with Scala 2.12: spark-redis_2.12-2.4.1-SNAPSHOT-jar-with-dependencies.jar. It was giving grief. We removed that one. So the job runs with either spark-sql-ka

RE: Spark performance over S3

2021-04-06 Thread Boris Litvak
Hi Tzahi, I don’t know the reasons for that, though I’d check for fs.s3a implementation to be using multipart uploads, which I assume it does. I would say that none of the comments in the link are relevant to you, as the VPC endpoint is more of a security rather than performance feature. I got

Data Lakes using Spark

2021-04-06 Thread Boris Litvak
Hi Friends, I’d like to publish a document to Medium about data lakes using Spark. Its latter parts include info that is not widely known, unless you have experience with data lakes. https://github.com/borislitvak/datalake-article/blob/initial_comments/Building%20a%20Real%20Life%20Data%20Lake%20

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Gabor Somogyi
Good to hear it's working. Happy Spark usage. G