Late arriving updates to fact tables

2023-02-25 Thread rajat kumar
Hi Users, We are getting updates in a Kafka topic (through CDC). Can you please tell me how to correct/replay/reprocess the late-arriving records in the data lake? Thanks Rajat

Re: Profiling data quality with Spark

2022-12-28 Thread rajat kumar
v Sengupta >>>> >>>> On Tue, Dec 27, 2022 at 7:30 PM Mich Talebzadeh < >>>> mich.talebza...@gmail.com> wrote: >>>> >>>>> Well, you need to qualify your statement on data quality. Are you >>>>> talking about data l

Profiling data quality with Spark

2022-12-27 Thread rajat kumar
Hi Folks, hoping you are doing well. I want to implement data quality checks to detect issues in data in advance. I have heard about a few frameworks like GE/Deequ. Can anyone please suggest which one is good and how I can get started with it? Regards Rajat

Re: Kyro Serializer not getting set : Spark3

2022-09-23 Thread rajat kumar
ark = SparkSession.builder.config("spark.serializer", > "org.apache.spark.serializer.KryoSerializer").getOrCreate > > > On Fri, Sep 23, 2022 at 05:58, rajat kumar wrote: > >> Hello Users, >> >> While using below setting getting exception >> s

Kyro Serializer not getting set : Spark3

2022-09-22 Thread rajat kumar
Hello Users, While using the below setting, I am getting an exception: spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") User class threw exception: org.apache.spark.sql.AnalysisException: Cannot modify the value of a Spark config: spark.serializer at org.apache.spark.sql.error

Re: NoClassDefError and SparkSession should only be created and accessed on the driver.

2022-09-20 Thread rajat kumar
park running code? > > > > *From:* rajat kumar > *Date:* Tuesday, September 20, 2022 15:58 > *To:* user @spark > *Subject:* NoClassDefError and SparkSession should only be created and > accessed on the driver. > > Hello , > > I am using Spark3 where there are some UDFs along . I a

NoClassDefError and SparkSession should only be created and accessed on the driver.

2022-09-20 Thread rajat kumar
Hello , I am using Spark3 where there are some UDFs along . I am using Dataframe APIs to write parquet using spark. I am getting NoClassDefError along with below error. If I comment out all UDFs , it is working fine. Could someone suggest what could be wrong. It was working fine in Spark2.4 22/

Long running task in spark

2022-09-11 Thread rajat kumar
Hello Users, My 2 tasks are running forever. One of them gave a java heap space error. I have 10 Joins , all tables are big. I understand this is data skewness. Apart from changes at code level , any property which can be used in Spark Config? I am using Spark2 hence AQE can not be used. Thank

Data Type Issue while upgrading to Spark3

2022-09-02 Thread rajat kumar
Hello Users, Can someone suggest what could be causing the below error? java.lang.RuntimeException: Error while decoding: java.lang.NullPointerException: Null value appeared in non-nullable field: - array element class: "scala.Long" - root class: "scala.collection.Seq" If the schema is inferred from a S

Moving to Spark 3x from Spark2

2022-09-01 Thread rajat kumar
Hello Members, We want to move to Spark 3 from Spark2.4 . Are there any changes we need to do at code level which can break the existing code? Will it work by simply changing the version of spark & scala ? Regards Rajat

deciding Spark tasks & optimization resource

2022-08-29 Thread rajat kumar
Hello Members, I have a query for spark stages:- why every stage has a different number of tasks/partitions in spark. Or how is it determined? Moreover, where can i see the improvements done in spark3+ Thanks in advance Rajat

Re: Spark with GPU

2022-08-13 Thread rajat kumar
wrote: > Spark does not use GPUs itself, but tasks you run on Spark can. > The only 'support' there is is for requesting GPUs as resources for tasks, > so it's just a question of resource management. That's in OSS. > > On Sat, Aug 13, 2022 at 8:16 AM rajat kumar

Spark with GPU

2022-08-13 Thread rajat kumar
Hello, I have been hearing about GPU in spark3. For batch jobs , will it help to improve GPU performance. Also is GPU support available only on Databricks or on cloud based Spark clusters ? I am new , if anyone can share insight , it will help Thanks Rajat

Re: Dependencies issue in spark

2022-07-20 Thread rajat kumar
, executor 9: java.lang.NoClassDefFoundError (Could not initialize class com.raw.test$) [duplicate 12] On Wed, Jul 20, 2022 at 10:36 PM rajat kumar wrote: > I did not set it explicitly while running on cluster and other jobs are > also running fine , this conflict I have seen while readin

Dependencies issue in spark

2022-07-20 Thread rajat kumar
Hello, I am using Maven with Spark. After upgrading Scala from 2.11 to 2.12 I am getting the below error, and I have observed it occurring while reading Avro. Appreciate help. ShuffleMapStage 6 (save at Calling.scala:81) failed in 0.633 s due to Job aborted due to stage failure: Task 83 in stage 6.0 fail

Re: Issue while building spark project

2022-07-19 Thread rajat kumar
Thanks a lot Sean On Mon, Jul 18, 2022, 21:58 Sean Owen wrote: > Increase the stack size for the JVM when Maven / SBT run. The build sets > this but you may still need something like "-Xss4m" in your MAVEN_OPTS > > On Mon, Jul 18, 2022 at 11:18 AM rajat kumar > wrot

Issue while building spark project

2022-07-18 Thread rajat kumar
Hello , Can anyone pls help me in below error. It is a maven project. It is coming while building it [ERROR] error: java.lang.StackOverflowError [INFO] at scala.tools.nsc.typechecker.Typers$Typer.typedApply$1(Typers.scala:4885)

Spark job failing and not giving error to do diagnosis

2022-04-23 Thread rajat kumar
Hello All I am not getting anything in the logs and also history url is not opening. Has someone faced this issue? Application failed 1 times (global limit =5; local limit is =1) due to ApplicationMaster for attempt timed out. Failing the application. Thanks Rajat

Re: Executorlost failure

2022-04-07 Thread rajat kumar
correctly for big files larger > than memory by swapping them to disk. > > Thanks > > rajat kumar wrote: > > Tested this with executors of size 5 cores, 17GB memory. Data vol is > > really high around 1TB > > -

Re: Executorlost failure

2022-04-07 Thread rajat kumar
Tested this with executors of size 5 cores, 17GB memory. Data vol is really high around 1TB Thanks Rajat On Thu, Apr 7, 2022, 23:43 rajat kumar wrote: > Hello Users, > > I got following error, tried increasing executor memory and memory > overhead that also did not help . >

Executorlost failure

2022-04-07 Thread rajat kumar
Hello Users, I got following error, tried increasing executor memory and memory overhead that also did not help . ExecutorLost Failure(executor1 exited caused by one of the following tasks) Reason: container from a bad node: java.lang.OutOfMemoryError: enough memory for aggregation Can someone

Issue while creating spark app

2022-02-26 Thread rajat kumar
Hello Users, I am trying to create spark application using Scala(Intellij). I have installed Scala plugin in intelliJ still getting below error:- Cannot find project Scala library 2.12.12 for module SparkSimpleApp Could anyone please help what I am doing wrong? Thanks Rajat

SparkSQL vs Dataframe vs Dataset

2021-12-06 Thread rajat kumar
Hi Users, Is there any use case when we need to use SQL vs Dataframe vs Dataset? Is there any recommended approach or any advantage/performance gain over others? Thanks Rajat

Moving millions of file using spark

2021-06-16 Thread rajat kumar
Hello, I know this might not be a valid use case for Spark, but I have millions of files in a single folder. The file names follow a pattern, and based on the pattern I want to move each file to a different directory. Can you please suggest what can be done? Thanks rajat
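
As a rough illustration of the task described in this message, moving files into directories chosen by a pattern in the file name can be sketched without Spark at all. This is only a minimal sketch on a local filesystem: the naming scheme (`<prefix>_...`), the regular expression, and the function names `target_dir_for` and `move_by_pattern` are assumptions made for the example, not anything stated in the thread, and cloud object stores would need their own client calls instead of `shutil.move`.

```python
import os
import re
import shutil
from pathlib import Path

# Hypothetical naming scheme: "<prefix>_<rest>", e.g. "sales_20210616.csv".
# The real pattern is not given in the message, so this regex is an assumption.
PATTERN = re.compile(r"^(?P<prefix>[A-Za-z]+)_")


def target_dir_for(name, dest_root):
    """Return the destination directory for a file name, or None if it does not match."""
    match = PATTERN.match(name)
    if match is None:
        return None
    return Path(dest_root) / match.group("prefix")


def move_by_pattern(src_dir, dest_root):
    """Move every matching file in src_dir to dest_root/<prefix>/ and return the count moved."""
    # Collect plain file names first so the directory is not modified while it is scanned.
    with os.scandir(src_dir) as entries:
        names = [entry.name for entry in entries if entry.is_file()]

    moved = 0
    for name in names:
        target = target_dir_for(name, dest_root)
        if target is None:
            continue  # leave files that do not follow the pattern untouched
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(Path(src_dir) / name), str(target / name))
        moved += 1
    return moved
```

Calling `move_by_pattern("/data/in", "/data/out")` (both paths hypothetical) would place `sales_1.csv` under `/data/out/sales/`. Collecting only the names keeps memory use modest even for millions of entries, since each entry is just a short string.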

Re: Issue while calling foreach in Pyspark

2021-05-08 Thread rajat kumar
er yarn --deploy-mode client xyx.py >>> >>> What happens if you try running it in local mode? >>> >>> spark-submit --master local[2] xyx.py >>> >>> Is this run in a managed cluster like GCP dataproc? >>> >>> HTH >>> >

Re: Issue while calling foreach in Pyspark

2021-05-07 Thread rajat kumar
; >> >> >> *Disclaimer:* Use it at your own risk. Any and all responsibility for >> any loss, damage or destruction of data or any other property which may >> arise from relying on this email's technical content is explicitly >> disclaimed. The author will in no

Issue while calling foreach in Pyspark

2021-05-07 Thread rajat kumar
Hi Team, I am using Spark 2.4.4 with Python While using below line: dataframe.foreach(lambda record : process_logs(record)) My use case is , process logs will download the file from cloud storage using Python code and then it will save the processed data. I am getting the following error F

Yaml for google spark kubernetes configmap

2021-03-03 Thread rajat kumar
Hi Has anyone used kubernetes with spark for configmap. My spark job is not able to find configmap. Can someone pls share the yaml if u have used configmap for google k8s Thanks Rajat

Re: Thread spilling sort issue with single task

2021-01-26 Thread rajat kumar
skew techniques > to repartition your data properly or if you are in spark 3.0+ try the > skewJoin optimization. > > On Tue, 26 Jan 2021 at 11:20, rajat kumar > wrote: > >> Hi Everyone, >> >> I am running a spark application where I have applied 2 left joins. 1st >

Thread spilling sort issue with single task

2021-01-26 Thread rajat kumar
Hi Everyone, I am running a spark application where I have applied 2 left joins. 1st join in Broadcast and another one is normal. Out of 200 tasks , last 1 task is stuck . It is running at "ANY" Locality level. It seems data skewness issue. It is doing too much spill and shuffle write is too much.

Process each kafka record for structured streaming

2021-01-20 Thread rajat kumar
Hi, I want to apply custom logic for each row of data I am getting through kafka and want to do it with microbatch. When I am running it , it is not progressing. kafka_stream_df \ .writeStream \ .foreach(process_records) \ .outputMode("append") \ .option("checkpoi

Re: Running pyspark job from virtual environment

2021-01-17 Thread rajat kumar
damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Sun,

Re: Running pyspark job from virtual environment

2021-01-17 Thread rajat kumar
Hello, Can anyone confirm here please? Regards Rajat On Sat, Jan 16, 2021 at 11:46 PM rajat kumar wrote: > Hey Users, > > I want to run spark job from virtual environment using Python. > > Please note I am creating virtual env (using python3 -m venv env) > > I see that

Running pyspark job from virtual environment

2021-01-16 Thread rajat kumar
Hey Users, I want to run spark job from virtual environment using Python. Please note I am creating virtual env (using python3 -m venv env) I see that there are 3 variables for PYTHON which we have to set: PYTHONPATH PYSPARK_DRIVER_PYTHON PYSPARK_PYTHON I have 2 doubts: 1. If i want to use Virt

Kubernetes spark insufficient cpu error

2020-12-21 Thread rajat kumar
Hey All I am facing this error while running spark on kubernetes, can anyone suggest what can be corrected here? I am using minikube and spark 2.4 to run a spark submit with cluster mode. default-scheduler 0/1 nodes are available: 1 Insufficient cpu. Regards Rajat

Spark Streaming Job is stucked

2020-10-18 Thread rajat kumar
Hello Everyone, My Spark streaming job is running too slowly: it has a batch time of 15 seconds, and each batch completes in 20-22 secs. It was fine till the 1st week of October, but it has suddenly started behaving this way. I know changing the batch time can help , but other than that any idea what can be

Call Oracle Sequence using Spark

2019-08-15 Thread rajat kumar
Hi All, I have to call Oracle sequence using spark. Can you pls tell what is the way to do that? Thanks Rajat

write files of a specific size

2019-05-05 Thread rajat kumar
Hi All, My Spark SQL job produces output as per the default partitioning and creates N files. I want each file in the final result to be 100 MB in size. How can I do it? Thanks rajat
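
One common way to approach the goal in this message is to estimate the total output size and derive a partition count from it, then repartition to that count before writing. The arithmetic can be sketched as below. The function name `partitions_for_target_size` and the idea of supplying `total_bytes` from an external estimate are assumptions for illustration; the thread does not say how the size would be measured, and the on-disk size after compression usually differs from the in-memory size, so the result is only approximate.

```python
def partitions_for_target_size(total_bytes, target_file_bytes=100 * 1024 * 1024):
    """Return how many output partitions give files of roughly target_file_bytes each.

    total_bytes is an estimate of the final output size (an assumption supplied
    by the caller); the default target is 100 MiB.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Integer ceiling division; always write at least one file.
    return max(1, -(-total_bytes // target_file_bytes))


# Example: about 1 GiB of output at 100 MiB per file.
num_files = partitions_for_target_size(1024 * 1024 * 1024)
# In a Spark job this count would typically be passed to df.repartition(num_files)
# (or df.coalesce(num_files)) just before the write.
```

The count is rounded up so that no file exceeds the target by much; rounding down would instead produce fewer, larger files.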

common logging in spark

2019-05-01 Thread rajat kumar
Hi All, I have heard that log4j will not be able to work properly. I have been told to use a logger in Scala code. Are there any pointers for that? Thanks for help in advance rajat

handling skewness issues

2019-04-29 Thread rajat kumar
Hi All, How can we overcome skewness issues in Spark? I read that we can add some randomness to the key column before the join and remove that random part after the join. Is there any better way? The above method seems to be a workaround. Thanks rajat
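
The randomness-on-the-key approach mentioned in this message is usually called salting. The sketch below shows the idea on plain Python lists rather than Spark DataFrames, so it only demonstrates the logic: each row of the skewed side gets a random salt appended to its key, the other side is replicated once per salt value, the join runs on the (key, salt) pair, and the salt is dropped from the result. The function name `salted_join` and the parameters `num_salts` and `seed` are hypothetical names chosen for the example.

```python
import random
from collections import defaultdict


def salted_join(large, small, num_salts=4, seed=0):
    """Inner-join two lists of (key, value) pairs using key salting.

    large: the skewed side; each row is assigned one random salt.
    small: the other side; each row is replicated once per salt value.
    Returns a list of (key, large_value, small_value) with the salt removed.
    """
    rng = random.Random(seed)

    # Replicate the small side so every (key, salt) combination can find a match.
    small_index = defaultdict(list)
    for key, value in small:
        for salt in range(num_salts):
            small_index[(key, salt)].append(value)

    joined = []
    for key, value in large:
        # The random salt spreads rows of one hot key over num_salts join keys.
        salt = rng.randrange(num_salts)
        for other in small_index.get((key, salt), []):
            joined.append((key, value, other))  # salt is dropped here
    return joined


large_rows = [("a", i) for i in range(6)] + [("b", 99)]
small_rows = [("a", "A"), ("b", "B"), ("c", "C")]
result = salted_join(large_rows, small_rows)
```

The result is the same as an ordinary inner join on the key; the salt only changes how the work would be distributed. The cost is that the non-skewed side grows by a factor of `num_salts`, which is why the technique is normally applied only when one side is small enough to replicate.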

Re: repartition in df vs partitionBy in df

2019-04-24 Thread rajat kumar
hello, thanks for quick reply. got it . partitionBy is to create something like hive partitions. but when do we use repartition actually? how to decide whether to do repartition or not? because in development we are getting sample data. also what number should I give while repartition. thanks On

Re: repartition in df vs partitionBy in df

2019-04-24 Thread rajat kumar
Hi All, Can anyone explain? thanks rajat On Sun, 21 Apr 2019, 00:18 kumar.rajat20del wrote: > Hi Spark Users, > > repartition and partitionBy seem to be very similar in Df. > In which scenario do we use each one? > > As per my understanding repartition is a very expensive operation as it needs > a full shuffle then wh

Re: Spark job running for long time

2019-04-21 Thread rajat kumar
Hi Yeikel, I can not copy anything from the system. But I have seen explain output. It was doing sortMergeJoin for all tables. There are 10 tables , all of them doing left outer join. Out of 10 tables, 1 table is of 50MB and second table is of 200MB. Rest are big tables. Also the data is in Avr

Re: --jars vs --spark.executor.extraClassPath vs --spark.driver.extraClassPath

2019-04-20 Thread rajat kumar
Hi, Can anyone pls explain ? On Mon, 15 Apr 2019, 09:31 rajat kumar wrote: > Hi All, > > I came across different parameters in spark submit > > --jars , --spark.executor.extraClassPath , --spark.driver.extraClassPath > > What are the differences between them? When to use which one? Wil

Re: Spark job running for long time

2019-04-17 Thread rajat kumar
Hi, Thanks for the response! We are doing 12 left outer joins. Also I see GC is colored red in the Spark UI. It seems GC is also taking time. We have tried using Kryo serialization. We also tried giving more memory to the executors as well as the driver, but it didn't work. On Wed, 17 Apr 2019, 23:35 Yeikel W

Spark job running for long time

2019-04-17 Thread rajat kumar
Hi All, One of my containers is still running for long time. In logs it is showing "Thread 240 spilling sort data of 10.4 GB to disk". This is happening every minute. Thanks Rajat

--jars vs --spark.executor.extraClassPath vs --spark.driver.extraClassPath

2019-04-14 Thread rajat kumar
Hi All, I came across different parameters in spark submit --jars , --spark.executor.extraClassPath , --spark.driver.extraClassPath What are the differences between them? When to use which one? Will it differ if I use following: --master yarn --deploy-mode client --master yarn --deploy-mode clu

Re: spark rdd grouping

2015-12-01 Thread Rajat Kumar
t; http://blog.jaceklaskowski.pl >> Mastering Spark >> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/ >> Follow me at https://twitter.com/jaceklaskowski >> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski >> >> >> On Tue

spark rdd grouping

2015-11-30 Thread Rajat Kumar
Hi, I have a JavaPairRDD rdd1. I want to group rdd1 by key but preserve the partitions of the original RDD to avoid a shuffle, since I know all records with the same key are already in the same partition. The PairRDD is basically constructed using kafka streaming low level consumer which have all records with same key