Hi Users,
We are getting updates in a Kafka topic (through CDC). Can you please tell me
how to correct/replay/reprocess late-arriving records in the data lake?
Thanks
Rajat
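Not part of the original post, but a minimal sketch of one way to replay late-arriving CDC records, assuming the lake table is a Delta table keyed on order_id (both the format and the key column are assumptions; adapt to your storage layer and keys):

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Requires the delta-spark package and a Delta-enabled session.
spark = SparkSession.builder.appName("cdc-replay").getOrCreate()

# Late/corrected records read back from the Kafka-fed staging area (path is hypothetical).
late_df = spark.read.parquet("/lake/staging/late_records")

target = DeltaTable.forPath(spark, "/lake/orders")   # hypothetical target table
(target.alias("t")
 .merge(late_df.alias("s"), "t.order_id = s.order_id")
 .whenMatchedUpdateAll(condition="s.updated_at > t.updated_at")  # only replay newer versions
 .whenNotMatchedInsertAll()
 .execute())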
>>>>
>>>> On Tue, Dec 27, 2022 at 7:30 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Well, you need to qualify your statement on data quality. Are you
>>>>> talking about data l
Hi Folks
Hoping you are doing well. I want to implement data quality checks to detect
issues in the data in advance. I have heard about a few frameworks like GE/Deequ.
Can anyone please suggest which one is good and how do I get started with it?
Regards
Rajat
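Not from the thread, but a hedged sketch of getting started with Deequ from PySpark via the pydeequ wrapper; the column names and data are hypothetical, and the session must be started with the Deequ jars on the classpath:

import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult
from pyspark.sql import SparkSession

# Deequ runs on the JVM, so its jars must be on the session classpath.
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.createDataFrame([(1, 10.0), (2, 25.5), (3, None)], ["order_id", "amount"])

result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(Check(spark, CheckLevel.Error, "basic checks")
                    .isComplete("amount")       # no nulls
                    .isUnique("order_id")       # no duplicates
                    .isNonNegative("amount"))   # no negative values
          .run())

VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)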
spark = SparkSession.builder.config("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer").getOrCreate()
>
>
> rajat kumar wrote on Fri, Sep 23, 2022 at 05:58:
>
>> Hello Users,
>>
>> While using the below setting I am getting an exception
Hello Users,
While using the below setting I am getting an exception:
spark.conf.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
User class threw exception: org.apache.spark.sql.AnalysisException: Cannot
modify the value of a Spark config: spark.serializer at
org.apache.spark.sql.error
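A minimal sketch of the fix hinted at in the reply above: spark.serializer is a static config, so it has to be supplied when the SparkSession is built (or via --conf on spark-submit), not changed afterwards with spark.conf.set:

from pyspark.sql import SparkSession

# Static configs such as spark.serializer cannot be modified once the
# SparkContext exists; set them while building the session instead.
spark = (SparkSession.builder
         .appName("kryo-example")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())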
>
>
>
> *From:* rajat kumar
> *Date:* Tuesday, September 20, 2022 15:58
> *To:* user @spark
> *Subject:* NoClassDefError and SparkSession should only be created and
> accessed on the driver.
>
> Hello ,
>
> I am using Spark3 where there are some UDFs along . I a
Hello,
I am using Spark 3 with some UDFs. I am using the DataFrame API to write
Parquet with Spark. I am getting a NoClassDefError along with the below error.
If I comment out all the UDFs, it works fine.
Could someone suggest what could be wrong? It was working fine in Spark 2.4.
22/
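An editorial sketch, not from the thread: given the subject line above, a common trigger in Spark 3 is a UDF (or the object defining it) touching the SparkSession, which executors cannot initialize. A minimal sketch of the safe pattern, with hypothetical names and paths:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

# Keep the UDF body free of any reference to `spark` (or to objects whose
# static initialization needs it); only this small closure is shipped to executors.
def normalize(value):
    return value.strip().lower() if value is not None else None

normalize_udf = F.udf(normalize, StringType())

df = spark.createDataFrame([(" Foo ",), ("BAR",)], ["raw"])
df.withColumn("clean", normalize_udf("raw")).write.mode("overwrite").parquet("/tmp/out")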
Hello Users,
My 2 tasks are running forever, and one of them gave a java heap space error.
I have 10 joins and all the tables are big. I understand this is data skewness.
Apart from changes at the code level, is there any property which can be used
in the Spark config?
I am using Spark 2, hence AQE cannot be used.
Thanks
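Not from the thread, but a hedged sketch of the Spark 2 properties most often tuned for large, heavily joined data; the values are illustrative, not recommendations for this specific workload:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-tuning").getOrCreate()

# More (and therefore smaller) shuffle partitions spread heavy joins across more tasks.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

# Raise the broadcast threshold so genuinely small join sides are broadcast instead of shuffled.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))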
Hello Users
Can someone suggest what could be causing the below error?
java.lang.RuntimeException: Error while decoding:
java.lang.NullPointerException: Null value appeared in non-nullable field:
- array element class: "scala.Long"
- root class: "scala.collection.Seq"
If the schema is inferred from a S
Hello Members,
We want to move to Spark 3 from Spark 2.4.
Are there any changes we need to make at the code level which could break the
existing code?
Will it work by simply changing the versions of Spark and Scala?
Regards
Rajat
Hello Members,
I have a query about Spark stages:
why does every stage have a different number of tasks/partitions in Spark, and
how is it determined?
Moreover, where can I see the improvements made in Spark 3+?
Thanks in advance
Rajat
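Not from the thread, but a minimal illustration of why the task count differs per stage: a stage reading data gets one task per input partition/split, while a stage after a shuffle gets spark.sql.shuffle.partitions tasks (200 by default; AQE on Spark 3 may coalesce them):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

df = spark.range(0, 1_000_000, numPartitions=8)   # the scan stage runs 8 tasks
print(df.rdd.getNumPartitions())                  # 8

agg = df.groupBy((df.id % 10).alias("bucket")).count()   # groupBy forces a shuffle
print(agg.rdd.getNumPartitions())                 # spark.sql.shuffle.partitions (default 200)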
wrote:
> Spark does not use GPUs itself, but tasks you run on Spark can.
> The only 'support' there is is for requesting GPUs as resources for tasks,
> so it's just a question of resource management. That's in OSS.
>
> On Sat, Aug 13, 2022 at 8:16 AM rajat kumar
Hello,
I have been hearing about GPU support in Spark 3.
For batch jobs, will it help to improve performance? Also, is GPU
support available only on Databricks, or also on cloud-based Spark clusters?
I am new to this; if anyone can share insight, it will help.
Thanks
Rajat
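As an editorial addition, a hedged sketch of the OSS resource-scheduling configs referred to in the quoted reply above (Spark 3+); the discovery-script path is a placeholder, and a library such as RAPIDS is still needed for the tasks themselves to use the GPU:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gpu-resources")
         # one GPU per executor, shared by up to four concurrent tasks
         .config("spark.executor.resource.gpu.amount", "1")
         .config("spark.task.resource.gpu.amount", "0.25")
         # script that reports the GPU addresses available on a node (placeholder path)
         .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/scripts/getGpusResources.sh")
         .getOrCreate())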
, executor 9: java.lang.NoClassDefFoundError
(Could not initialize class com.raw.test$) [duplicate 12]
On Wed, Jul 20, 2022 at 10:36 PM rajat kumar
wrote:
> I did not set it explicitly while running on cluster and other jobs are
> also running fine , this conflict I have seen while readin
Hello, I am using Maven with Spark. After upgrading Scala from 2.11 to 2.12,
I am getting the below error, and I have observed it coming while reading Avro.
Appreciate the help.
ShuffleMapStage 6 (save at Calling.scala:81) failed in 0.633 s due to Job
aborted due to stage failure: Task 83 in stage 6.0 fail
Thanks a lot Sean
On Mon, Jul 18, 2022, 21:58 Sean Owen wrote:
> Increase the stack size for the JVM when Maven / SBT run. The build sets
> this but you may still need something like "-Xss4m" in your MAVEN_OPTS
>
> On Mon, Jul 18, 2022 at 11:18 AM rajat kumar
> wrot
Hello ,
Can anyone please help me with the below error? It is a Maven project, and the
error comes while building it.
[ERROR] error: java.lang.StackOverflowError
[INFO] at
scala.tools.nsc.typechecker.Typers$Typer.typedApply$1(Typers.scala:4885)
Hello All
I am not getting anything in the logs, and the history URL is not opening either.
Has someone faced this issue?
Application failed 1 times (global limit =5; local limit is =1) due to
ApplicationMaster for attempt timed out. Failing the application.
Thanks
Rajat
correctly for big files larger
> than memory by swapping them to disk.
>
> Thanks
>
> rajat kumar wrote:
> > Tested this with executors of size 5 cores, 17GB memory. Data vol is
> > really high around 1TB
>
> -
Tested this with executors of 5 cores and 17GB memory each. Data volume is
really high, around 1TB.
Thanks
Rajat
On Thu, Apr 7, 2022, 23:43 rajat kumar wrote:
> Hello Users,
>
> I got following error, tried increasing executor memory and memory
> overhead that also did not help .
>
Hello Users,
I got the following error. I tried increasing executor memory and memory
overhead, but that also did not help.
ExecutorLost Failure(executor1 exited caused by one of the following tasks)
Reason: container from a bad node:
java.lang.OutOfMemoryError: enough memory for aggregation
Can someone
Hello Users,
I am trying to create a Spark application using Scala (IntelliJ).
I have installed the Scala plugin in IntelliJ but am still getting the below error:
Cannot find project Scala library 2.12.12 for module SparkSimpleApp
Could anyone please help with what I am doing wrong?
Thanks
Rajat
Hi Users,
Is there any use case where we should use SQL vs. DataFrame vs. Dataset?
Is there a recommended approach, or any advantage/performance gain of one over
the others?
Thanks
Rajat
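Not from the thread, but a small illustration of the usual answer: SQL and the DataFrame API compile to the same plan, so performance is typically identical and the choice is mostly stylistic; typed Datasets exist only in Scala/Java:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sql-vs-df").getOrCreate()

df = spark.range(100).withColumnRenamed("id", "amount")
df.createOrReplaceTempView("orders")

sql_result = spark.sql("SELECT SUM(amount) AS total FROM orders WHERE amount > 10")
df_result = df.filter(F.col("amount") > 10).agg(F.sum("amount").alias("total"))

# Both explain() outputs show the same physical plan, hence the same performance.
sql_result.explain()
df_result.explain()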
Hello ,
I know this might not be a valid use case for Spark, but I have millions of
files in a single folder. The file names follow a pattern, and based on the
pattern I want to move them to different directories.
Can you please suggest what can be done?
Thanks
rajat
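An editorial sketch of one approach, not from the thread: for a mounted/POSIX-style path, plain Python with a thread pool usually moves files more simply than Spark; the paths and the pattern-to-directory rule below are hypothetical.

import os
import re
from concurrent.futures import ThreadPoolExecutor

SRC = "/data/landing"          # hypothetical source folder
DEST_ROOT = "/data/by_type"    # hypothetical destination root

def target_dir(name: str) -> str:
    # e.g. "sales_2024_01_part0001.csv" -> "/data/by_type/sales"
    match = re.match(r"([a-z]+)_", name)
    return os.path.join(DEST_ROOT, match.group(1)) if match else os.path.join(DEST_ROOT, "other")

def move(name: str) -> None:
    dst = target_dir(name)
    os.makedirs(dst, exist_ok=True)
    os.rename(os.path.join(SRC, name), os.path.join(dst, name))

# os.scandir streams directory entries instead of building one huge list in memory.
with ThreadPoolExecutor(max_workers=32) as pool:
    pool.map(move, (entry.name for entry in os.scandir(SRC) if entry.is_file()))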
spark-submit --master yarn --deploy-mode client xyx.py
>>>
>>> What happens if you try running it in local mode?
>>>
>>> spark-submit --master local[2] xyx.py
>>>
>>> Is this run in a managed cluster like GCP dataproc?
>>>
>>> HTH
>>>
>
Hi Team,
I am using Spark 2.4.4 with Python.
While using the below line:
dataframe.foreach(lambda record: process_logs(record))
My use case is: process_logs will download a file from cloud storage
using Python code and then save the processed data.
I am getting the following error
F
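The error above is cut off, so this may not be the fix, but a hedged sketch of the usual pattern for per-record external I/O: do it in foreachPartition so any cloud-storage client is created once per partition; process_logs below is a stand-in for the poster's logic.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreach-logs").getOrCreate()
dataframe = spark.range(10).withColumnRenamed("id", "log_id")

def process_logs(record):
    # stand-in for the real download-and-save logic
    print(f"processing {record.log_id}")

def process_partition(records):
    # create the cloud-storage client here, once per partition, then reuse it
    for record in records:
        process_logs(record)

dataframe.foreachPartition(process_partition)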
Hi
Has anyone used Kubernetes with Spark for a ConfigMap?
My Spark job is not able to find the ConfigMap.
Can someone please share the YAML if you have used a ConfigMap on Google K8s.
Thanks
Rajat
skew techniques
> to repartition your data properly or if you are in spark 3.0+ try the
> skewJoin optimization.
>
> On Tue, 26 Jan 2021 at 11:20, rajat kumar
> wrote:
>
>> Hi Everyone,
>>
>> I am running a spark application where I have applied 2 left joins. 1st
>
Hi Everyone,
I am running a Spark application where I have applied 2 left joins; the 1st
join is a broadcast join and the other one is a normal join.
Out of 200 tasks, the last 1 task is stuck. It is running at "ANY" locality
level. It seems to be a data skewness issue.
It is doing too much spill, and the shuffle write is too much.
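Not from the thread, but a hedged sketch of the key-salting technique mentioned in the reply above; the table and column names, and the salt factor, are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted-join").getOrCreate()

# Hypothetical data: `facts` is the large, skewed side; `dims` is the other side.
facts = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["join_key", "payload"])
dims = spark.createDataFrame([(1, "x"), (2, "y")], ["join_key", "attr"])

SALT = 16
# Spread each hot key on the skewed side across SALT sub-keys...
salted_facts = facts.withColumn("salt", (F.rand() * SALT).cast("long"))
# ...and replicate the other side once per salt value so every sub-key finds its match.
salted_dims = dims.crossJoin(spark.range(SALT).withColumnRenamed("id", "salt"))

joined = salted_facts.join(salted_dims, ["join_key", "salt"], "left").drop("salt")
joined.show()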
Hi,
I want to apply custom logic to each row of data I am getting through
Kafka, and I want to do it per micro-batch.
When I am running it, it is not progressing.
kafka_stream_df \
.writeStream \
.foreach(process_records) \
.outputMode("append") \
.option("checkpoi
Hello,
Can anyone confirm here please?
Regards
Rajat
On Sat, Jan 16, 2021 at 11:46 PM rajat kumar
wrote:
> Hey Users,
>
> I want to run spark job from virtual environment using Python.
>
> Please note I am creating virtual env (using python3 -m venv env)
>
> I see that
Hey Users,
I want to run a Spark job from a virtual environment using Python.
Please note I am creating the virtual env (using python3 -m venv env).
I see that there are 3 variables for Python which we have to set:
PYTHONPATH
PYSPARK_DRIVER_PYTHON
PYSPARK_PYTHON
I have 2 doubts:
1. If I want to use Virt
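Not from the thread, but a hedged sketch of one common way to point both the driver and the executors at the virtualenv's interpreter (the venv path is hypothetical, and the variables must be set before the SparkSession/SparkContext is created):

import os
from pyspark.sql import SparkSession

VENV_PYTHON = "/home/rajat/env/bin/python"         # hypothetical virtualenv interpreter

os.environ["PYSPARK_PYTHON"] = VENV_PYTHON         # interpreter used by executors
os.environ["PYSPARK_DRIVER_PYTHON"] = VENV_PYTHON  # interpreter used by the driver

spark = SparkSession.builder.appName("venv-job").getOrCreate()
print(spark.range(5).count())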
Hey All
I am facing this error while running Spark on Kubernetes; can anyone
suggest what can be corrected here?
I am using minikube and Spark 2.4 to run a spark-submit in cluster mode.
default-scheduler 0/1 nodes are available: 1 Insufficient cpu.
Regards
Rajat
Hello Everyone,
My Spark streaming job is running too slowly: it has a batch interval of 15
seconds, but each batch takes 20-22 seconds to complete. It was fine until the
1st week of October, but it is suddenly behaving this way. I know changing the
batch interval can help, but other than that, any idea what can be
Hi All,
I have to call an Oracle sequence using Spark. Can you please tell me the way
to do that?
Thanks
Rajat
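Not from the thread, but a hedged sketch of one way to do it through the JDBC source (Spark 2.4+); the URL, credentials, driver, and sequence name are placeholders:

from pyspark.sql import SparkSession

# The Oracle JDBC driver jar must be on the classpath (e.g. via --jars).
spark = SparkSession.builder.appName("oracle-seq").getOrCreate()

next_val = (spark.read.format("jdbc")
            .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1")
            .option("query", "SELECT my_seq.NEXTVAL AS seq_val FROM dual")
            .option("user", "app_user")
            .option("password", "****")
            .option("driver", "oracle.jdbc.OracleDriver")
            .load())

next_val.show()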
Hi All,
My Spark SQL job produces output as per the default partitioning and creates N
files.
I want each file in the final result to be about 100MB in size.
How can I do it?
thanks
rajat
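Not from the thread, but a hedged sketch of the usual approach: estimate the total output size, divide by the target file size, and repartition to that many partitions before writing; the paths and the size estimate are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sized-output").getOrCreate()
df = spark.read.parquet("/data/input")     # placeholder input

TARGET_FILE_BYTES = 100 * 1024 * 1024
estimated_output_bytes = 50 * 1024 ** 3    # placeholder estimate of total output size

num_files = max(1, estimated_output_bytes // TARGET_FILE_BYTES)
# One file is written per partition, so roughly num_files files of ~100 MB each
# (compression makes this approximate).
df.repartition(num_files).write.mode("overwrite").parquet("/data/output")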
Hi All,
I have heard that log4j will not be able to work properly, and I have been
told to use a logger in the Scala code.
Is there any pointer for that?
Thanks for help in advance
rajat
Hi All,
How can skewness issues be overcome in Spark?
I read that we can add some randomness to the key column before a join and
remove that random part after the join.
Is there any better way? The above method seems to be a workaround.
thanks
rajat
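Editorial note, not from the thread: one alternative to manual salting is the adaptive query engine on Spark 3+, which can split skewed partitions automatically. A minimal sketch of the relevant settings, if a Spark 3 runtime is available:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-skew").getOrCreate()

# AQE rewrites skewed sort-merge joins at runtime by splitting oversized partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")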
Hello,
Thanks for the quick reply.
Got it: partitionBy is to create something like Hive partitions.
But when do we actually use repartition, and how do we decide whether to
repartition or not? In development we only get sample data.
Also, what number should I give while repartitioning?
Thanks
On
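Not from the thread, but a small illustration of the distinction discussed here: repartition() reshuffles the in-memory partitions (its number is usually sized from real data volume, aiming at very roughly 100-200 MB per partition), while partitionBy() only controls the output directory layout on write:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-vs-partitionby").getOrCreate()

df = spark.createDataFrame(
    [("2024-01-01", 10), ("2024-01-02", 20)], ["event_date", "value"])

# repartition: a full shuffle into 8 in-memory partitions, co-locating rows by event_date.
df2 = df.repartition(8, "event_date")

# partitionBy: one output sub-directory per event_date value, like Hive partitions.
df2.write.mode("overwrite").partitionBy("event_date").parquet("/tmp/events")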
Hi All,
Can anyone explain?
thanks
rajat
On Sun, 21 Apr 2019, 00:18 kumar.rajat20del wrote:
> Hi Spark Users,
>
> repartition and partitionBy seem to be very much the same on a DataFrame.
> In which scenario do we use which one?
>
> As per my understanding, repartition is a very expensive operation as it needs
> a full shuffle, then wh
Hi Yeikel,
I cannot copy anything from the system,
but I have seen the explain output.
It was doing a sortMergeJoin for all tables.
There are 10 tables, all of them doing a left outer join.
Out of the 10 tables, 1 table is 50MB and a second table is 200MB; the rest are
big tables.
Also the data is in Avro.
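An editorial sketch, not from the thread: with only a 50MB and a 200MB small table, explicitly broadcasting them would turn those two sort-merge joins into broadcast hash joins; the names below are hypothetical, and the 200MB table must comfortably fit in memory to be broadcast.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast-hint").getOrCreate()

# Hypothetical stand-ins for the big fact table and the two small tables.
big_fact = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "payload"])
small_dim = spark.createDataFrame([(1, "x")], ["key", "attr1"])
medium_dim = spark.createDataFrame([(2, "y")], ["key", "attr2"])

# broadcast() hints replace sort-merge joins with broadcast hash joins for the
# small sides; only hint tables that comfortably fit in executor/driver memory.
result = (big_fact
          .join(F.broadcast(small_dim), "key", "left_outer")
          .join(F.broadcast(medium_dim), "key", "left_outer"))
result.explain()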
Hi,
Can anyone please explain?
On Mon, 15 Apr 2019, 09:31 rajat kumar wrote:
> Hi All,
>
> I came across different parameters in spark submit
>
> --jars , --spark.executor.extraClassPath , --spark.driver.extraClassPath
>
> What are the differences between them? When to use which one? Wil
Hi,
Thanks for the response!
We are doing 12 left outer joins. Also, I see GC is colored red in the Spark
UI; it seems GC is also taking time.
We have tried using Kryo serialization and tried giving more memory to the
executor as well as the driver, but it didn't work.
On Wed, 17 Apr 2019, 23:35 Yeikel W
Hi All,
One of my containers has been running for a long time.
In the logs it is showing "Thread 240 spilling sort data of 10.4 GB to disk".
This is happening every minute.
Thanks
Rajat
Hi All,
I came across different parameters in spark-submit:
--jars , --spark.executor.extraClassPath , --spark.driver.extraClassPath
What are the differences between them? When should I use which one? Will it
differ if I use the following:
--master yarn --deploy-mode client
--master yarn --deploy-mode cluster
Hi
I have a JavaPairRDD rdd1. I want to group rdd1 by keys but
preserve the partitioning of the original RDD to avoid a shuffle, since I know
all records with the same key are already in the same partition.
The PairRDD is basically constructed using the Kafka streaming low-level consumer,
which has all records with the same key
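Not from the thread, and in PySpark rather than the Java API, but a hedged sketch of one way to group without a shuffle when same-key records already share a partition: do the grouping inside mapPartitions and keep the partitioning flag set.

from collections import defaultdict
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("group-without-shuffle").getOrCreate()
sc = spark.sparkContext

rdd1 = sc.parallelize([("k1", 1), ("k1", 2), ("k2", 3)], numSlices=2)

def group_within_partition(records):
    # Group records locally; correct only because each key lives in one partition.
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return iter(groups.items())

grouped = rdd1.mapPartitions(group_within_partition, preservesPartitioning=True)
print(grouped.collect())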