Hi Folks,
I have created a UDF that queries a Confluent schema registry for a schema, which is
then used within a Dataset select with the from_avro function to decode an
Avro-encoded value (reading from a bunch of Kafka topics):
Dataset recordDF = df.select(
callUDF("getjsonSchemaUDF",col(
I have seen many jobs where Spark re-uses shuffle files (and skips a stage
of a job), which is an awesome feature given how expensive shuffles are,
and I generally now assume this will happen.
However, I feel like I am going a little crazy today. I did the simplest
test in Spark 3.3.0; basically I
Spark can reuse shuffle stages within the same job (action), not across jobs.
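To illustrate the distinction, a made-up sketch (not from the thread): within one
action, identical shuffle subtrees can be shared (visible as a ReusedExchange node or
a skipped stage in the UI), whereas a second action re-plans the query and runs the
shuffle again unless the data was cached:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df  = spark.range(1000000L).select(($"id" % 100).as("key"), $"id".as("value"))
val agg = df.groupBy("key").sum("value")             // needs a shuffle on "key"

// Same action: both join inputs contain the identical shuffle subtree, so the
// planner can reuse one exchange for both (ReusedExchange / skipped stage).
agg.join(agg.withColumnRenamed("sum(value)", "v2"), "key").count()

// Separate action: a new job re-plans and re-executes the shuffle, unless the
// intermediate result was cached beforehand.
agg.count()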
From: Koert Kuipers
Sent: Saturday, July 16, 2022 6:43 PM
To: user
Subject: [EXTERNAL] spark re-use shuffle files not happening
OK, thanks. I guess I am simply misremembering that I saw the shuffle files
getting re-used across jobs (actions); it was probably across stages within
the same job.
In structured streaming this is a pretty big deal: if you join a streaming
dataframe with a large static dataframe, each micro-batch is a separate job,
so the shuffle of the static side gets redone for every micro-batch.
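One common mitigation, sketched below under the assumption that the static side fits
in memory (paths, brokers and column names are placeholders), is to broadcast it so
that side of the join is not shuffled again in every micro-batch; if it is too big to
broadcast, persisting it at least avoids re-reading the source each time:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().getOrCreate()

val staticDF = spark.read.parquet("/path/to/static-table")    // placeholder path

val streamDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")           // placeholder
  .option("subscribe", "events")                               // placeholder
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS payload")

// Broadcasting the static side removes the shuffle on that side of the join,
// so it is not repeated in every micro-batch (assumes staticDF has a "key"
// column and is small enough to fit in memory).
val joined = streamDF.join(broadcast(staticDF), Seq("key"))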
Hi Shay,
Thanks for your reply! I would very much like to use PySpark. However, my
project depends on GraphX, which is only available in the Scala API as far
as I know, so I'm locked into Scala and trying to find a way out. I wonder
if there's a way around it.
Best regards,
Yuhao Zhang
Use GraphFrames?
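For reference, a toy GraphFrames sketch (GraphFrames is a separate package built on
DataFrames and also ships a Python API, which is what makes it relevant as a GraphX
alternative here); the data is made up:

import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Toy graph: the vertex DataFrame needs an "id" column, edges need "src" and "dst".
val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol")).toDF("id", "name")
val edges    = Seq(("a", "b"), ("b", "c"), ("c", "a")).toDF("src", "dst")

val g = GraphFrame(vertices, edges)

// Algorithms such as PageRank are available from the DataFrame-based API
// (the equivalent calls exist in the Python package).
val ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.show()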
On Sat, Jul 16, 2022 at 3:54 PM Yuhao Zhang wrote:
Other alternatives are to look at how PythonRDD does it in Spark. You could
also go for a more traditional setup where you expose your Python functions
behind a local/remote service and call that from Scala, say over
Thrift/gRPC/HTTP/a local socket, etc.
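As a rough sketch of the "functions behind a service" route, assuming a Python process
already serves the function over plain HTTP on localhost (the port and endpoint are
made up), the Scala side only needs a small client, typically wrapped in a UDF or used
inside mapPartitions:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// One client per JVM/executor is enough; it is thread-safe.
val client = HttpClient.newHttpClient()

def callPythonService(payload: String): String = {
  val request = HttpRequest.newBuilder(URI.create("http://localhost:8000/score")) // made-up endpoint
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(payload))
    .build()
  client.send(request, HttpResponse.BodyHandlers.ofString()).body()
}

println(callPythonService("""{"x": 1}"""))   // example call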
Another option, but I've never done it
I'm curious about using shared memory to speed up the JVM->Python round
trip. Is there any sane way to do anonymous shared memory in Java/Scala?
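The JDK has no direct anonymous-shared-memory API, but the usual approximation (a
sketch, not something discussed in this thread) is a memory-mapped file on tmpfs,
e.g. /dev/shm on Linux, which both the JVM and a Python process can map:

import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

// Map a file on tmpfs; /dev/shm is Linux-specific and the path is made up.
val path = Paths.get("/dev/shm/spark-py-exchange")
val channel = FileChannel.open(
  path,
  StandardOpenOption.CREATE,
  StandardOpenOption.READ,
  StandardOpenOption.WRITE)

val size   = 64L * 1024 * 1024                       // 64 MiB region
val buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, size)

// Writes land in the shared mapping; a Python process can mmap the same file
// and read the bytes without copying them through a socket. Synchronisation
// and framing are left to the application.
buffer.put("hello from the JVM".getBytes("UTF-8"))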
On Sat, Jul 16, 2022 at 16:10 Sebastian Piu wrote:
Hi,
I am working on a project for which I have provisioned 40 executors with
14 GB of memory per executor. I am trying to optimize my Spark job so that
Spark distributes the work evenly across the executors.
Could you please give me some advice?
Kind regards,
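Without knowing more about the job, here is a sketch of the knobs that usually matter
for spreading work evenly; only the 40 executors and 14 GB come from the message, the
cores-per-executor value, path and column name are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// 40 executors and 14g are from the message; 4 cores per executor is assumed.
val spark = SparkSession.builder()
  .config("spark.executor.instances", "40")
  .config("spark.executor.memory", "14g")
  .config("spark.executor.cores", "4")                 // assumption
  .config("spark.sql.shuffle.partitions", "320")       // ~2 tasks per core (40 x 4 x 2)
  .getOrCreate()

// Repartitioning on a well-distributed key helps when the input is skewed or
// arrives in too few partitions to keep all executors busy (names are made up).
val df       = spark.read.parquet("/path/to/input")
val balanced = df.repartition(320, col("customer_id"))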