I’ve been thinking about this quite a bit today, and about what an
implementation on the Spark side would look like.
After some deliberation I concluded:
We should instead have an `onQueryTriggerStart` callback that is invoked every
time a MicroBatch is triggered.
This should of course be disabled by d
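
To make the proposal concrete, here is a minimal sketch in plain Python of a
trigger-start callback that stays silent unless explicitly enabled. It does not
use Spark: `QueryTriggerStartEvent`, `TriggerEventBus`, and the `enabled` flag
are hypothetical names chosen for illustration, and `onQueryTriggerStart` is the
callback proposed above, not an existing listener method.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class QueryTriggerStartEvent:
    """Hypothetical event payload; not part of Spark's API."""
    query_id: str
    batch_id: int


@dataclass
class TriggerEventBus:
    """Toy stand-in for the listener bus of a streaming query."""
    # Assumption: the event is off unless the user opts in.
    enabled: bool = False
    listeners: List[Callable[[QueryTriggerStartEvent], None]] = field(
        default_factory=list
    )

    def add_listener(self, listener: Callable[[QueryTriggerStartEvent], None]) -> None:
        self.listeners.append(listener)

    def on_micro_batch_triggered(self, query_id: str, batch_id: int) -> None:
        """Called once per micro-batch trigger; publishes only when enabled."""
        if not self.enabled:
            return
        event = QueryTriggerStartEvent(query_id, batch_id)
        for listener in self.listeners:
            listener(event)


if __name__ == "__main__":
    bus = TriggerEventBus(enabled=True)
    bus.add_listener(lambda e: print(f"trigger start: {e.query_id} batch {e.batch_id}"))
    bus.on_micro_batch_triggered("query-1", 0)
```

Keeping the check inside the bus means listeners never see the event when the
flag is off, so existing listeners would be unaffected by the new callback.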
One issue I've seen is that after about 24 hours, the SparkApplication job
pods seem to be getting evicted. I've installed the Spark History Server
and am verifying this.
It could be due to resource constraints; I'm checking.
Please note: the Kubeflow Spark Operator is installed in namespace so35.
Thanks, Megh!
I did some research and realized the same: PVC is not a good option for
Spark shuffle, primarily because of latency.
The same is the case with S3 or MinIO.
I've implemented option 2 (storing data in a host path) and am currently
testing it.
Regards,
Karan Alang
Hello Karan,
Apart from Celeborn, there is Apache Uniffle (Incubating) as well. We also
have a setup similar to yours, and we're trying out a PoC with Uniffle right
now.
What I've gathered so far is, with Uniffle:
1. Storing data in PVCs is not well supported
2. Storing data in host path is possible
Hello,
I'm trying to run a simple Python client against a Spark Connect server
running in Kubernetes as a proof of concept. The client writes a couple
of records to a local Iceberg table. The Iceberg runtime is provisioned
using the "--packages" argument to "start-connect-server.sh" and I see