The driver itself is probably another topic; perhaps I'll make a "faster Spark start time" JIRA and a separate dynamic allocation (DA) JIRA, and we can explore both.
On Tue, Aug 8, 2023 at 10:07 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> From my own perspective, faster execution time, especially with Spark on tin boxes (Dataproc & EC2) and Spark on k8s, is something that customers often bring up.
>
> Poor onboarding time with autoscaling seems to be singled out in particular for heavy ETL jobs that use Spark. I am disappointed to see the poor performance of Spark on k8s Autopilot, with long timelines for starting the driver itself and for moving from the Pending to the Running phase (Spark 3.4.1 with Java 11).
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London, United Kingdom
>
> view my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
>
> On Tue, 8 Aug 2023 at 15:49, kalyan <justfors...@gmail.com> wrote:
>
>> +1 to enhancements in DEA. Long overdue!
>>
>> There are a few things I have been thinking about along the same lines for some time now (a few overlap with @holden's points):
>> 1. How to reduce wastage on the RM side? Sometimes the driver asks for some units of resources, but by the time the RM provisions them, the driver has cancelled the request.
>> 2. How to make the resource available when it is needed.
>> 3. Cost vs. app runtime: a good DEA algorithm should allow the developer to choose between cost and runtime. Sometimes developers may be happy to pay a higher cost for faster execution.
>> 4. Stitch resource profile choices into query execution.
>> 5. Allow a different DEA algorithm to be chosen for different queries within the same Spark application.
>> 6. Fall back to the default algorithm when things go haywire!
>>
>> Model-based learning would be awesome. These can be fine-tuned with tools like Sparklens.
>>
>> I am aware of a few experiments carried out in this area by friends working in this domain. One lesson we learned is that it is hard to have a generic algorithm that works for all cases.
>>
>> Regards,
>> kalyan.
>>
>>
>> On Tue, Aug 8, 2023 at 6:12 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Thanks for pointing out this feature to me. I will have a look when I get there.
>>>
>>> Mich Talebzadeh
>>>
>>>
>>> On Tue, 8 Aug 2023 at 11:44, roryqi(齐赫) <ror...@tencent.com> wrote:
>>>
>>>> Spark 3.5 has added a method `supportsReliableStorage` to `ShuffleDriverComponents`, which indicates whether shuffle data is written to a distributed filesystem or persisted in a remote shuffle service.
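For anyone following along, here is a very rough Scala sketch of the driver-side piece of such a plugin, based on my reading of the 3.5 API; the class name is invented, and in practice this would be wired in through a full ShuffleDataIO implementation registered via spark.shuffle.sort.io.plugin.class, which is omitted here:

    import java.util.{Collections, Map => JMap}

    import org.apache.spark.shuffle.api.ShuffleDriverComponents

    // Hypothetical driver-side components for a remote shuffle service.
    class RemoteShuffleDriverComponents extends ShuffleDriverComponents {

      // Extra properties to ship to executors; nothing needed in this sketch.
      override def initializeApplication(): JMap[String, String] =
        Collections.emptyMap()

      // Would ask the remote service to drop this application's shuffle data.
      override def cleanupApplication(): Unit = ()

      // The new Spark 3.5 hook: declares that shuffle data survives executor
      // loss, so dynamic allocation and decommissioning can be less conservative.
      override def supportsReliableStorage(): Boolean = true
    }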
>>>> Uniffle is a general-purpose remote shuffle service (https://github.com/apache/incubator-uniffle). It can enhance the experience of Spark on K8S. Once Spark 3.5 is released, Uniffle will support `ShuffleDriverComponents`; see [1].
>>>>
>>>> If you are interested in more details about Uniffle, see [2].
>>>>
>>>> [1] https://github.com/apache/incubator-uniffle/issues/802
>>>> [2] https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
>>>>
>>>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> Date: Tuesday, 8 August 2023, 06:53
>>>> Cc: dev <dev@spark.apache.org>
>>>> Subject: [Internet] Re: Improving Dynamic Allocation Logic for Spark 4+
>>>>
>>>> On the subject of dynamic allocation, is the following message a cause for concern when running Spark on k8s?
>>>>
>>>> INFO ExecutorAllocationManager: Dynamic allocation is enabled without a shuffle service.
>>>>
>>>> Mich Talebzadeh
>>>>
>>>>
>>>> On Mon, 7 Aug 2023 at 23:42, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> From what I have seen, Spark on a serverless cluster has a hard time getting the driver going in a timely manner:
>>>>
>>>> Annotations: autopilot.gke.io/resource-adjustment:
>>>> {"input":{"containers":[{"limits":{"memory":"1433Mi"},"requests":{"cpu":"1","memory":"1433Mi"},"name":"spark-kubernetes-driver"}]},"output...
>>>> autopilot.gke.io/warden-version: 2.7.41
>>>>
>>>> This is on Spark 3.4.1 with Java 11, both on the host running spark-submit and in the Docker image itself.
>>>>
>>>> I am not sure how relevant this is to this discussion, but it looks like a blocker for now. What config params can help here, and what can be done?
>>>>
>>>> Thanks
>>>>
>>>> Mich Talebzadeh
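On both of those points, here is a minimal sketch of the configs I would reach for, assuming Spark 3.4.x on k8s; the values are placeholders rather than recommendations. As far as I know the ExecutorAllocationManager message above is just informational: without an external shuffle service, shuffle tracking is what lets dynamic allocation work at all.

    import org.apache.spark.sql.SparkSession

    // Dynamic allocation on k8s without an external shuffle service: shuffle
    // tracking keeps executors that still hold shuffle files alive, which is
    // why the INFO message above is logged in this mode.
    val spark = SparkSession.builder()
      .appName("dea-on-k8s-sketch")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      // Bound how long idle executors are kept purely for their shuffle data.
      .config("spark.dynamicAllocation.shuffleTracking.timeout", "30min")
      .config("spark.dynamicAllocation.minExecutors", "1")
      .config("spark.dynamicAllocation.maxExecutors", "20")
      .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
      .getOrCreate()

    // Driver sizing (spark.driver.memory, spark.kubernetes.driver.request.cores,
    // spark.kubernetes.driver.limit.cores) has to go on spark-submit instead:
    // by the time this code runs the driver pod already exists, and admission is
    // where Autopilot rewrites the requests shown in the annotation above.

Whether explicit driver requests actually improve the Pending-to-Running time on Autopilot is something I have not measured, so treat that part as a guess.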
>>>> On Mon, 7 Aug 2023 at 22:39, Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>
>>>> Oh great point
>>>>
>>>> On Mon, Aug 7, 2023 at 2:23 PM bo yang <bobyan...@gmail.com> wrote:
>>>>
>>>> Thanks Holden for bringing this up!
>>>>
>>>> Maybe another thing to think about is how to make dynamic allocation more friendly with Kubernetes and disaggregated shuffle storage?
>>>>
>>>>
>>>> On Mon, Aug 7, 2023 at 1:27 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>
>>>> So I'm wondering if there is interest in revisiting some of how Spark does its dynamic allocation for Spark 4+?
>>>>
>>>> Some things that I've been thinking about:
>>>>
>>>> - Advisory user input (e.g. a way to say "after X is done I know I need Y", where Y might be a bunch of GPU machines) (see the stage-level scheduling sketch at the end of this mail)
>>>> - Configurable tolerance (e.g. if we are at most Z% over target, no-op)
>>>> - Past runs of the same job (e.g. stage X of job Y had a peak of K)
>>>> - Faster executor launches (I'm a little fuzzy on what we can do here, but one area for example is that we set up and tear down an RPC connection to the driver with a blocking call, which does seem to have some locking inside the driver at first glance)
>>>>
>>>> Is this an area other folks are thinking about? Should I make an epic we can track ideas in? Or are folks generally happy with today's dynamic allocation (or just busy with other things)?

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
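P.S. On the "advisory user input" point above, a rough sketch of how far the existing stage-level scheduling API (Spark 3.1+) already gets us. The resource numbers, discovery-script path, and the toy RDD are all made up; the point is only that a job can already declare "from this stage on I need GPU executors":

    import org.apache.spark.SparkContext
    import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

    val sc = SparkContext.getOrCreate()

    // Executor-side asks for the GPU stage (hypothetical sizes and script path).
    val execReqs = new ExecutorResourceRequests()
    execReqs.cores(4).memory("16g")
      .resource("gpu", 1, "/opt/spark/scripts/getGpusResources.sh")

    // Task-side asks: one CPU core and one GPU per task.
    val taskReqs = new TaskResourceRequests()
    taskReqs.cpus(1).resource("gpu", 1)

    val rpb = new ResourceProfileBuilder()
    rpb.require(execReqs)
    rpb.require(taskReqs)
    val gpuProfile = rpb.build

    // Toy stand-in for the real pipeline: only stages computed from the RDD
    // below request GPU executors; earlier stages keep the default profile.
    val scored = sc.parallelize(1 to 1000)
      .withResources(gpuProfile)
      .map(x => x * 2 /* imagine GPU scoring here */)
    scored.count()

If I remember right this only kicks in with dynamic allocation enabled on k8s/YARN (3.4 added a limited mode without it), and it is per-stage rather than "after X is done", so the advisory-input idea would still go beyond it.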