Hi Dev community, Just bumping to see if there are more reviews to evaluate this idea of adding auto-scaling to structured streaming.
Thanks again, Pavan On Wed, Aug 23, 2023 at 2:49 PM Pavan Kotikalapudi <pkotikalap...@twilio.com> wrote: > Thanks for the review Mich. > > I have updated the Q4 with as concise information as possible and left the > detailed explanation to Appendix. > > here is the updated answer to the Q4 > <https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit#heading=h.xe0x4i9gc1dg> > > Thank you, > > Pavan > > On Wed, Aug 23, 2023 at 2:46 AM Mich Talebzadeh <mich.talebza...@gmail.com> > wrote: > >> Hi Pavan, >> >> I started reading your SPIP but have difficulty understanding it in >> detail. >> >> Specifically under Q4, " What is new in your approach and why do you >> think it will be successful?", I believe it would be better to remove the >> plots and focus on "what this proposed solution is going to add to the >> current play". At this stage a concise briefing would be appreciated and >> the specific plots should be left to the Appendix. >> >> HTH >> >> >> Mich Talebzadeh, >> Distinguished Technologist, Solutions Architect & Engineer >> London >> United Kingdom >> >> >> view my Linkedin profile >> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!Z1-Qlb9LL5r97D1tGWz_pKDVDYm-S_n99e_jhraM5XA4B058OHmw47z_FmbEVHeXdgLqEkkvS4W88hGkTBlB4wSpQtgviw$> >> >> >> https://en.everybodywiki.com/Mich_Talebzadeh >> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!Z1-Qlb9LL5r97D1tGWz_pKDVDYm-S_n99e_jhraM5XA4B058OHmw47z_FmbEVHeXdgLqEkkvS4W88hGkTBlB4wR3SukiIw$> >> >> >> >> *Disclaimer:* Use it at your own risk. Any and all responsibility for >> any loss, damage or destruction of data or any other property which may >> arise from relying on this email's technical content is explicitly >> disclaimed. The author will in no case be liable for any monetary damages >> arising from such loss, damage or destruction. >> >> >> >> >> On Sun, 20 Aug 2023 at 07:40, Pavan Kotikalapudi < >> pkotikalap...@twilio.com> wrote: >> >>> IMO ML might be good for cluster scheduler but for the core DRA >>> algorithm of SSS I believe we should start with some primitives of >>> Structured streaming. I would love to get some reviews on the doc and >>> opinions on the feasibility of the solution. >>> >>> We have seen quite some savings using this solution in our team, Would >>> like to listen to the dev community to see if they are looking >>> for/interested in DRA for structured streaming. >>> >>> On Mon, Aug 14, 2023 at 9:12 AM Mich Talebzadeh < >>> mich.talebza...@gmail.com> wrote: >>> >>>> Thank you for your comments. >>>> >>>> My vision of integrating machine learning (ML) into Spark Structured >>>> Streaming (SSS) for capacity planning and performance optimization seems to >>>> be promising. By leveraging ML techniques, I believe that we can >>>> potentially create predictive models that enhance the efficiency and >>>> resource allocation of the data processing pipelines. Here are some >>>> potential benefits and considerations for adding ML to SSS for capacity >>>> planning. However, I stand corrected >>>> >>>> 1. >>>> >>>> *Predictive Capacity Planning:* ML models can analyze historical >>>> data (that we discussed already), workloads, and trends to predict >>>> future >>>> resource needs accurately. This enables proactive scaling and >>>> allocation of >>>> resources, ensuring optimal performance during high-demand periods, >>>> such as >>>> times of high trades. >>>> 2. >>>> >>>> *Real-time Decision Making: *ML can be used to make real-time >>>> decisions on resource allocation (software and cluster) based on current >>>> data and conditions, allowing for dynamic adjustments to meet the >>>> processing demands. >>>> 3. >>>> >>>> *Complex Data Analysis: *In a heterogeneous setup involving >>>> multiple databases, ML can analyze various factors like data read and >>>> write >>>> times from different databases, data volumes, and data distribution >>>> patterns to optimize the overall data processing flow. >>>> 4. >>>> >>>> *Anomaly Detection: *ML models can identify unusual patterns or >>>> performance deviations, alerting us to potential issues before they >>>> impact >>>> the system. >>>> 5. >>>> >>>> Integration with Monitoring: ML models can work alongside >>>> monitoring tools, gathering real-time data on various performance >>>> metrics, >>>> and using this data for making intelligent decisions on capacity and >>>> resource allocation. >>>> >>>> However, there are some important considerations to keep in mind: >>>> >>>> 1. >>>> >>>> *Model Training: *ML models require training and validation using >>>> relevant data. Our DS colleagues need to define appropriate features, >>>> select the right ML algorithms, and fine-tune the model parameters to >>>> achieve optimal performance. >>>> 2. >>>> >>>> *Complexity:* Integrating ML adds complexity to our architecture. >>>> Moreover, we need to have the necessary expertise in both Spark >>>> Structured >>>> Streaming and machine learning to design, implement, and maintain the >>>> system effectively. >>>> 3. >>>> >>>> *Resource Overhead: *ML algorithms can be resource-intensive. We >>>> ought to consider the additional computational requirements, especially >>>> during the model training and inference phases. >>>> 4. >>>> >>>> In summary, this idea of utilizing ML for capacity planning in >>>> Spark Structured Streaming can possibly hold significant potential for >>>> improving system performance and resource utilization. Having said >>>> that, I >>>> totally agree that we need to evaluate the feasibility, potential >>>> benefits, >>>> and challenges and we will need involving experts in both Spark and >>>> machine >>>> learning to ensure a successful outcome. >>>> >>>> HTH >>>> >>>> Mich Talebzadeh, >>>> Solutions Architect/Engineering Lead >>>> London >>>> United Kingdom >>>> >>>> >>>> view my Linkedin profile >>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-bP2VmxTg$> >>>> >>>> >>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-as0BFUVQ$> >>>> >>>> >>>> >>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>>> any loss, damage or destruction of data or any other property which may >>>> arise from relying on this email's technical content is explicitly >>>> disclaimed. The author will in no case be liable for any monetary damages >>>> arising from such loss, damage or destruction. >>>> >>>> >>>> >>>> >>>> On Mon, 14 Aug 2023 at 14:58, Martin Andersson < >>>> martin.anders...@kambi.com> wrote: >>>> >>>>> IMO, using any kind of machine learning or AI for DRA is overkill. The >>>>> effort involved would be considerable and likely counterproductive, >>>>> compared to a more conventional approach of comparing the rate of incoming >>>>> stream data with the effort of handling previous data rates. >>>>> ------------------------------ >>>>> *From:* Mich Talebzadeh <mich.talebza...@gmail.com> >>>>> *Sent:* Tuesday, August 8, 2023 19:59 >>>>> *To:* Pavan Kotikalapudi <pkotikalap...@twilio.com> >>>>> *Cc:* dev@spark.apache.org <dev@spark.apache.org> >>>>> *Subject:* Re: Dynamic resource allocation for structured streaming >>>>> [SPARK-24815] >>>>> >>>>> >>>>> EXTERNAL SENDER. Do not click links or open attachments unless you >>>>> recognize the sender and know the content is safe. DO NOT provide your >>>>> username or password. >>>>> >>>>> I am currently contemplating and sharing my thoughts openly. >>>>> Considering our reliance on previously collected statistics (as mentioned >>>>> earlier), it raises the question of why we couldn't integrate certain >>>>> machine learning elements into Spark Structured Streaming? While this >>>>> might >>>>> slightly deviate from our current topic, I am not an expert in machine >>>>> learning. However, there are individuals who possess the expertise to >>>>> assist us in exploring this avenue. >>>>> >>>>> HTH >>>>> >>>>> Mich Talebzadeh, >>>>> Solutions Architect/Engineering Lead >>>>> London >>>>> United Kingdom >>>>> >>>>> >>>>> view my Linkedin profile >>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-bP2VmxTg$> >>>>> >>>>> >>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-as0BFUVQ$> >>>>> >>>>> >>>>> >>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>>>> any loss, damage or destruction of data or any other property which may >>>>> arise from relying on this email's technical content is explicitly >>>>> disclaimed. The author will in no case be liable for any monetary damages >>>>> arising from such loss, damage or destruction. >>>>> >>>>> >>>>> >>>>> >>>>> On Tue, 8 Aug 2023 at 18:01, Pavan Kotikalapudi < >>>>> pkotikalap...@twilio.com> wrote: >>>>> >>>>> Listeners are the best resources to the allocation manager afaik... >>>>> It already has SparkListener >>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala*L640__;Iw!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-YRkCAu0w$> >>>>> that >>>>> it utilizes. We can use it to extract more information (like processing >>>>> times). >>>>> The one with more information regarding streaming query resides in sql >>>>> module >>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-Y_DIYqaw$> >>>>> though. >>>>> >>>>> Thanks >>>>> >>>>> Pavan >>>>> >>>>> On Tue, Aug 8, 2023 at 5:43 AM Mich Talebzadeh < >>>>> mich.talebza...@gmail.com> wrote: >>>>> >>>>> Hi Pavan or anyone else >>>>> >>>>> Is there any way one access the matrix displayed on SparkGUI? For >>>>> example the readings for processing time? Can these be acessed? >>>>> >>>>> Thanks >>>>> >>>>> For example, >>>>> Mich Talebzadeh, >>>>> Solutions Architect/Engineering Lead >>>>> London >>>>> United Kingdom >>>>> >>>>> >>>>> view my Linkedin profile >>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!d-qX4RylsnHucGkE4OdsO8agaKMFV59tVQnWZL1FbbZLVLWVUWgWmiiKC1Mvyy-796X-uP5XZfjLEbrVfe771d6VrCySTg$> >>>>> >>>>> >>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!d-qX4RylsnHucGkE4OdsO8agaKMFV59tVQnWZL1FbbZLVLWVUWgWmiiKC1Mvyy-796X-uP5XZfjLEbrVfe771d4r4xOqSg$> >>>>> >>>>> >>>>> >>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>>>> any loss, damage or destruction of data or any other property which may >>>>> arise from relying on this email's technical content is explicitly >>>>> disclaimed. The author will in no case be liable for any monetary damages >>>>> arising from such loss, damage or destruction. >>>>> >>>>> >>>>> >>>>> >>>>> On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi < >>>>> pkotikalap...@twilio.com> wrote: >>>>> >>>>> Thanks for the review Mich, >>>>> >>>>> Yes, the configuration parameters we end up setting would be based on >>>>> the trigger interval. >>>>> >>>>> > If you are going to have additional indicators why not look at >>>>> scheduling delay as well >>>>> Yes. The implementation is based on scheduling delays, not for pending >>>>> tasks of the current stage but rather pending tasks of all the stages >>>>> in a micro-batch >>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352/files*diff-fdddb0421641035be18233c212f0e3ccd2d6a49d345bd0cd4eac08fc4d911e21R1025__;Iw!!NCc8flgU!d-qX4RylsnHucGkE4OdsO8agaKMFV59tVQnWZL1FbbZLVLWVUWgWmiiKC1Mvyy-796X-uP5XZfjLEbrVfe771d6feoFH2Q$> >>>>> (hence >>>>> trigger interval). >>>>> >>>>> > we ought to utilise the historical statistics collected under the >>>>> checkpointing directory to get more accurate statistics >>>>> You are right! This is just a simple implementation based on one >>>>> factor, we should also look into other indicators as well If that would >>>>> help build a better scaling algorithm. >>>>> >>>>> Thank you, >>>>> >>>>> Pavan >>>>> >>>>> On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh < >>>>> mich.talebza...@gmail.com> wrote: >>>>> >>>>> Hi, >>>>> >>>>> I glanced over the design doc. >>>>> >>>>> You are providing certain configuration parameters plus some settings >>>>> based on static values. For example: >>>>> >>>>> spark.dynamicAllocation.schedulerBacklogTimeout": 54s >>>>> >>>>> I cannot see any use of <processing time> which ought to be at least >>>>> half of the batch interval to have the correct margins (confidence >>>>> level). If >>>>> you are going to have additional indicators why not look at scheduling >>>>> delay as well. Moreover most of the needed statistics are also available >>>>> to >>>>> set accurate values. My inclination is that this is a great effort >>>>> but we ought to utilise the historical statistics collected under >>>>> checkpointing directory to get more accurate statistics. I will >>>>> review the design document in duew course >>>>> >>>>> HTH >>>>> >>>>> Mich Talebzadeh, >>>>> Solutions Architect/Engineering Lead >>>>> London >>>>> United Kingdom >>>>> >>>>> >>>>> view my Linkedin profile >>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLFr9YSZnw$> >>>>> >>>>> >>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLEPx44C1w$> >>>>> >>>>> >>>>> >>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>>>> any loss, damage or destruction of data or any other property which may >>>>> arise from relying on this email's technical content is explicitly >>>>> disclaimed. The author will in no case be liable for any monetary damages >>>>> arising from such loss, damage or destruction. >>>>> >>>>> >>>>> >>>>> >>>>> On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi >>>>> <pkotikalap...@twilio.com.invalid> wrote: >>>>> >>>>> Hi Spark Dev, >>>>> >>>>> I have extended traditional DRA to work for structured streaming >>>>> use-case. >>>>> >>>>> Here is an initial Implementation draft PR >>>>> https://github.com/apache/spark/pull/42352 >>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLHLe7WCUw$> >>>>> and >>>>> design doc: >>>>> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing >>>>> <https://urldefense.com/v3/__https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLFAjJfilg$> >>>>> >>>>> Please review and let me know what you think. >>>>> >>>>> Thank you, >>>>> >>>>> Pavan >>>>> >>>>>