HeartSaVioR commented on PR #54523: URL: https://github.com/apache/spark/pull/54523#issuecomment-3994054975
@holdenk First of all, are you on board with what we described in the "Why are the changes needed?" section? Overall, we find that an API handing an iterator directly to users is too flexible: users can easily hurt latency by their own hand, and it strongly restricts the future design of features, compared to other streaming frameworks. Unlike traditional Spark execution, in RTM it is better to build the execution baseline on record-by-record processing, and we have APIs that work against this.

Assuming you are on board, I think you raised a good point. We missed that, and we did not consider that users would resort to a hack to work around it. Ideally, we would provide an "official" way to initialize resources or expensive objects and clean them up at task completion (maybe the traditional open/process/close interface). That warrants a new API, either the existing API with a new signature or simply a separate one, and that would take time to go through.

That said, we feel it is better to block the problematic case first and take time to work on the alternative thoughtfully. We don't want to rush an alternative just because we want to block the problematic case today. But if you strongly require the alternative before blocking the case, we can consider it, though it may not fit the Apache Spark 4.2 timeline.

Would love to hear from you about the plan. Thanks in advance!
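
To make the open/process/close idea mentioned above concrete, here is a minimal plain-Python sketch of what such a contract could look like. It is only an illustration under assumptions: `RecordProcessor`, `run_task`, and `UpperCaser` are hypothetical names, not existing or proposed Spark APIs, and the real design may differ. The point is that the engine, not the user, owns the loop over records, while the user still gets a place to set up and release expensive objects once per task.

```python
from typing import Iterable, Iterator, List


class RecordProcessor:
    """Hypothetical open/process/close contract (not a Spark API)."""

    def open(self) -> None:
        """Called once per task, before any record is processed."""

    def process(self, record: str) -> Iterable[str]:
        """Called once per input record; returns zero or more outputs."""
        raise NotImplementedError

    def close(self) -> None:
        """Called once at task completion, even if processing failed."""


def run_task(processor: RecordProcessor, records: Iterable[str]) -> Iterator[str]:
    """Hypothetical engine-side driver: the engine iterates, the user does not."""
    processor.open()
    try:
        for record in records:
            # Outputs are emitted per record, so nothing forces buffering.
            yield from processor.process(record)
    finally:
        # Cleanup runs whether the task succeeds or fails.
        processor.close()


class UpperCaser(RecordProcessor):
    """Example user code; `events` only records the call order for illustration."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def open(self) -> None:
        self.events.append("open")  # stand-in for creating a costly client

    def process(self, record: str) -> Iterable[str]:
        self.events.append("process")
        return [record.upper()]

    def close(self) -> None:
        self.events.append("close")  # stand-in for releasing the client
```

With this shape, `list(run_task(UpperCaser(), ["a", "b"]))` calls `open` once, `process` once per record, and `close` once at the end, which is the lifecycle users currently emulate by hand inside iterator-based functions.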
