I don't think that's feasible with the current logic. Query planning time should normally be a tiny fraction of the trigger interval, unless you are processing tiny micro-batches very frequently. You might want to try increasing the trigger interval so that each micro-batch processes more data, and see if that helps. Tiny-micro-batch use cases should ideally be served by continuous mode (once it matures), which does not have this overhead.
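For illustration, a minimal sketch of the two trigger options (the rate source, interval values and the `spark` session handle are placeholders, not taken from your job):

```scala
import org.apache.spark.sql.streaming.Trigger

// Placeholder pipeline on the built-in rate source; assumes a
// spark-shell / SparkSession named `spark` is already available.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 1000)
  .load()

// Micro-batch mode: a longer trigger interval means fewer, larger batches,
// so the fixed per-batch planning cost is amortised over more rows.
val microBatch = events.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

// Continuous mode (experimental): no per-batch planning, but only
// map/filter-like operations and a few sources/sinks are supported.
val continuous = events.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()
```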
Thanks,
Arun

On Mon, 18 Mar 2019 at 00:39, Jungtaek Lim <kabh...@gmail.com> wrote:

> Almost everything is coupled with the logical plan right now, including the
> updated range for each source in the new batch, the updated watermark for stateful
> operations, and the random seed in each batch. Please refer to the code below:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala
>
> We might try replacing these things in the physical plan so that the logical
> plan doesn't need to be re-evaluated, but I'm not sure that's feasible.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Mon, 18 Mar 2019 at 4:03 PM, Paolo Platter <paolo.plat...@agilelab.it> wrote:
>
>> I can understand that if you involve columns with variable distributions
>> in join operations, it may change your execution plan, but most of the time
>> this is not going to happen. In streaming the most used operations are map,
>> filter, grouping and stateful operations, and in all these cases I can't see
>> how dynamic query planning could help.
>>
>> It could be useful to have a parameter that forces a streaming query to
>> calculate the query plan just once.
>>
>> Paolo
>>
>> Get Outlook for Android <https://aka.ms/ghei36>
>>
>> ------------------------------
>> *From:* Alessandro Solimando <alessandro.solima...@gmail.com>
>> *Sent:* Thursday, March 14, 2019 6:59:50 PM
>> *To:* Paolo Platter
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Structured Streaming & Query Planning
>>
>> Hello Paolo,
>> generally speaking, query planning is mostly based on statistics and
>> distributions of data values for the involved columns, which might change
>> significantly over time in a streaming context, so to me it makes sense
>> that it is run at every schedule, even though I understand your concern.
>>
>> As for the second question, I don't know how to cache the computed query
>> plan (or whether you even can).
>>
>> If possible, would you mind sharing your findings afterwards? (Query
>> planning on streaming is a very interesting and not yet sufficiently
>> explored topic, IMO.)
>>
>> Best regards,
>> Alessandro
>>
>> On Thu, 14 Mar 2019 at 16:51, Paolo Platter <paolo.plat...@agilelab.it>
>> wrote:
>>
>>> Hi All,
>>>
>>> I would like to understand why, in a streaming query (which should not be
>>> able to change its behaviour across iterations), there is a
>>> queryPlanning-Duration effort (in my case 33% of the trigger interval) at
>>> every schedule. I don't understand why this is needed and whether it is
>>> possible to disable or cache it.
>>>
>>> Thanks
>>>
>>> *Paolo Platter*
>>> *CTO*
>>> E-mail: paolo.plat...@agilelab.it
>>> Web Site: www.agilelab.it
>>>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
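P.S. The per-trigger planning cost Paolo mentions above can be read from the query's progress report; a small sketch, assuming a running query handle named `query` (a placeholder, not from the thread):

```scala
// durationMs breaks down the last trigger; keys include "queryPlanning",
// "getBatch", "addBatch", "walCommit" and "triggerExecution" (all in ms).
val progress   = query.lastProgress
val planningMs = progress.durationMs.get("queryPlanning").doubleValue()
val totalMs    = progress.durationMs.get("triggerExecution").doubleValue()
println(f"query planning took ${100.0 * planningMs / totalMs}%.1f%% of the trigger")
```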