I don't think it's feasible with the current logic. Typically the query
planning time should be a tiny fraction of the trigger interval, unless you
are processing tiny micro-batches very frequently. You might want to
consider increasing the trigger interval so that each micro-batch processes
more data, and see if that helps. The tiny micro-batch use cases should
ideally be served by continuous mode (once it matures), which would not
have this overhead.
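
To make that concrete, here is a minimal sketch of bumping the trigger
interval (the rate source, console sink and interval values below are just
placeholders for your actual pipeline):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder.appName("trigger-demo").getOrCreate()

    // Toy streaming source; replace with your real source.
    val df = spark.readStream.format("rate").load()

    // Larger micro-batches: one round of query planning per minute
    // instead of one every few seconds.
    val query = df.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()

    // Experimental continuous mode (no per-batch planning), only for
    // supported sources/sinks:
    // .trigger(Trigger.Continuous("1 second"))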

Thanks,
Arun

On Mon, 18 Mar 2019 at 00:39, Jungtaek Lim <kabh...@gmail.com> wrote:

> Almost everything is coupled with the logical plan right now, including the
> updated offset range for each source in the new batch, the updated watermark
> for stateful operations, and the random seed for each batch. Please refer to
> the code below:
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala
>
> We might try replacing these things in the physical plan so that the logical
> plan doesn't need to be re-evaluated, but I'm not sure that's feasible.
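>
> To illustrate the coupling with a toy sketch (the names below are made up
> for illustration and are not the actual Spark classes): the per-batch values
> live inside the plan itself, so a fresh plan has to be built at every
> trigger.
>
>     // Toy illustration only, not Spark internals.
>     case class BatchPlan(startOffset: Long, endOffset: Long,
>                          watermarkMs: Long, seed: Long)
>
>     def planNextBatch(prevEnd: Long, available: Long,
>                       watermarkMs: Long): BatchPlan =
>       BatchPlan(prevEnd, available, watermarkMs, scala.util.Random.nextLong())
>
>     // Each trigger yields a different plan because offsets, watermark and
>     // seed all differ from the previous batch.
>     val batch1 = planNextBatch(0L, 100L, 0L)
>     val batch2 = planNextBatch(100L, 250L, 5000L)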
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Mon, 18 Mar 2019 at 16:03, Paolo Platter <paolo.plat...@agilelab.it> wrote:
>
>> I can understand that if you involve columns with a variable distribution
>> in join operations, the execution plan may change, but most of the time this
>> is not going to happen. In streaming, the most common operations are map,
>> filter, grouping and stateful operations, and in all these cases I can't see
>> how dynamic query planning would help.
>>
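>> For example, a query like the following sketch (rate source and console
>> sink as stand-ins, column names made up) uses only map/filter/aggregation,
>> so the shape of its plan is identical at every trigger even though offsets
>> and the watermark change underneath:
>>
>>     import org.apache.spark.sql.SparkSession
>>
>>     val spark = SparkSession.builder.appName("shape-demo").getOrCreate()
>>     import spark.implicits._
>>
>>     val events = spark.readStream.format("rate").load()  // placeholder source
>>
>>     // map/filter/grouping only: no join whose strategy could flip.
>>     val counts = events
>>       .withColumn("bucket", ($"value" % 10).cast("string"))
>>       .filter($"value" > 0)
>>       .groupBy($"bucket")
>>       .count()
>>
>>     val query = counts.writeStream
>>       .outputMode("update")
>>       .format("console")
>>       .start()
>>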
>> It could be useful to have a parameter to force a streaming query to
>> calculate the query plan just once.
>>
>> Paolo
>>
>> ------------------------------
>> *From:* Alessandro Solimando <alessandro.solima...@gmail.com>
>> *Sent:* Thursday, March 14, 2019 6:59:50 PM
>> *To:* Paolo Platter
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Structured Streaming & Query Planning
>>
>> Hello Paolo,
>> generally speaking, query planning is mostly based on statistics and
>> distributions of the data values in the involved columns, which might change
>> significantly over time in a streaming context. So for me it makes a lot of
>> sense that it is run at every trigger, even though I understand your
>> concern.
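>>
>> As a rough illustration of how statistics feed planning (a batch example
>> with made-up sizes): whether a join is broadcast or shuffled depends on the
>> estimated size of one side, and in a stream such estimates can drift over
>> time.
>>
>>     import org.apache.spark.sql.SparkSession
>>
>>     val spark = SparkSession.builder.appName("stats-demo").getOrCreate()
>>
>>     // The planner broadcasts a join side only if its estimated size is
>>     // below this threshold (bytes).
>>     spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
>>
>>     val big   = spark.range(1000000).toDF("id")
>>     val small = spark.range(100).toDF("id")
>>
>>     // Inspect the chosen physical plan (BroadcastHashJoin vs SortMergeJoin).
>>     big.join(small, "id").explain()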
>>
>> As for the second question, I don't know how to cache the computed query
>> plan, or whether you even can.
>>
>> If possible, would you mind sharing your findings afterwards? (Query
>> planning in streaming is a very interesting and not yet sufficiently
>> explored topic, IMO.)
>>
>> Best regards,
>> Alessandro
>>
>> On Thu, 14 Mar 2019 at 16:51, Paolo Platter <paolo.plat...@agilelab.it>
>> wrote:
>>
>>> Hi All,
>>>
>>>
>>>
>>> I would like to understand why, in a streaming query (which should not be
>>> able to change its behaviour from one micro-batch to the next), there is a
>>> queryPlanning duration (in my case 33% of the trigger interval) at every
>>> trigger. I don't understand why this is needed, and whether it is possible
>>> to disable it or cache the plan.
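>>>
>>> (For anyone who wants to check this on their own query: the planning time
>>> is exposed in the per-trigger progress metrics; a minimal sketch of reading
>>> them, assuming `query` is a started StreamingQuery:)
>>>
>>>     // Per-trigger timings reported by Structured Streaming, in ms.
>>>     // "queryPlanning" is the planning cost, "triggerExecution" the total.
>>>     val durations  = query.lastProgress.durationMs
>>>     val planningMs = durations.get("queryPlanning")
>>>     val totalMs    = durations.get("triggerExecution")
>>>     println(s"planning: $planningMs ms of $totalMs ms")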
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>> *Paolo Platter*
>>>
>>> *CTO*
>>>
>>> E-mail:        paolo.plat...@agilelab.it
>>>
>>> Web Site:   www.agilelab.it
>>>
>>>
>>>
>>>
>>>
>>
