Ok, thanks for the diagram. So you have ~30 seconds duration for each batch (as in foreachBatch) and 60 rows per batch.
Back to your question: "My question is now: Is it guaranteed by Spark that all output records of one event are always contained in a single batch or can the records also be split into multiple batches?"

foreachBatch processes all records delivered within that batch (the volume can vary, but it includes everything not yet processed) to completion before the next batch starts. Batches are processed sequentially; there is no way for foreachBatch to decide to process only half the rows. When you say "an event", you mean one batch, correct? So yes, a single event will be processed by one batch until it finishes.

I performed a test on this. I register the method that processes each batch with:

    foreachBatch(SendToBigQuery). \

That method, SendToBigQuery, takes two params, df and batchId:

    def SendToBigQuery(df, batchId):
        """
        Uses the standard Spark-BigQuery API to write to the table.
        Additional transformation logic can be performed here.
        """
        print(batchId)
        if len(df.take(1)) > 0:
            # df.printSchema()
            df.persist()
            print(df.count())
            # Write data to config['MDVariables']['targetTable']
            s.writeTableToBQ(df, "append",
                             config['MDVariables']['targetDataset'],
                             config['MDVariables']['targetTable'])
            # Release the cache before the next batch arrives
            df.unpersist()

Note that I had not run this streaming job for a couple of days, so the first batch picked up everything from where it had left off (the backlog):

    {'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': False}
    []
    None
    0          <- print(batchId)
    1602277    <- print(df.count())
    1
    450
    2
    100
    3
    150
    4
    130
    5
    160
    6
    100
    7
    170
    8
    100
    9
    100
    10
    130

That first batch (1,602,277 rows) took a long time to process (the write goes to Google BigQuery in the cloud), which meant the next batch had only 450 rows to deal with, and so forth. Remember, nothing is lost, but the processing takes what it takes. I have attached a Structured Streaming page; note that the first batch, dealing with 1,602,277 rows, took 160 seconds to finish!
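To make that concrete, here is a minimal, self-contained sketch of the idea (an illustration, not your pipeline: the built-in rate source stands in for Kafka, and the hypothetical fetch_records UDF stands in for the REST call). Because the fan-out runs inside the micro-batch that contains the event, every row derived from one event reaches foreachBatch together:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import ArrayType, StringType

    spark = SparkSession.builder.appName("batch-guarantee-sketch").getOrCreate()

    # Hypothetical stand-in for the REST call: one event fans out into many rows.
    @F.udf(returnType=ArrayType(StringType()))
    def fetch_records(event_id):
        return [f"{event_id}-{i}" for i in range(1000)]

    # The built-in rate source plays the role of Kafka: one "event" per row.
    events = (spark.readStream
              .format("rate")
              .option("rowsPerSecond", 1)
              .load()
              .withColumn("record", F.explode(fetch_records("value"))))

    def check_batch(df, batchId):
        # Each distinct "value" (event) shows up with its full fan-out count,
        # because the expansion happened inside this same micro-batch.
        print(batchId)
        df.groupBy("value").count().show()

    (events.writeStream
        .outputMode("append")
        .foreachBatch(check_batch)
        .start()
        .awaitTermination())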
HTH

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Mon, 8 Mar 2021 at 10:24, Dipl.-Inf. Rico Bergmann <i...@ricobergmann.de> wrote:

> Hi Mich!
>
> Here's a screenshot of the processing rates.
>
> Best,
>
> Rico.
>
> On 05.03.2021 at 16:07, Mich Talebzadeh wrote:
>
> Hi Rico,
>
> Would it be possible for you to provide a snapshot of the Structured
> Streaming tab (from the Spark GUI)?
>
> Thanks
>
> Mich
>
> On Fri, 5 Mar 2021 at 13:44, Dipl.-Inf. Rico Bergmann <i...@ricobergmann.de> wrote:
>
>> Hi!
>>
>> As abstract code, what I do in my streaming program is:
>>
>>     readStream()  // from Kafka
>>       .flatMap(readIngestionDatasetViaREST)  // can return thousands of records for a single event
>>       .writeStream.outputMode("append").foreachBatch(upsertIntoDeltaTable).start()
>>
>> I don't use triggers, but I limit the number of events per trigger in the
>> Kafka reader.
>>
>> What do you mean by "process rate below batch duration"? The process rate
>> is in records per sec. (in my current deployment it's approx. 1); the batch
>> duration is in sec. (at around 60 sec.)
>>
>> Best,
>>
>> Rico
>>
>> On 05.03.2021 at 10:58, Mich Talebzadeh wrote:
>>
>> Hi Rico,
>>
>> Just to clarify: your batch interval may have a variable number of rows
>> sent to the Kafka topic for each event?
>>
>> In your writeStream code:
>>
>>     writeStream. \
>>         outputMode('append'). \
>>         option("truncate", "false"). \
>>         foreachBatch(SendToBigQuery). \
>>         trigger(processingTime='2 seconds'). \
>>         start()
>>
>> Have you defined trigger(processingTime)? That is equivalent to your
>> sliding interval. In general, processingTime == batch interval (the event).
>>
>> In the Spark GUI, under Structured Streaming, you have Input Rate, Process
>> Rate and Batch Duration. Your Process Rate has to be below Batch Duration.
>> foreachBatch will process all the data that comes in before moving to the
>> next batch. It is up to the designer to ensure that the processing time
>> stays below the event interval so Spark can keep up.
>>
>> HTH
>>
>> On Fri, 5 Mar 2021 at 08:06, Dipl.-Inf. Rico Bergmann <i...@ricobergmann.de> wrote:
>>
>>> Hi all!
>>>
>>> I'm using Spark Structured Streaming for a data ingestion pipeline.
>>> Basically, the pipeline reads events (notifications of newly available
>>> data) from a Kafka topic and then queries a REST endpoint to get the
>>> real data (within a flatMap).
>>>
>>> For one single event the pipeline creates a few thousand records (rows)
>>> that have to be stored. To write the data I use foreachBatch().
>>>
>>> My question is now: Is it guaranteed by Spark that all output records of
>>> one event are always contained in a single batch, or can the records also
>>> be split into multiple batches?
>>>
>>> Best,
>>>
>>> Rico.
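As an aside on Rico's "I limit the number of events per trigger in the Kafka reader": with the Kafka source that is presumably the maxOffsetsPerTrigger option. A minimal sketch, with placeholder broker and topic names, of bounding batch size so a multi-day backlog is drained over several micro-batches rather than one huge first batch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bounded-batches-sketch").getOrCreate()

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
          .option("subscribe", "events_topic")               # placeholder
          .option("startingOffsets", "earliest")
          # Cap how many Kafka records each micro-batch may pull, so a
          # backlog is spread over many batches instead of arriving as one
          # huge first batch (e.g. the ~1.6M-row batch 0 above).
          .option("maxOffsetsPerTrigger", 100000)
          .load())
    # The rest of the query (flatMap / foreachBatch / start) is unchanged.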
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org