Hi Jacek:
The javadoc mentions that we can only consume data from the data frame in
the addBatch method. So, if I would like to save the data to a new sink,
then I believe I will need to collect the data and then save it. This is
the reason I am asking about how to control the size of the data frame.
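For illustration, here is a minimal sketch (not code from this thread) of a
custom Sink whose addBatch consumes the batch on the executors instead of
collecting it to the driver. ExternalStoreClient is a hypothetical stub, the
Sink trait is Spark 2.x's internal org.apache.spark.sql.execution.streaming.Sink,
and whether foreachPartition satisfies the javadoc's "consume only" contract
is something to verify against your Spark version.

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.execution.streaming.Sink

// Hypothetical external-store client, stubbed so the sketch compiles.
class ExternalStoreClient(url: String) {
  def write(row: Row): Unit = println(s"$url <- $row")  // placeholder write
  def close(): Unit = ()
}

class ExternalStoreSink(connectionUrl: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Copy the field to a local val so the closure does not capture `this`.
    val url = connectionUrl
    // An explicitly typed Iterator[Row] => Unit avoids the foreachPartition
    // overload ambiguity; it runs on the executors, so each task holds only
    // one partition's rows rather than the whole batch on the driver.
    val writePartition: Iterator[Row] => Unit = { rows =>
      val client = new ExternalStoreClient(url)
      try rows.foreach(client.write)
      finally client.close()
    }
    data.foreachPartition(writePartition)
  }
}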
Hi,
> If the data is very large then a collect may result in OOM.
That's a general concern in any part of Spark, including Spark Structured
Streaming. Why would you collect in addBatch? collect runs on the driver
side, and like anything on the driver it is confined to a single JVM (and
is usually not fault tolerant).
Thanks Tathagata for your answer.
The reason I was asking about controlling data size is that the javadoc
indicates you can use foreach or collect on the dataframe. If the data is
very large, then a collect may result in an OOM.
From your answer it appears that the only way to control the size (in 2
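Since the javadoc points at foreach/collect, one user-facing route that
avoids a driver-side collect altogether is writeStream.foreach with a
ForeachWriter: open/process/close run on the executors, one partition at a
time. A bare-bones sketch, where println stands in for a real write to an
external store:

import org.apache.spark.sql.{ForeachWriter, Row}

class LoggingForeachWriter extends ForeachWriter[Row] {
  // Returning true tells Spark to process this partition for this epoch.
  override def open(partitionId: Long, version: Long): Boolean = true
  // Called once per row on the executor; replace println with a real write.
  override def process(row: Row): Unit = println(row)
  override def close(errorOrNull: Throwable): Unit = ()
}

// Usage (checkpoint path is made up):
// df.writeStream
//   .option("checkpointLocation", "/tmp/checkpoints/example")
//   .foreach(new LoggingForeachWriter)
//   .start()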
1. It is all the result data in that trigger. Note that it takes a
DataFrame, which is a purely logical representation of data and has no
association with partitions, etc., which are physical representations.
2. If you want to limit the amount of data that is processed in a trigger,
then you should
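The reply is cut off above. For what it's worth, one common way to bound how
much data lands in each trigger (an assumption on my part, not necessarily
what was about to be suggested) is to rate-limit at the source, e.g.
maxFilesPerTrigger on file sources or maxOffsetsPerTrigger on the Kafka
source. A sketch with made-up paths, broker, and topic names:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("rate-limit-sketch").getOrCreate()

// File source: cap the number of new files picked up per micro-batch.
val fileSchema = StructType(Seq(StructField("value", StringType)))
val files = spark.readStream
  .schema(fileSchema)
  .option("maxFilesPerTrigger", "10")
  .json("/data/incoming")

// Kafka source (needs the spark-sql-kafka-0-10 package): cap the number of
// offsets pulled per micro-batch across all partitions.
val kafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", "100000")
  .load()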
Hi:
The documentation for Sink.addBatch is as follows:
/**
 * Adds a batch of data to this sink. The data for a given `batchId` is
 * deterministic and if this method is called more than once with the same
 * batchId (which will happen in the case of failures), then `data` should
 * only be added once.
 */
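For reference, the (internal) trait that this comment documents has roughly
the following shape in Spark 2.x; a sketch of the signature, not a verbatim
copy of the source:

import org.apache.spark.sql.DataFrame

trait Sink {
  // Called once per completed trigger with that batch's result data.
  def addBatch(batchId: Long, data: DataFrame): Unit
}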