Hi Jacek:
The javadoc mentions that we can only consume data from the data frame in
the addBatch method. So, if I would like to save the data to a new sink,
then I believe I will need to collect the data and then save it. This is
the reason I am asking about how to control the size of the data frame.
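For illustration, here is a minimal sketch (not code from this thread) of a
custom Sink whose addBatch consumes the batch on the executors instead of
collecting it to the driver. ExternalStoreClient is a hypothetical stub, the
Sink trait is Spark 2.x's internal org.apache.spark.sql.execution.streaming.Sink,
and whether foreachPartition satisfies the javadoc's "consume only" contract
is something to verify against your Spark version.

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.execution.streaming.Sink

// Hypothetical external-store client, stubbed so the sketch compiles.
class ExternalStoreClient(url: String) {
  def write(row: Row): Unit = println(s"$url <- $row")  // placeholder write
  def close(): Unit = ()
}

class ExternalStoreSink(connectionUrl: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Copy the field to a local val so the closure does not capture `this`.
    val url = connectionUrl
    // An explicitly typed Iterator[Row] => Unit avoids the foreachPartition
    // overload ambiguity; it runs on the executors, so each task holds only
    // one partition's rows rather than the whole batch on the driver.
    val writePartition: Iterator[Row] => Unit = { rows =>
      val client = new ExternalStoreClient(url)
      try rows.foreach(client.write)
      finally client.close()
    }
    data.foreachPartition(writePartition)
  }
}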
Hi,
> If the data is very large then a collect may result in OOM.
That's a general concern in any part of Spark, including Spark Structured
Streaming. Why would you collect in addBatch? collect runs on the driver
side, and like anything on the driver it is confined to a single JVM (and
is usually not fault tolerant).
Thanks Tathagata for your answer.
The reason I was asking about controlling data size is that the javadoc
indicates you can use foreach or collect on the dataframe. If the data is
very large, then a collect may result in an OOM.
From your answer it appears that the only way to control the size (in 2
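Since the javadoc points at foreach/collect, one user-facing route that
avoids a driver-side collect altogether is writeStream.foreach with a
ForeachWriter: open/process/close run on the executors, one partition at a
time. A bare-bones sketch, where println stands in for a real write to an
external store:

import org.apache.spark.sql.{ForeachWriter, Row}

class LoggingForeachWriter extends ForeachWriter[Row] {
  // Returning true tells Spark to process this partition for this epoch.
  override def open(partitionId: Long, version: Long): Boolean = true
  // Called once per row on the executor; replace println with a real write.
  override def process(row: Row): Unit = println(row)
  override def close(errorOrNull: Throwable): Unit = ()
}

// Usage (checkpoint path is made up):
// df.writeStream
//   .option("checkpointLocation", "/tmp/checkpoints/example")
//   .foreach(new LoggingForeachWriter)
//   .start()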
1. It is all the result data in that trigger. Note that it takes a
DataFrame, which is a purely logical representation of data and has no
association with partitions, etc., which are physical representations.
2. If you want to limit the amount of data that is processed in a trigger,
then you should
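The reply is cut off above. For what it's worth, one common way to bound how
much data lands in each trigger (an assumption on my part, not necessarily
what was about to be suggested) is to rate-limit at the source, e.g.
maxFilesPerTrigger on file sources or maxOffsetsPerTrigger on the Kafka
source. A sketch with made-up paths, broker, and topic names:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("rate-limit-sketch").getOrCreate()

// File source: cap the number of new files picked up per micro-batch.
val fileSchema = StructType(Seq(StructField("value", StringType)))
val files = spark.readStream
  .schema(fileSchema)
  .option("maxFilesPerTrigger", "10")
  .json("/data/incoming")

// Kafka source (needs the spark-sql-kafka-0-10 package): cap the number of
// offsets pulled per micro-batch across all partitions.
val kafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", "100000")
  .load()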
Hi:
The documentation for Sink.addBatch is as follows:
/**
 * Adds a batch of data to this sink. The data for a given `batchId` is
 * deterministic and if this method is called more than once with the same
 * batchId (which will happen in the case of failures), then `data` should
 * only be added once.
 */
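For reference, the (internal) trait that this comment documents has roughly
the following shape in Spark 2.x; a sketch of the signature, not a verbatim
copy of the source:

import org.apache.spark.sql.DataFrame

trait Sink {
  // Called once per completed trigger with that batch's result data.
  def addBatch(batchId: Long, data: DataFrame): Unit
}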