This question is less about the FlatMapGroupsFunction specifically than about a type of processing that I haven't been able to find a good solution for in Spark. However, the FlatMapGroupsFunction serves as a good concrete example of what I'm trying to talk about.
In MapReduce, if I want to replicate each record that maps to a particular key some number of times (100, 1,000, 10,000, etc.), I can: for each record I just set up a loop over the number of replications I want and call context.write(outKey, outValue) to serialize out the data. The most memory I would ever need is the size of one of the objects I'm replicating, plus some input and output buffer overhead.

If I want to do this in Spark, there are a couple of ways I could try, but I'm not sure any of them would hold up in the limit where I replicate my data so much that the executor runs out of memory. The interface for the FlatMapGroupsFunction is a single function call: http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/function/FlatMapGroupsFunction.html . The return type is an Iterator<R>, which seems to imply that I would have to materialize N replications of my data in memory in order to return an iterator over them.

Is there a way to return data to Spark such that it buffers my output records for as long as it has memory available and then spills them to disk itself? This idea isn't limited to the Dataset API. The API for the PySpark RDD flatMap function has the same implications.

Again, I'm not trying to create the world's greatest data cloning application; it just serves as an example. Basically, I would like a memory-safe way to let the framework handle my creating more data within a single task context than an executor can hold.

Thanks in advance for any info people can provide. If you don't have an answer, I'm happy to brainstorm on potential solutions as well.
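
To make the replication example concrete, here is a minimal sketch in plain Python of the kind of output I have in mind: a generator that produces each copy only when the consumer asks for it, so the function itself never holds more than one output record. The names Record, replicate, and replicate_group are made up for illustration, and the sketch does not use Spark at all. Whether Spark would consume such an iterator incrementally, rather than buffering everything it yields, is an assumption on my part and is really the heart of the question.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple

# Hypothetical record shape for illustration: a (key, value) pair.
Record = Tuple[str, str]


def replicate(record: Record, copies: int) -> Iterator[Record]:
    """Yield `record` `copies` times without building a list of the copies."""
    for _ in range(copies):
        # Each copy is handed to the consumer before the next one is produced.
        yield record


def replicate_group(records: Iterable[Record], copies: int) -> Iterator[Record]:
    """Lazily replicate every record in a group, one output record at a time."""
    for record in records:
        yield from replicate(record, copies)


if __name__ == "__main__":
    # A billion copies per record are requested, but only three are ever
    # produced because the consumer stops pulling after three.
    out = replicate_group([("k1", "a"), ("k1", "b")], 10**9)
    print(list(islice(out, 3)))
```

With PySpark I would expect to pass this as something like rdd.flatMap(lambda rec: replicate(rec, 10000)), but I have not verified how the output is buffered on the Spark side.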
