Minor correction: the Slack channel is actually #beam-spark.

Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com | +16502035555
From: Kyle Weaver <kcwea...@google.com>
Date: Tue, May 14, 2019 at 9:38 AM
To: <user@beam.apache.org>

> Hi Augusto,
>
> Right now the default behavior is to cache all intermediate RDDs that are
> consumed more than once by the pipeline. This can be disabled with
> `options.setCacheDisabled(true)` [1], but there is currently no way for
> the user to tell the runner that it should cache certain RDDs but not
> others.
>
> There has recently been some discussion on Slack (#spark-beam) about
> implementing such a feature, but there are no concrete plans as of yet.
>
> [1] https://github.com/apache/beam/blob/81faf35c8a42493317eba9fa1e7b06fb42d54662/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java#L150
>
> Thanks
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com | +16502035555
>
> From: augusto....@gmail.com <augusto....@gmail.com>
> Date: Tue, May 14, 2019 at 5:01 AM
> To: <user@beam.apache.org>
>
>> Hi,
>>
>> I guess the title says it all: right now it seems like Beam caches all
>> the intermediate RDD results for my pipeline when using the Spark
>> runner, which leads to very inefficient use of memory. Is there any way
>> to control this?
>>
>> Best regards,
>> Augusto
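For anyone looking for the concrete incantation: a minimal sketch of setting this option in a Java pipeline might look like the following. SparkPipelineOptions and setCacheDisabled come from the source linked above [1]; the surrounding scaffolding is the standard Beam options pattern, and DisableRddCaching is just an illustrative class name.

    import org.apache.beam.runners.spark.SparkPipelineOptions;
    import org.apache.beam.runners.spark.SparkRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class DisableRddCaching {  // illustrative class name
      public static void main(String[] args) {
        // Parse command-line args into the Spark runner's option interface.
        SparkPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
        options.setRunner(SparkRunner.class);

        // Opt out of the default behavior of caching every intermediate
        // RDD that is consumed more than once.
        options.setCacheDisabled(true);

        Pipeline pipeline = Pipeline.create(options);
        // ... construct the pipeline's transforms here ...
        pipeline.run().waitUntilFinish();
      }
    }

Since Beam derives command-line flags from option property names, passing --cacheDisabled=true on the command line should have the same effect as the setter call.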