Minor correction: the Slack channel is actually #beam-spark.

Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com | +16502035555
From: Kyle Weaver <kcwea...@google.com>
Date: Tue, May 14, 2019 at 9:38 AM
To: <user@beam.apache.org>

> Hi Augusto,
>
> Right now the default behavior is to cache all intermediate RDDs that are
> consumed more than once by the pipeline. This can be disabled with
> `options.setCacheDisabled(true)` [1], but there is currently no way for
> the user to tell the runner that it should cache certain RDDs but not
> others.
>
> There has recently been some discussion on Slack (#spark-beam) about
> implementing such a feature, but there are no concrete plans as of yet.
>
> [1] https://github.com/apache/beam/blob/81faf35c8a42493317eba9fa1e7b06fb42d54662/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java#L150
>
> Thanks
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com | +16502035555
>
> From: augusto....@gmail.com <augusto....@gmail.com>
> Date: Tue, May 14, 2019 at 5:01 AM
> To: <user@beam.apache.org>
>
>> Hi,
>>
>> I guess the title says it all: right now it seems like Beam caches all
>> the intermediate RDD results for my pipeline when using the Spark
>> runner, which leads to very inefficient use of memory. Is there any way
>> to control this?
>>
>> Best regards,
>> Augusto
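For anyone looking for the concrete incantation: a minimal sketch of setting this option in a Java pipeline might look like the following. SparkPipelineOptions and setCacheDisabled come from the source linked above [1]; the surrounding scaffolding is the standard Beam options pattern, and DisableRddCaching is just an illustrative class name.

    import org.apache.beam.runners.spark.SparkPipelineOptions;
    import org.apache.beam.runners.spark.SparkRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class DisableRddCaching {  // illustrative class name
      public static void main(String[] args) {
        // Parse command-line args into the Spark runner's option interface.
        SparkPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
        options.setRunner(SparkRunner.class);

        // Opt out of the default behavior of caching every intermediate
        // RDD that is consumed more than once.
        options.setCacheDisabled(true);

        Pipeline pipeline = Pipeline.create(options);
        // ... construct the pipeline's transforms here ...
        pipeline.run().waitUntilFinish();
      }
    }

Since Beam derives command-line flags from option property names, passing --cacheDisabled=true on the command line should have the same effect as the setter call.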