Re: Scan Sharing in Spark

2015-05-05 Thread Quang-Nhat HOANG-XUAN
Hi, Besides caching, is it possible for an RDD to have multiple child RDDs? That way I could read the input once and produce multiple outputs for multiple jobs that share the input. On May 5, 2015 6:24 PM, "Evan R. Sparks" wrote: > Scan sharing can indeed be a useful optimization in spark, because you > amort
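
To make the question concrete, here is a minimal sketch of one parent RDD with two child RDDs. It is an illustration only, not code from the thread: the object name, the input path, the local master setting, and the `isError`/`isWarning` predicates are all assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MultipleChildRdds {
  // Hypothetical predicates, used only to derive two different children.
  def isError(line: String): Boolean = line.contains("ERROR")
  def isWarning(line: String): Boolean = line.contains("WARN")

  def main(args: Array[String]): Unit = {
    // Hypothetical input path; pass a real one as the first argument.
    val inputPath = args.headOption.getOrElse("input.txt")
    // Local mode is assumed here so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("multiple-child-rdds").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // One parent RDD...
      val lines = sc.textFile(inputPath)
      // ...with two children that share it in their lineage.
      val errors = lines.filter(isError)
      val warnings = lines.filter(isWarning)

      // Each action below is a separate job that walks the lineage from the source.
      println(s"errors = ${errors.count()}")
      println(s"warnings = ${warnings.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

An RDD can have any number of children, but because each action runs as its own job, the parent is recomputed per action unless it is persisted. In this sketch the file would therefore be scanned twice.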

Re: Scan Sharing in Spark

2015-05-05 Thread Evan R. Sparks
Scan sharing can indeed be a useful optimization in Spark, because you amortize not only the time spent scanning over the data, but also the time spent in task launch and scheduling overheads. Here's a trivial example in Scala. I'm not aware of a place in SparkSQL where this is used - I'd imagine that
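
The example from the message is cut off in this listing. As a stand-in, and not the author's code, the sketch below shows one way scan sharing might look: two independent queries (a sum and a max) are folded into a single `aggregate` call, so the data is scanned by one job instead of two. The object name, the sample data, and the `seqOp`/`combOp` helpers are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ScanSharingSketch {
  // Accumulator for both queries: (running sum, running max).
  type Acc = (Double, Double)
  val zero: Acc = (0.0, Double.NegativeInfinity)

  // Fold one element into a partition-local accumulator.
  def seqOp(acc: Acc, x: Double): Acc = (acc._1 + x, math.max(acc._2, x))

  // Merge the accumulators of two partitions.
  def combOp(a: Acc, b: Acc): Acc = (a._1 + b._1, math.max(a._2, b._2))

  def main(args: Array[String]): Unit = {
    // Local mode is assumed here so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("scan-sharing-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(1 to 1000).map(_.toDouble)

      // Unshared: two actions, hence two jobs and two passes over the data.
      val separateSum = data.reduce(_ + _)
      val separateMax = data.reduce((a, b) => math.max(a, b))

      // Shared: one action, hence one job and one pass that answers both queries.
      val (sharedSum, sharedMax) = data.aggregate(zero)(seqOp, combOp)

      println(s"separate: sum=$separateSum max=$separateMax")
      println(s"shared:   sum=$sharedSum max=$sharedMax")
    } finally {
      sc.stop()
    }
  }
}
```

The shared version pays the task launch and scheduling cost once, which is the amortization described above. The trade-off is that the queries have to be combined by hand into one accumulator.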

Scan Sharing in Spark

2015-05-05 Thread Quang-Nhat HOANG-XUAN
Hi everyone, I have two Spark jobs inside a Spark application that read from the same input file. They are executed in two threads. Right now, I cache the input file in memory before executing these two jobs. Are there other ways to share the same input with only one read? I know ther
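
For reference, the setup described in the question might look roughly like the sketch below: the input is cached and materialized, then two jobs are submitted from two threads. This is an assumption about the poster's code, not a copy of it. The object name, the input path, the `lineLength` helper, and the two example jobs are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object CachedInputTwoJobs {
  // Hypothetical per-record work for the second job.
  def lineLength(line: String): Long = line.length.toLong

  def main(args: Array[String]): Unit = {
    // Hypothetical input path; pass a real one as the first argument.
    val inputPath = args.headOption.getOrElse("input.txt")
    // Local mode is assumed here so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("cached-input-two-jobs").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val input = sc.textFile(inputPath).cache()
      // cache() is lazy, so run one action to read the file and fill the cache.
      val totalLines = input.count()

      // Two jobs submitted from separate threads, both served from the cached RDD.
      val nonEmptyLines = Future { input.filter(_.nonEmpty).count() }
      val totalChars = Future { input.map(lineLength).fold(0L)(_ + _) }

      println(s"lines = $totalLines")
      println(s"non-empty lines = ${Await.result(nonEmptyLines, Duration.Inf)}")
      println(s"chars = ${Await.result(totalChars, Duration.Inf)}")
    } finally {
      sc.stop()
    }
  }
}
```

If the two jobs were submitted before the cache had been filled, each of them could end up reading the file itself, which is why the sketch runs an action first.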