Hi,

Besides caching, is it possible for an RDD to have multiple child RDDs, so that I can read the input once and produce multiple outputs for multiple jobs that share the input? A rough sketch of what I mean is below.
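A minimal sketch of the pattern I am asking about (the path and RDD names are made up for illustration, not from any real application):

// One input RDD with two derived child RDDs, each driving its own job.
val input = sc.textFile("hdfs:///path/to/input")   // hypothetical path

val childA = input.map(_.length)                   // child RDD 1
val childB = input.filter(_.contains("ERROR"))     // child RDD 2

// Each action below launches a separate Spark job; without input.cache(),
// each job re-scans the input file from storage.
val lengths    = childA.collect()
val errorCount = childB.count()

The question is whether the two scans can be shared without keeping the whole input cached in memory.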
On May 5, 2015 6:24 PM, "Evan R. Sparks" <evan.spa...@gmail.com> wrote:

> Scan sharing can indeed be a useful optimization in Spark, because you
> amortize not only the time spent scanning over the data, but also the time
> spent in task launch and scheduling overheads.
>
> Here's a trivial example in Scala. I'm not aware of a place in SparkSQL
> where this is used - I'd imagine that most development effort is being
> placed on single-query optimization right now.
>
> // This function takes a sequence of functions of type A => B and returns
> // a function of A => Seq[B], where each item in the output corresponds to
> // one function in the input list.
> def combineFunctions[A, B](fns: Seq[A => B]): A => Seq[B] = {
>   def combf(a: A): Seq[B] = {
>     fns.map(f => f(a))
>   }
>   combf
> }
>
> def plusOne(x: Int) = x + 1
> def timesFive(x: Int) = x * 5
>
> val sharedF = combineFunctions(Seq[Int => Int](plusOne, timesFive))
>
> val data = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7))
>
> // Apply this combined function to each of your data elements.
> val res = data.map(sharedF)
>
> res.take(5)
>
> The result will look something like this:
>
> res5: Array[Seq[Int]] = Array(List(2, 5), List(3, 10), List(4, 15),
> List(5, 20), List(6, 25))
>
> On Tue, May 5, 2015 at 8:53 AM, Quang-Nhat HOANG-XUAN <hxquangn...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I have two Spark jobs inside a Spark application, which read from the
>> same input file. They are executed in two threads.
>>
>> Right now, I cache the input file into memory before executing these two
>> jobs.
>>
>> Is there another way to share the same input with just one read?
>> I know there is something called Multiple Query Optimization, but I don't
>> know whether it is applicable to Spark (or SparkSQL).
>>
>> Thank you.
>>
>> Quang-Nhat
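For reference, a rough sketch of my understanding of the cache-plus-two-threads setup described in the quoted mail (the path and the two job bodies are placeholders I made up, not from the original application):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical shared input, cached so it is scanned from storage only once.
val input = sc.textFile("hdfs:///path/to/input").cache()
input.count()  // force materialization so both jobs read from the cache

// Two independent jobs submitted from separate threads; Spark can schedule
// them concurrently, both reading the cached partitions.
val jobA = Future { input.map(_.length).sum() }
val jobB = Future { input.filter(_.contains("ERROR")).count() }

val sumOfLengths = Await.result(jobA, Duration.Inf)
val errorCount   = Await.result(jobB, Duration.Inf)

If both actions start before the cache is populated, each of them may still scan the input, so forcing the cache first (the count() above) keeps it to a single read.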