Ah, I see. You need to
follow those other calls through to their implementations to see what
ultimately happens. For example, the map() calls are to RDD.map, not one
of Scala's built-in map methods for collections. The implementation
looks like this:
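(Roughly -- this is a simplified sketch of RDD.map and the MappedRDD it creates, not the literal Spark source:)

    // Inside the RDD[T] class: map() doesn't touch any data, it just wraps
    // this RDD and the user's function in a new MappedRDD node of the graph.
    def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))

    // MappedRDD records only its parent and the function to apply.
    // compute() runs later, one partition at a time, when an action
    // finally forces evaluation.
    class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
      extends RDD[U](prev) {

      override def getPartitions: Array[Partition] = firstParent[T].partitions

      override def compute(split: Partition, context: TaskContext): Iterator[U] =
        firstParent[T].iterator(split, context).map(f)
    }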
So once you get to one of the most primitive operations, like map(), you'll see the function actually generates a specific type of RDD representing the transformation. MappedRDD just stores a reference to the previous RDD and the function it needs to apply -- it doesn't actually contain any data. Of course the idea is that it *looks* like the normal map(), filter(), etc. in Scala, but it doesn't work the same way.

By calling a bunch of these functions, you end up generating a graph, specifically a DAG, of RDDs. This graph describes all the steps needed to perform the operation, but no data. The final action, e.g. count() or collect(), that triggers computation is called on one of these RDDs. To get the value out, the Spark runtime/scheduler traverses the DAG starting from that RDD and triggers evaluation of any parent RDDs it needs that aren't computed and cached yet.

Any future operations build on the same DAG as long as you use the same RDD objects and, if you used cache() or persist(), they can reuse the same data after it has been computed the first time.
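As a small illustration (sc here is a SparkContext; the input path and variable names are made up):

    // Each transformation just adds a node to the DAG; nothing is read or computed yet.
    val lines    = sc.textFile("hdfs:///some/input.txt")
    val lengths  = lines.map(_.length)
    val longOnes = lengths.filter(_ > 80)

    longOnes.cache()                // still lazy: only marks this RDD for caching

    val first  = longOnes.count()   // action: the scheduler walks the DAG and computes it
    val second = longOnes.count()   // reuses the cached data instead of recomputing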
-Ewen