Statements are executed only when you trigger some effect, i.e. an action (producing output or collecting data on the driver). At execution time Spark resolves all the dependencies, prunes paths that don't lead anywhere, and optimizes the execution pipeline, so you really don't have to worry about ordering these calls yourself.
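To make that concrete, here is a minimal PySpark sketch along the lines of your snippet below (rdd01, rdd02 and f are placeholders taken from your example; the reduceByKey function and the final map are illustrative assumptions on my part):

    x = rdd01.reduceByKey(lambda a, b: a + b)   # transformation: returns immediately, nothing runs yet
    top20 = x.take(20)                          # action: the driver blocks here until the shuffle job finishes
    b = sc.broadcast(top20)                     # runs only after take() has returned its 20 records
    y = rdd02.map(lambda rec: f(b.value, rec))  # lazy again: nothing executes until the next action on y

So it is the take(20) call that blocks, not the reduceByKey line.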
The important thing is that if your functions do work that depends on other operations in a way that is not expressed in the RDD graph, you may start seeing errors. For example, you might write a file to HDFS during one map operation and expect to read it back in another map operation. As far as Spark is concerned, a map operation is not expected to alter anything apart from the RDD it produces, so Spark may not realize the dependency and may schedule the two operations in parallel, causing errors. Bottom line: as long as you make all your dependencies explicit in the RDD graph, Spark will take care of the magic (see the sketch after the quoted message below).

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

On Sun, Sep 7, 2014 at 12:14 AM, didata <subscripti...@didata.us> wrote:
> Hello friends:
>
> I have a theory question about call blocking in a Spark driver.
>
> Consider this (admittedly contrived =:)) snippet to illustrate this question...
>
> >>> x = rdd01.reduceByKey()  # or maybe some other 'shuffle-requiring action'.
>
> >>> b = sc.broadcast(x.take(20))  # Or any statement that requires the previous statement to complete, cluster-wide.
>
> >>> y = rdd02.someAction(f(b))
>
> Would the first or second statement above block because the second (or third) statement needs to wait for the previous one to complete, cluster-wide?
>
> Maybe this isn't the best example (typed on a phone), but generally I'm trying to understand the scenario(s) where an RDD call in the driver may block because the graph indicates that the next statement is dependent on the completion of the current one, cluster-wide (not just lazy evaluated).
>
> Thank you. :)
>
> Sincerely yours,
> Team Dimension Data
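To illustrate the point about explicit dependencies, here is a small sketch of the fragile pattern and two explicit alternatives (rdd_a, rdd_b, write_temp_file, read_temp_file, transform and combine are hypothetical names, and the /tmp/staged path is made up; a SparkContext sc is assumed):

    # Fragile: one map() writes files to HDFS as a side effect and another map()
    # reads them back; Spark sees no dependency between the two RDDs, so there is
    # no guarantee the writes have finished before the reads start.
    staged  = rdd_a.map(write_temp_file)
    derived = rdd_b.map(read_temp_file)

    # Explicit option 1: keep the intermediate data inside the RDD lineage itself.
    result = rdd_a.map(transform).union(rdd_b).map(combine)

    # Explicit option 2: if the data really must pass through HDFS, make the write
    # an action so the driver blocks until it completes before the read is defined.
    rdd_a.map(transform).saveAsTextFile("hdfs:///tmp/staged")
    result = sc.textFile("hdfs:///tmp/staged").map(combine)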