I'm curious whether you see the same thing when using bdutil against GCS. I'm wondering if this may be an issue with the transfer rate along the Spark -> Hadoop -> GCS Connector -> GCS path.
On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <alexbare...@gmail.com> wrote:

> All,
>
> I'm using the Spark shell to interact with a small test deployment of
> Spark, built from the current master branch. I'm processing a dataset
> comprising a few thousand objects on Google Cloud Storage, split into a
> half dozen directories. My code constructs an object--let me call it the
> Dataset object--that defines a distinct RDD for each directory. The
> constructor of the object only defines the RDDs; it does not actually
> evaluate them, so I would expect it to return very quickly. Indeed, the
> logging code in the constructor prints a line signaling the completion of
> the code almost immediately after invocation, but the Spark shell does not
> show the prompt right away. Instead, it spends a few minutes seemingly
> frozen, eventually producing the following output:
>
> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9
> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759
> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228
> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076
> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013
> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156
>
> This stage is inexplicably slow. What could be happening?
>
> Thanks,
>
> Alex
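For reference, here is a minimal sketch of the pattern the quoted message describes, with hypothetical class, bucket, and directory names (nothing here is the poster's actual code). The key point it illustrates: `sc.textFile` only records the input path, so the constructor returns immediately; the `mapred.FileInputFormat` "Total input paths to process" listing in the log happens later, when each RDD's partitions are first computed, and each listing is a round of remote metadata calls against GCS:

```scala
// Sketch only -- assumes a running Spark shell, where `sc` is the SparkContext.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for the "Dataset object" described in the message.
class Dataset(sc: SparkContext, dirs: Seq[String]) {
  // Defining the RDDs is cheap: textFile just records the path lazily.
  val rdds: Map[String, RDD[String]] =
    dirs.map(d => d -> sc.textFile(d)).toMap
  println("Dataset constructed") // prints almost immediately
}

// Hypothetical GCS paths, one RDD per directory.
val ds = new Dataset(sc, Seq(
  "gs://some-bucket/dir1",
  "gs://some-bucket/dir2"))

// The FileInputFormat listing ("Total input paths to process : N") is
// triggered the first time partitions are computed for each RDD, e.g. by
// an action or by touching .partitions. Against GCS this can take minutes
// for directories with thousands of objects:
ds.rdds.values.foreach(rdd => println(rdd.partitions.length))
```

This is where I'd expect the time to go if the listing path through the GCS connector is the bottleneck, rather than the data transfer itself.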