I'm curious whether you're seeing the same thing when using bdutil against GCS.
I'm wondering if this may be an issue with the transfer rate along the path
Spark -> Hadoop -> GCS connector -> GCS.

On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <alexbare...@gmail.com>
wrote:

> All,
>
> I'm using the Spark shell to interact with a small test deployment of
> Spark, built from the current master branch. I'm processing a dataset
> comprising a few thousand objects on Google Cloud Storage, split into a
> half dozen directories. My code constructs an object--let me call it the
> Dataset object--that defines a distinct RDD for each directory. The
> constructor of the object only defines the RDDs; it does not actually
> evaluate them, so I would expect it to return very quickly. Indeed, the
> logging code in the constructor prints a line signaling the completion of
> the code almost immediately after invocation, but the Spark shell does not
> show the prompt right away. Instead, it spends a few minutes seemingly
> frozen, eventually producing the following output:
>
> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
> process : 9
>
> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
> process : 759
>
> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
> process : 228
>
> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
> process : 3076
>
> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
> process : 1013
>
> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
> process : 156
>
> This stage is inexplicably slow. What could be happening?
>
> Thanks.
>
>
> Alex
>
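The pattern described above — a constructor that only *defines* the per-directory
RDDs and should therefore return immediately, with the expensive input-path
listing happening later — can be sketched in plain Python (this is a minimal
illustration of the lazy-definition idea, not actual Spark code; the `Dataset`
class, directory names, and path counts are all hypothetical):

```python
class Dataset:
    """Sketch of a constructor that defines per-directory work lazily."""

    def __init__(self, directories):
        # Only record *how* to build each per-directory "RDD"; do no
        # listing yet, so the constructor returns almost immediately.
        self.rdds = {d: (lambda d=d: self._list_paths(d)) for d in directories}
        print("Dataset defined")

    def _list_paths(self, directory):
        # Stand-in for FileInputFormat enumerating the input paths --
        # in the real setup this is the step that takes minutes on GCS.
        return [f"{directory}/part-{i:05d}" for i in range(3)]

    def evaluate(self, directory):
        # Forcing a directory triggers the deferred listing.
        return self.rdds[directory]()


ds = Dataset(["dir-a", "dir-b"])   # fast: nothing is listed yet
paths = ds.evaluate("dir-a")       # slow step happens only here
print(paths)
```

If the Spark shell hangs *before* the first action, something is evidently
forcing the partition/path enumeration earlier than this sketch would suggest,
which is consistent with the `FileInputFormat: Total input paths to process`
log lines appearing right after construction.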