Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Here's another data point: the slow part of my code is the construction of an RDD as the union of the textFile RDDs representing data from several distinct Google Cloud Storage directories. So the question becomes: what computation happens when calling the union method on two RDDs?
On Wed,

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
For Spark to connect to GCS, it utilizes the Hadoop and GCS connector jars for connectivity. I'm wondering if it's those connection points that are ultimately slowing down the connection between Spark and GCS. The reason I was asking if you could run bdutil is because it would be basically Hadoop

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Well, what do you suggest I run to test this? But more importantly, what information would this give me?
On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee wrote:
>
> Oh, it makes sense if gsutil scans through this quickly, but I was
> wondering if running a Hadoop job / bdutil would result in just as f
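
One way to run such a comparison is sketched below. It is only a sketch: it assumes `hadoop` is on the PATH with the GCS connector configured (as on a bdutil-deployed cluster), the `time_listing` helper is hypothetical, and the bucket pattern is the one from the gsutil run in this thread. If the Hadoop listing is much slower than the gsutil listing, that would point at the Hadoop / GCS connector layer rather than at Spark itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a listing command, count the lines it prints,
# and report the elapsed wall-clock time in whole seconds.
# Usage: time_listing <label> <command> [args...]
time_listing() {
  local label=$1; shift
  local start end count
  start=$(date +%s)
  count=$("$@" 2>/dev/null | wc -l)
  end=$(date +%s)
  echo "$label: $((count)) entries in $((end - start))s"
}

# Same glob as in the gsutil timing; quoted so the local shell does not
# expand it and each tool resolves the wildcard itself.
PATTERN='gs://my-bucket/20141205/csv/*/*/*'

# Each tool is run only if it is installed.
if command -v gsutil >/dev/null 2>&1; then
  time_listing gsutil gsutil ls "$PATTERN"
fi
if command -v hadoop >/dev/null 2>&1; then
  time_listing hadoop hadoop fs -ls "$PATTERN"
fi
```

The two counts may differ slightly, since `hadoop fs -ls` can print extra header lines when a match is a directory; the elapsed times are the interesting part.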

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
Oh, it makes sense if gsutil scans through this quickly, but I was wondering if running a Hadoop job / bdutil would result in just as fast scans?
On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta wrote:
> Denny,
>
> No, gsutil scans through the listing of the bucket quickly. See the
> followi

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Denny,
No, gsutil scans through the listing of the bucket quickly. See the following.
alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"
6860
real 0m6.971s
user 0m1.052s
sys 0m0.096s
Alex
On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee wrote:
>
>

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
I'm curious if you're seeing the same thing when using bdutil against GCS? I'm wondering if this may be an issue concerning the transfer rate of Spark -> Hadoop -> GCS Connector -> GCS.
On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta wrote:
> All,
>
> I'm using the Spark shell to interact w