What's the version of Spark you are running?

There is a bug in the SQL Python API [1]; it's fixed in 1.2.1 and 1.3.

[1] https://issues.apache.org/jira/browse/SPARK-6055
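If you're not sure which version you're on, you can print it from the
running context; a minimal sketch, assuming an already-created
SparkContext named sc:

    # sc.version reports the version of the running Spark context
    print sc.version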

On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
<eduardo.c...@usmediaconsulting.com> wrote:
> Hi guys, I'm running the following function with spark-submit and the OS is
> killing my process:
>
>
>   def getRdd(self, date, provider):
>     path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
>     log2 = self.sqlContext.jsonFile(path)
>     log2.registerTempTable('log_test')
>     log2.cache()

You only visit the table once, so cache() does not help here.
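Caching only pays off when the same data is read more than once. A
minimal sketch of the pattern where it would help, assuming the same
sqlContext and path as above (cacheTable is the SQL-side way to cache a
registered table):

    # Hypothetical variation: cache the table, then query it twice.
    log2 = self.sqlContext.jsonFile(path)
    log2.registerTempTable('log_test')
    self.sqlContext.cacheTable('log_test')  # in-memory columnar cache
    # The first query materializes the cache; the second reuses it
    # instead of re-reading the gzipped JSON from S3.
    providers = self.sqlContext.sql("SELECT DISTINCT provider FROM log_test")
    countries = self.sqlContext.sql("SELECT DISTINCT country FROM log_test")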

>     out = self.sqlContext.sql("SELECT user, tax FROM log_test WHERE provider
> = '" + provider + "' AND country <> ''").map(lambda row: (row.user, row.tax))
>     print "out1"
>     return map(lambda (x, y): (x, list(y)),
>                sorted(out.groupByKey(2000).collect()))

100 partitions (or fewer) should be enough for a 2 GB dataset.
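For example, a hypothetical tweak to the last line above, keeping
everything else the same:

    return map(lambda (x, y): (x, list(y)),
               sorted(out.groupByKey(100).collect()))

Fewer partitions also means fewer tasks and less scheduling overhead
during the shuffle.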

>
>
> The input dataset has 57 gzipped log files (2 GB in total).
>
> The same process completed successfully with a smaller dataset.
>
> Any ideas on how to debug this are welcome.
>
> Regards
> Eduardo
>
>
