Which version of Spark are you running? There is a bug in the SQL Python API [1]; it's fixed in 1.2.1 and 1.3.
[1] https://issues.apache.org/jira/browse/SPARK-6055

On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
<eduardo.c...@usmediaconsulting.com> wrote:
> Hi Guys, I'm running the following function with spark-submit and the OS
> is killing my process:
>
>     def getRdd(self, date, provider):
>         path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
>         log2 = self.sqlContext.jsonFile(path)
>         log2.registerTempTable('log_test')
>         log2.cache()

You only visit the table once, so cache() does not help here; caching pays
off only when the same data is read more than once.

>         out = self.sqlContext.sql("SELECT user, tax from log_test where provider = '" + provider + "' and country <> ''").map(lambda row: (row.user, row.tax))
>         print "out1"
>         return map((lambda (x, y): (x, list(y))),
>                    sorted(out.groupByKey(2000).collect()))

100 partitions (or fewer) will be enough for a 2 GB dataset.

>
> The input dataset has 57 gzip files (2 GB).
>
> The same process with a smaller dataset completed successfully.
>
> Any ideas to debug are welcome.
>
> Regards
> Eduardo
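
For the cache() point, a minimal sketch (hypothetical queries, not from
your code) of the case where cache() does pay off, i.e. when several
queries read the same temp table:

    # Hypothetical illustration: cache() is worthwhile only when the
    # parsed JSON is read more than once, e.g. by two separate queries.
    log2 = self.sqlContext.jsonFile(path)
    log2.registerTempTable('log_test')
    log2.cache()  # first query parses and caches; later queries reuse it
    totals = self.sqlContext.sql("SELECT count(*) FROM log_test").collect()
    users = self.sqlContext.sql("SELECT DISTINCT user FROM log_test").collect()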
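
Putting both suggestions together, here is a sketch of how getRdd could
look (untested; names like AWS_BUCKET and self.sqlContext are taken from
your snippet):

    def getRdd(self, date, provider):
        # Read the gzipped JSON logs from S3 (Spark 1.2/1.3 Python API).
        path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
        log2 = self.sqlContext.jsonFile(path)
        log2.registerTempTable('log_test')  # no cache(): read only once
        query = ("SELECT user, tax FROM log_test "
                 "WHERE provider = '" + provider + "' AND country <> ''")
        out = self.sqlContext.sql(query).map(lambda row: (row.user, row.tax))
        # 100 partitions instead of 2000 is plenty for ~2 GB of input.
        grouped = sorted(out.groupByKey(100).collect())
        return [(user, list(taxes)) for user, taxes in grouped]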