Hi guys, I am running the following function with spark-submit and the OS is
killing my process:
def getRdd(self, date, provider):
    path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
    log2 = self.sqlContext.jsonFile(path)
    log2.registerTempTable('log_test')
    log2.cache()
    out = self.sqlContext.sql(
        "SELECT user, tax FROM log_test WHERE provider = '" + provider +
        "' AND country <> ''").map(lambda row: (row.user, row.tax))
    print "out1"
    return map((lambda (x, y): (x, list(y))),
               sorted(out.groupByKey(2000).collect()))
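
To make clear what the last statement of the function is meant to return: a list of (user, [tax, ...]) pairs sorted by user. Below is a minimal plain-Python sketch of that result, without Spark. The helper name group_and_sort and the sample rows are made up for illustration and are not taken from my real logs. Unlike groupByKey, this sketch keeps the values in input order.

```python
from collections import defaultdict


def group_and_sort(pairs):
    """Plain-Python stand-in for sorted(rdd.groupByKey().collect())."""
    grouped = defaultdict(list)
    for user, tax in pairs:
        grouped[user].append(tax)
    # Sort by user and return each group as a plain list,
    # which is what the map(...) over the collected result does.
    return [(user, grouped[user]) for user in sorted(grouped)]


# Hypothetical sample rows, only to show the shape of the output.
sample = [("u2", 0.1), ("u1", 0.3), ("u2", 0.2)]
result = group_and_sort(sample)
print(result)
```

In the real function the whole grouped result is brought back to the driver by collect(), so the full list of pairs has to fit in the driver's memory.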
The input dataset has 57 gzip files (2 GB).
The same process completed successfully with a smaller dataset.
Any ideas on how to debug this are welcome.
Regards
Eduardo