I'm running on EC2:
1 master: 4 CPUs, 15 GB RAM (2 GB swap)
2 slaves: 4 CPUs, 15 GB RAM

The uncompressed dataset size is 15 GB.

On Thu, Mar 26, 2015 at 10:41 AM, Eduardo Cusa <
eduardo.c...@usmediaconsulting.com> wrote:

> Hi Davies, I upgraded to 1.3.0 and I'm still getting Out of Memory.
>
> I ran the same code as before; do I need to make any changes?
>
> On Wed, Mar 25, 2015 at 4:00 PM, Davies Liu <dav...@databricks.com> wrote:
>
>> With batchSize = 1, I think it will become even worse.
>>
>> I'd suggest going with 1.3 and having a taste of the new DataFrame API.
>>
>> On Wed, Mar 25, 2015 at 11:49 AM, Eduardo Cusa
>> <eduardo.c...@usmediaconsulting.com> wrote:
>> > Hi Davies, I'm running 1.1.0.
>> >
>> > Now I'm following this thread, which recommends using the batchSize
>> > parameter = 1:
>> >
>> > http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html
>> >
>> > If this does not work, I will install 1.2.1 or 1.3.
>> >
>> > Regards
>> >
>> > On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu <dav...@databricks.com>
>> > wrote:
>> >>
>> >> What's the version of Spark you are running?
>> >>
>> >> There is a bug in the SQL Python API [1]; it's fixed in 1.2.1 and 1.3.
>> >>
>> >> [1] https://issues.apache.org/jira/browse/SPARK-6055
>> >>
>> >> On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
>> >> <eduardo.c...@usmediaconsulting.com> wrote:
>> >> > Hi guys, I'm running the following function with spark-submit and
>> >> > the OS is killing my process:
>> >> >
>> >> > def getRdd(self, date, provider):
>> >> >     path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
>> >> >     log2 = self.sqlContext.jsonFile(path)
>> >> >     log2.registerTempTable('log_test')
>> >> >     log2.cache()
>> >>
>> >> You only visit the table once; cache does not help here.
>> >>
>> >> >     out = self.sqlContext.sql("SELECT user, tax FROM log_test "
>> >> >         "WHERE provider = '" + provider + "' AND country <> ''"
>> >> >         ).map(lambda row: (row.user, row.tax))
>> >> >     print "out1"
>> >> >     return map(lambda (x, y): (x, list(y)),
>> >> >                sorted(out.groupByKey(2000).collect()))
>> >>
>> >> 100 partitions (or less) will be enough for a 2 GB dataset.
>> >>
>> >> > The input dataset is 57 gzipped log files (2 GB).
>> >> >
>> >> > The same process with a smaller dataset completed successfully.
>> >> >
>> >> > Any ideas to debug are welcome.
>> >> >
>> >> > Regards,
>> >> > Eduardo
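
For reference, the batchSize knob discussed above is a constructor argument
on SparkContext in the 1.x Python API; setting it to 1 disables batching, so
objects are pickled one at a time at the cost of throughput. A minimal
sketch (the application name is made up):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    # batchSize=1 disables batched pickling in the Python workers,
    # trading throughput for a smaller peak memory footprint
    sc = SparkContext(appName="log-aggregation", batchSize=1)
    sqlContext = SQLContext(sc)

Note that Davies expects batchSize = 1 to make this particular workload
worse, so treat it as a diagnostic rather than a fix.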
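
Putting Davies' suggestions together (the 1.3 DataFrame API, no cache(),
and roughly 100 partitions), getRdd could be restructured along these
lines. This is a sketch only: AWS_BUCKET and the column names come from
the original code, and it assumes Spark 1.3, where sqlContext.jsonFile()
returns a DataFrame and DataFrame.rdd exposes the underlying RDD of Rows:

    def getRdd(self, date, provider):
        path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
        df = self.sqlContext.jsonFile(path)
        # column expressions instead of pasting provider into a SQL string
        out = (df.filter((df.provider == provider) & (df.country != ''))
                 .select('user', 'tax')
                 .rdd
                 .map(lambda row: (row.user, row.tax)))
        # ~100 partitions should be plenty for a ~2 GB input
        return [(user, list(taxes))
                for user, taxes in sorted(out.groupByKey(100).collect())]

Collecting the grouped result still pulls everything onto the driver, so if
the out-of-memory persists it may be worth writing the output out with
saveAsTextFile() instead of calling collect().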