Could you try to remove the line `log2.cache()`?

On Thu, Mar 26, 2015 at 10:02 AM, Eduardo Cusa <eduardo.c...@usmediaconsulting.com> wrote:
> I'm running on EC2:
>
> 1 Master: 4 CPU, 15 GB RAM (2 GB swap)
>
> 2 Slaves: 4 CPU, 15 GB RAM
>
> The uncompressed dataset size is 15 GB.
>
> On Thu, Mar 26, 2015 at 10:41 AM, Eduardo Cusa
> <eduardo.c...@usmediaconsulting.com> wrote:
>>
>> Hi Davies, I upgraded to 1.3.0 and I'm still getting Out of Memory.
>>
>> I ran the same code as before; do I need to make any changes?
>>
>> On Wed, Mar 25, 2015 at 4:00 PM, Davies Liu <dav...@databricks.com> wrote:
>>>
>>> With batchSize = 1, I think it will become even worse.
>>>
>>> I'd suggest going with 1.3 and having a taste of the new DataFrame API.
>>>
>>> On Wed, Mar 25, 2015 at 11:49 AM, Eduardo Cusa
>>> <eduardo.c...@usmediaconsulting.com> wrote:
>>> > Hi Davies, I'm running 1.1.0.
>>> >
>>> > Now I'm following this thread, which recommends using the batchSize parameter = 1:
>>> >
>>> > http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html
>>> >
>>> > If this does not work I will install 1.2.1 or 1.3.
>>> >
>>> > Regards
>>> >
>>> > On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu <dav...@databricks.com> wrote:
>>> >>
>>> >> What's the version of Spark you are running?
>>> >>
>>> >> There is a bug in the SQL Python API [1]; it's fixed in 1.2.1 and 1.3.
>>> >>
>>> >> [1] https://issues.apache.org/jira/browse/SPARK-6055
>>> >>
>>> >> On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
>>> >> <eduardo.c...@usmediaconsulting.com> wrote:
>>> >> > Hi guys, I'm running the following function with spark-submit and
>>> >> > the OS is killing my process:
>>> >> >
>>> >> > def getRdd(self,date,provider):
>>> >> >     path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz'
>>> >> >     log2= self.sqlContext.jsonFile(path)
>>> >> >     log2.registerTempTable('log_test')
>>> >> >     log2.cache()
>>> >>
>>> >> You only visit the table once, so cache does not help here.
>>> >>
>>> >> >     out=self.sqlContext.sql("SELECT user, tax from log_test where provider = '"+provider+"' and country <> ''").map(lambda row: (row.user, row.tax))
>>> >> >     print "out1"
>>> >> >     return map((lambda (x,y): (x, list(y))), sorted(out.groupByKey(2000).collect()))
>>> >>
>>> >> 100 partitions (or less) will be enough for a 2 GB dataset.
>>> >>
>>> >> > The input dataset has 57 gzipped files (2 GB).
>>> >> >
>>> >> > The same process with a smaller dataset completed successfully.
>>> >> >
>>> >> > Any ideas on how to debug this are welcome.
>>> >> >
>>> >> > Regards,
>>> >> > Eduardo
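
For reference, a minimal sketch of getRdd with the two suggestions from this thread applied: the cache() call removed (the table is only scanned once) and groupByKey() using about 100 partitions instead of 2000. It assumes the same surrounding class, the AWS_BUCKET constant and self.sqlContext from the original code, and Spark 1.x with Python 2:

    def getRdd(self, date, provider):
        # Read the gzipped JSON logs for the given day from S3.
        path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
        log2 = self.sqlContext.jsonFile(path)
        log2.registerTempTable('log_test')
        # No cache(): the table is only visited once, so caching it just
        # adds memory pressure.
        out = self.sqlContext.sql(
            "SELECT user, tax FROM log_test "
            "WHERE provider = '" + provider + "' AND country <> ''"
        ).map(lambda row: (row.user, row.tax))
        # ~100 partitions is enough for a ~2 GB dataset; 2000 mostly adds
        # shuffle overhead.
        return map(lambda (x, y): (x, list(y)),
                   sorted(out.groupByKey(100).collect()))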
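
Davies also suggests trying the new DataFrame API in 1.3. A rough sketch of the same query written against it, assuming jsonFile() returns a DataFrame (as it does in 1.3) and the same column names (user, tax, provider, country) as in the original schema:

    def getRdd(self, date, provider):
        path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
        df = self.sqlContext.jsonFile(path)  # DataFrame in Spark 1.3
        # Filter and project with Column expressions instead of a SQL string.
        out = (df.filter((df.provider == provider) & (df.country != ''))
                 .select('user', 'tax')
                 .map(lambda row: (row.user, row.tax)))
        return map(lambda (x, y): (x, list(y)),
                   sorted(out.groupByKey(100).collect()))

Building the filter from Column expressions also avoids concatenating the provider value directly into a raw SQL string.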