Thanks Davies and Ron! It was indeed a ulimit issue. Thanks a lot!

Best,
Baoqiang Cao
Blog: http://baoqiang.org
Email: bqcaom...@gmail.com
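P.S. For anyone who finds this thread later: below is a minimal sketch of checking (and, up to the hard limit, raising) the per-process open-file limit from Python using the stdlib resource module. The target value of 10000 is illustrative, not the exact number I used, and raising the soft limit this way only affects the current process; the Spark executors still need the /etc/security/limits.conf change (and a re-login) that Ron described.

========================================
import resource

# Current soft/hard limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft=%d hard=%d" % (soft, hard))

# The soft limit can be raised up to the hard limit without root;
# going past the hard limit requires /etc/security/limits.conf and a re-login.
target = 10000 if hard == resource.RLIM_INFINITY else min(10000, hard)
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    print("raised soft limit to %d" % target)
========================================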
On Aug 11, 2014, at 3:08 AM, Ron Gonzalez <zlgonza...@yahoo.com> wrote:
> If you're running on Ubuntu, run ulimit -n, which gives the max number of
> allowed open files. You will have to change the value in
> /etc/security/limits.conf to something like 10000, then log out and log back in.
>
> Thanks,
> Ron
>
> Sent from my iPad
>
>> On Aug 10, 2014, at 10:19 PM, Davies Liu <dav...@databricks.com> wrote:
>>
>>> On Fri, Aug 8, 2014 at 9:12 AM, Baoqiang Cao <bqcaom...@gmail.com> wrote:
>>> Hi there,
>>>
>>> I ran into a problem and can’t find a solution.
>>>
>>> I was running bin/pyspark < ../python/wordcount.py
>>
>> You could use bin/spark-submit ../python/wordcount.py instead.
>>
>>> The wordcount.py is here:
>>>
>>> ========================================
>>> import sys
>>> from operator import add
>>>
>>> from pyspark import SparkContext
>>>
>>> datafile = '/mnt/data/m1.txt'
>>>
>>> sc = SparkContext()
>>> outfile = datafile + '.freq'
>>> lines = sc.textFile(datafile, 1)
>>> counts = lines.flatMap(lambda x: x.split(' ')) \
>>>               .map(lambda x: (x, 1)) \
>>>               .reduceByKey(add)
>>> output = counts.collect()
>>>
>>> outf = open(outfile, 'w')
>>> for (word, count) in output:
>>>     outf.write(word.encode('utf-8') + '\t' + str(count) + '\n')
>>> outf.close()
>>> ========================================
>>>
>>> The error message is here:
>>>
>>> 14/08/08 16:01:59 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 0)
>>> java.io.FileNotFoundException:
>>> /tmp/spark-local-20140808160150-d36b/12/shuffle_0_0_468 (Too many open files)
>>
>> This message means that Spark (the JVM) has reached the max number of open
>> files; there is an fd leak somewhere. Unfortunately, I cannot reproduce this
>> problem. Which version of Spark are you using?
>>
>>>     at java.io.FileOutputStream.open(Native Method)
>>>     at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
>>>     at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:107)
>>>     at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:175)
>>>     at org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
>>>     at org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
>>>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>>     at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
>>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>     at org.apache.spark.scheduler.Task.run(Task.scala:54)
>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     at java.lang.Thread.run(Thread.java:744)
>>>
>>> The m1.txt is about 4 GB, and I have >120 GB of RAM and used -Xmx120GB. It is
>>> on Ubuntu. Any help, please?
>>>
>>> Best,
>>> Baoqiang Cao
>>> Blog: http://baoqiang.org
>>> Email: bqcaom...@gmail.com
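For the archives, a sketch of how wordcount.py might look when run through bin/spark-submit as Davies suggested, not the exact file used here (Python 2 / Spark 1.x era, as in this thread; the data path is the one from the original mail, and the with-block is simply a tidier way to make sure the output file gets closed):

========================================
import sys
from operator import add

from pyspark import SparkContext

if __name__ == '__main__':
    # Data path from the original mail; can be overridden on the command line:
    #   bin/spark-submit wordcount.py /mnt/data/m1.txt
    datafile = sys.argv[1] if len(sys.argv) > 1 else '/mnt/data/m1.txt'
    outfile = datafile + '.freq'

    sc = SparkContext(appName='wordcount')
    counts = (sc.textFile(datafile)
                .flatMap(lambda line: line.split(' '))
                .map(lambda word: (word, 1))
                .reduceByKey(add))

    # collect() brings every (word, count) pair back to the driver; fine for a
    # modest vocabulary, otherwise counts.saveAsTextFile(outfile) keeps the
    # output distributed and avoids driver memory pressure.
    with open(outfile, 'w') as outf:
        for word, count in counts.collect():
            outf.write(word.encode('utf-8') + '\t' + str(count) + '\n')

    sc.stop()
========================================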