Hi there,

I ran into a problem and can't find a solution.

I was running bin/pyspark < ../python/wordcount.py

The wordcount.py is here:

========================================
from operator import add

from pyspark import SparkContext

datafile = '/mnt/data/m1.txt'
outfile = datafile + '.freq'

sc = SparkContext()
lines = sc.textFile(datafile, 1)

# Split each line into words, pair each word with a 1, and sum per word.
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
output = counts.collect()

# Write "word<TAB>count" lines; the file is closed even if a write fails.
with open(outfile, 'w') as outf:
    for (word, count) in output:
        outf.write(word.encode('utf-8') + '\t' + str(count) + '\n')
========================================


The error message is here:

14/08/08 16:01:59 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 0)
java.io.FileNotFoundException: /tmp/spark-local-20140808160150-d36b/12/shuffle_0_0_468 (Too many open files)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:107)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:175)
        at org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
        at org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
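
In case it's relevant, this is a quick way to check the per-process open-file limits from Python's standard resource module (just a diagnostic sketch; the actual values on the box may differ):

```python
import resource

# The soft limit is what the process can currently use; the hard limit is the
# ceiling the soft limit can be raised to without extra privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('open-file limits: soft=%d, hard=%d' % (soft, hard))
```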


The m1.txt file is about 4 GB, and I have more than 120 GB of RAM and used -Xmx120GB. It is on Ubuntu. Any help, please?

Best
Baoqiang Cao
Blog: http://baoqiang.org
Email: bqcaom...@gmail.com



