Hi, I am new to Spark. I set up two virtual machines: one as a client, and one running a standalone-mode master and worker. Everything seems to start up and connect fine, but when I run a simple script I get a strange error.
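For reference, this is what the script boils down to when written out as a standalone program (a minimal sketch on my part; the app name "simple-count" is just a placeholder, and the master URL is the one from my setup below). In the interactive session below I simply type the same call at the >>> prompt, where sc is already created for me:

    from pyspark import SparkContext

    # Connect to the standalone master running on the other VM.
    sc = SparkContext("spark://192.168.16.109:7077", "simple-count")

    # The one-liner that fails: count a two-element RDD.
    print sc.parallelize([1, 2]).count()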
Here is the full session with the traceback (notice that my program is just a one-liner):

vagrant@precise32:/usr/local/spark$ MASTER=spark://192.168.16.109:7077 bin/pyspark
Python 2.7.3 (default, Apr 20 2012, 22:44:07)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
14/03/28 06:45:54 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/03/28 06:45:54 WARN Utils: Your hostname, precise32 resolves to a loopback address: 127.0.1.1; using 192.168.16.107 instead (on interface eth0)
14/03/28 06:45:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
14/03/28 06:45:55 INFO Slf4jLogger: Slf4jLogger started
14/03/28 06:45:55 INFO Remoting: Starting remoting
14/03/28 06:45:55 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@192.168.16.107:55440]
14/03/28 06:45:55 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@192.168.16.107:55440]
14/03/28 06:45:55 INFO SparkEnv: Registering BlockManagerMaster
14/03/28 06:45:55 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140328064555-5a1f
14/03/28 06:45:55 INFO MemoryStore: MemoryStore started with capacity 297.0 MB.
14/03/28 06:45:55 INFO ConnectionManager: Bound socket to port 55114 with id = ConnectionManagerId(192.168.16.107,55114)
14/03/28 06:45:55 INFO BlockManagerMaster: Trying to register BlockManager
14/03/28 06:45:55 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager 192.168.16.107:55114 with 297.0 MB RAM
14/03/28 06:45:55 INFO BlockManagerMaster: Registered BlockManager
14/03/28 06:45:55 INFO HttpServer: Starting HTTP Server
14/03/28 06:45:55 INFO HttpBroadcast: Broadcast server started at http://192.168.16.107:58268
14/03/28 06:45:55 INFO SparkEnv: Registering MapOutputTracker
14/03/28 06:45:55 INFO HttpFileServer: HTTP File server directory is /tmp/spark-2a1f1a0b-f4d9-402a-ac17-a41d9f9aea0c
14/03/28 06:45:55 INFO HttpServer: Starting HTTP Server
14/03/28 06:45:56 INFO SparkUI: Started Spark Web UI at http://192.168.16.107:4040
14/03/28 06:45:56 INFO AppClient$ClientActor: Connecting to master spark://192.168.16.109:7077...
14/03/28 06:45:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 0.9.0
      /_/

Using Python version 2.7.3 (default, Apr 20 2012 22:44:07)
Spark context available as sc.
>>> 14/03/28 06:45:58 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140327234558-0000
14/03/28 06:47:03 INFO AppClient$ClientActor: Executor added: app-20140327234558-0000/0 on worker-20140327234702-192.168.16.109-41619 (192.168.16.109:41619) with 1 cores
14/03/28 06:47:03 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140327234558-0000/0 on hostPort 192.168.16.109:41619 with 1 cores, 512.0 MB RAM
14/03/28 06:47:04 INFO AppClient$ClientActor: Executor updated: app-20140327234558-0000/0 is now RUNNING
14/03/28 06:47:06 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@192.168.16.109:45642/user/Executor#-154634467] with ID 0
14/03/28 06:47:07 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager 192.168.16.109:60587 with 297.0 MB RAM
>>>
>>> sc.parallelize([1,2]).count()
14/03/28 06:47:35 INFO SparkContext: Starting job: count at <stdin>:1
14/03/28 06:47:35 INFO DAGScheduler: Got job 0 (count at <stdin>:1) with 2 output partitions (allowLocal=false)
14/03/28 06:47:35 INFO DAGScheduler: Final stage: Stage 0 (count at <stdin>:1)
14/03/28 06:47:35 INFO DAGScheduler: Parents of final stage: List()
14/03/28 06:47:35 INFO DAGScheduler: Missing parents: List()
14/03/28 06:47:35 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at count at <stdin>:1), which has no missing parents
14/03/28 06:47:35 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PythonRDD[1] at count at <stdin>:1)
14/03/28 06:47:35 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/03/28 06:47:35 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 0: 192.168.16.109 (PROCESS_LOCAL)
14/03/28 06:47:35 INFO TaskSetManager: Serialized task 0.0:0 as 2546 bytes in 4 ms
14/03/28 06:47:37 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor 0: 192.168.16.109 (PROCESS_LOCAL)
14/03/28 06:47:37 INFO TaskSetManager: Serialized task 0.0:1 as 2546 bytes in 1 ms
14/03/28 06:47:37 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/03/28 06:47:37 WARN TaskSetManager: Loss was due to org.apache.spark.api.python.PythonException
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/pyspark/serializers.py", line 182, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/usr/local/spark/python/pyspark/serializers.py", line 117, in dump_stream
    for obj in iterator:
  File "/usr/local/spark/python/pyspark/serializers.py", line 171, in _batched
    for item in iterator:
  File "/usr/local/spark/python/pyspark/rdd.py", line 493, in func
    if acc is None:
TypeError: an integer is required
        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:46)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:45)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
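One more detail from the startup log above, in case it is relevant: the shell warns that my hostname resolves to a loopback address, falls back to 192.168.16.107, and suggests setting SPARK_LOCAL_IP. I left it at the default for this run. If it matters, this is roughly how I would try to pin the bind address in a standalone script (just a sketch on my part, assuming the environment variable is inherited by the JVM the driver launches; I have not verified that, and the IP and app name are the ones from my sketch above):

    from os import environ
    from pyspark import SparkContext

    # Assumption: setting this before the SparkContext is created lets the
    # driver-side JVM pick it up, per the "Set SPARK_LOCAL_IP" warning above.
    environ["SPARK_LOCAL_IP"] = "192.168.16.107"

    sc = SparkContext("spark://192.168.16.109:7077", "simple-count")
    print sc.parallelize([1, 2]).count()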