Hi, I am new to Spark. I set up two virtual machines: one as a client, and one running a standalone-mode master and worker. Everything seems to start up and connect fine, but when I run a simple script I get a strange error.
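For reference, this is what the script boils down to when written out as a standalone program (a minimal sketch on my part; the app name "simple-count" is just a placeholder, and the master URL is the one from my setup below). In the interactive session below I simply type the same call at the >>> prompt, where sc is already created for me:

    from pyspark import SparkContext

    # Connect to the standalone master running on the other VM.
    sc = SparkContext("spark://192.168.16.109:7077", "simple-count")

    # The one-liner that fails: count a two-element RDD.
    print sc.parallelize([1, 2]).count()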
Here is the full session with the traceback (notice that my program is just a one-liner):

vagrant@precise32:/usr/local/spark$ MASTER=spark://192.168.16.109:7077 bin/pyspark
Python 2.7.3 (default, Apr 20 2012, 22:44:07)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
14/03/28 06:45:54 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/03/28 06:45:54 WARN Utils: Your hostname, precise32 resolves to a loopback address: 127.0.1.1; using 192.168.16.107 instead (on interface eth0)
14/03/28 06:45:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
14/03/28 06:45:55 INFO Slf4jLogger: Slf4jLogger started
14/03/28 06:45:55 INFO Remoting: Starting remoting
14/03/28 06:45:55 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@192.168.16.107:55440]
14/03/28 06:45:55 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@192.168.16.107:55440]
14/03/28 06:45:55 INFO SparkEnv: Registering BlockManagerMaster
14/03/28 06:45:55 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140328064555-5a1f
14/03/28 06:45:55 INFO MemoryStore: MemoryStore started with capacity 297.0 MB.
14/03/28 06:45:55 INFO ConnectionManager: Bound socket to port 55114 with id = ConnectionManagerId(192.168.16.107,55114)
14/03/28 06:45:55 INFO BlockManagerMaster: Trying to register BlockManager
14/03/28 06:45:55 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager 192.168.16.107:55114 with 297.0 MB RAM
14/03/28 06:45:55 INFO BlockManagerMaster: Registered BlockManager
14/03/28 06:45:55 INFO HttpServer: Starting HTTP Server
14/03/28 06:45:55 INFO HttpBroadcast: Broadcast server started at http://192.168.16.107:58268
14/03/28 06:45:55 INFO SparkEnv: Registering MapOutputTracker
14/03/28 06:45:55 INFO HttpFileServer: HTTP File server directory is /tmp/spark-2a1f1a0b-f4d9-402a-ac17-a41d9f9aea0c
14/03/28 06:45:55 INFO HttpServer: Starting HTTP Server
14/03/28 06:45:56 INFO SparkUI: Started Spark Web UI at http://192.168.16.107:4040
14/03/28 06:45:56 INFO AppClient$ClientActor: Connecting to master spark://192.168.16.109:7077...
14/03/28 06:45:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 0.9.0
      /_/

Using Python version 2.7.3 (default, Apr 20 2012 22:44:07)
Spark context available as sc.
>>> 14/03/28 06:45:58 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140327234558-0000
14/03/28 06:47:03 INFO AppClient$ClientActor: Executor added: app-20140327234558-0000/0 on worker-20140327234702-192.168.16.109-41619 (192.168.16.109:41619) with 1 cores
14/03/28 06:47:03 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140327234558-0000/0 on hostPort 192.168.16.109:41619 with 1 cores, 512.0 MB RAM
14/03/28 06:47:04 INFO AppClient$ClientActor: Executor updated: app-20140327234558-0000/0 is now RUNNING
14/03/28 06:47:06 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@192.168.16.109:45642/user/Executor#-154634467] with ID 0
14/03/28 06:47:07 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager 192.168.16.109:60587 with 297.0 MB RAM
>>>
>>> sc.parallelize([1,2]).count()
14/03/28 06:47:35 INFO SparkContext: Starting job: count at <stdin>:1
14/03/28 06:47:35 INFO DAGScheduler: Got job 0 (count at <stdin>:1) with 2 output partitions (allowLocal=false)
14/03/28 06:47:35 INFO DAGScheduler: Final stage: Stage 0 (count at <stdin>:1)
14/03/28 06:47:35 INFO DAGScheduler: Parents of final stage: List()
14/03/28 06:47:35 INFO DAGScheduler: Missing parents: List()
14/03/28 06:47:35 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at count at <stdin>:1), which has no missing parents
14/03/28 06:47:35 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PythonRDD[1] at count at <stdin>:1)
14/03/28 06:47:35 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/03/28 06:47:35 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 0: 192.168.16.109 (PROCESS_LOCAL)
14/03/28 06:47:35 INFO TaskSetManager: Serialized task 0.0:0 as 2546 bytes in 4 ms
14/03/28 06:47:37 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor 0: 192.168.16.109 (PROCESS_LOCAL)
14/03/28 06:47:37 INFO TaskSetManager: Serialized task 0.0:1 as 2546 bytes in 1 ms
14/03/28 06:47:37 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/03/28 06:47:37 WARN TaskSetManager: Loss was due to org.apache.spark.api.python.PythonException
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/pyspark/serializers.py", line 182, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/usr/local/spark/python/pyspark/serializers.py", line 117, in dump_stream
    for obj in iterator:
  File "/usr/local/spark/python/pyspark/serializers.py", line 171, in _batched
    for item in iterator:
  File "/usr/local/spark/python/pyspark/rdd.py", line 493, in func
    if acc is None:
TypeError: an integer is required
        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:46)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:45)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
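One more detail from the startup log above, in case it is relevant: the shell warns that my hostname resolves to a loopback address, falls back to 192.168.16.107, and suggests setting SPARK_LOCAL_IP. I left it at the default for this run. If it matters, this is roughly how I would try to pin the bind address in a standalone script (just a sketch on my part, assuming the environment variable is inherited by the JVM the driver launches; I have not verified that, and the IP and app name are the ones from my sketch above):

    from os import environ
    from pyspark import SparkContext

    # Assumption: setting this before the SparkContext is created lets the
    # driver-side JVM pick it up, per the "Set SPARK_LOCAL_IP" warning above.
    environ["SPARK_LOCAL_IP"] = "192.168.16.107"

    sc = SparkContext("spark://192.168.16.109:7077", "simple-count")
    print sc.parallelize([1, 2]).count()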