This sounds like a configuration issue. Either you have not set the MASTER
correctly, or possibly another process is using up all of the cores.
Dave
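
One thing worth ruling out on the memory side: the master log below shows the worker registering with only 64.0 MB RAM, while a standalone executor asks for spark.executor.memory (512 MB by default), so the scheduler can never accept a resource offer. A sketch of raising the worker's advertised memory in spark-env.sh (path and values are my assumptions, not from this thread):

```shell
# spark-env.sh -- illustrative values, not taken from the reporter's cluster:
export SPARK_MASTER_IP=hadoop-pg-5.cluster   # host the standalone master binds to
export SPARK_WORKER_MEMORY=1g                # memory the worker offers to executors;
                                             # must exceed spark.executor.memory (default 512m)
```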

From: ge ko [mailto:koenig....@gmail.com]
Sent: Sunday, April 13, 2014 12:51 PM
To: user@spark.apache.org
Subject:


Hi,

I'm just getting started with Spark and have installed the parcels on our
CDH5 GA cluster.



Master: hadoop-pg-5.cluster, Worker: hadoop-pg-7.cluster

Since several guides advised using FQDNs, the settings above seem reasonable
to me.



Both daemons are running, Master-Web-UI shows the connected worker, and the log 
entries show:

master:

2014-04-13 21:26:40,641 INFO Remoting: Starting remoting
2014-04-13 21:26:40,930 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077]
2014-04-13 21:26:41,356 INFO org.apache.spark.deploy.master.Master: Starting 
Spark master at spark://hadoop-pg-5.cluster:7077
...

2014-04-13 21:26:41,439 INFO org.eclipse.jetty.server.AbstractConnector: 
Started SelectChannelConnector@0.0.0.0:18080
2014-04-13 21:26:41,441 INFO org.apache.spark.deploy.master.ui.MasterWebUI: 
Started Master web UI at http://hadoop-pg-5.cluster:18080
2014-04-13 21:26:41,476 INFO org.apache.spark.deploy.master.Master: I have been 
elected leader! New state: ALIVE

2014-04-13 21:27:40,319 INFO org.apache.spark.deploy.master.Master: Registering 
worker hadoop-pg-5.cluster:7078 with 2 cores, 64.0 MB RAM



worker:

2014-04-13 21:27:39,037 INFO akka.event.slf4j.Slf4jLogger: Slf4jLogger started
2014-04-13 21:27:39,136 INFO Remoting: Starting remoting
2014-04-13 21:27:39,413 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkWorker@hadoop-pg-7.cluster:7078]
2014-04-13 21:27:39,706 INFO org.apache.spark.deploy.worker.Worker: Starting 
Spark worker hadoop-pg-7.cluster:7078 with 2 cores, 64.0 MB RAM
2014-04-13 21:27:39,708 INFO org.apache.spark.deploy.worker.Worker: Spark home: 
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark
...

2014-04-13 21:27:39,888 INFO org.eclipse.jetty.server.AbstractConnector: 
Started SelectChannelConnector@0.0.0.0:18081
2014-04-13 21:27:39,889 INFO org.apache.spark.deploy.worker.ui.WorkerWebUI: 
Started Worker web UI at http://hadoop-pg-7.cluster:18081
2014-04-13 21:27:39,890 INFO org.apache.spark.deploy.worker.Worker: Connecting 
to master spark://hadoop-pg-5.cluster:7077...
2014-04-13 21:27:40,360 INFO org.apache.spark.deploy.worker.Worker: 
Successfully registered with master spark://hadoop-pg-5.cluster:7077



Looks good so far.



Now I want to run the Python pi example by executing (on the worker):

cd /opt/cloudera/parcels/CDH/lib/spark && ./bin/pyspark ./python/examples/pi.py 
spark://hadoop-pg-5.cluster:7077
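
(For context: pi.py does a Monte Carlo estimate of pi. The same computation as a plain-Python sketch, without Spark; the function name and sample count are mine, not from the script:)

```python
import random

def estimate_pi(n, seed=42):
    # Draw n points in the unit square; the fraction that lands inside the
    # quarter circle approximates pi/4, so multiply by 4 at the end.
    rng = random.Random(seed)
    inside = sum(1 for _ in range(n)
                 if rng.random() ** 2 + rng.random() ** 2 < 1.0)
    return 4.0 * inside / n

print(estimate_pi(100000))
```

The Spark version distributes exactly this sampling loop across the worker's cores, which is why it needs the cluster to grant it resources before any output appears.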



Here the strange thing happens: the script doesn't get executed; it hangs, 
repeating this output forever:



14/04/13 21:31:03 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient memory
14/04/13 21:31:18 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient memory



The whole log is:





14/04/13 21:30:44 INFO Slf4jLogger: Slf4jLogger started
14/04/13 21:30:45 INFO Remoting: Starting remoting
14/04/13 21:30:45 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://spark@hadoop-pg-7.cluster:50601]
14/04/13 21:30:45 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://spark@hadoop-pg-7.cluster:50601]
14/04/13 21:30:45 INFO SparkEnv: Registering BlockManagerMaster
14/04/13 21:30:45 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20140413213045-acec
14/04/13 21:30:45 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.
14/04/13 21:30:45 INFO ConnectionManager: Bound socket to port 57506 with id = 
ConnectionManagerId(hadoop-pg-7.cluster,57506)
14/04/13 21:30:45 INFO BlockManagerMaster: Trying to register BlockManager
14/04/13 21:30:45 INFO BlockManagerMasterActor$BlockManagerInfo: Registering 
block manager hadoop-pg-7.cluster:57506 with 294.9 MB RAM
14/04/13 21:30:45 INFO BlockManagerMaster: Registered BlockManager
14/04/13 21:30:45 INFO HttpServer: Starting HTTP Server
14/04/13 21:30:45 INFO HttpBroadcast: Broadcast server started at 
http://10.147.210.7:51224
14/04/13 21:30:45 INFO SparkEnv: Registering MapOutputTracker
14/04/13 21:30:45 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-f9ab98c8-2adf-460a-9099-6dc07c7dc89f
14/04/13 21:30:45 INFO HttpServer: Starting HTTP Server
14/04/13 21:30:46 INFO SparkUI: Started Spark Web UI at 
http://hadoop-pg-7.cluster:4040
14/04/13 21:30:46 INFO AppClient$ClientActor: Connecting to master 
spark://hadoop-pg-5.cluster:7077...
14/04/13 21:30:47 INFO SparkDeploySchedulerBackend: Connected to Spark cluster 
with app ID app-20140413213046-0000
14/04/13 21:30:48 INFO SparkContext: Starting job: reduce at 
./python/examples/pi.py:36
14/04/13 21:30:48 INFO DAGScheduler: Got job 0 (reduce at 
./python/examples/pi.py:36) with 2 output partitions (allowLocal=false)
14/04/13 21:30:48 INFO DAGScheduler: Final stage: Stage 0 (reduce at 
./python/examples/pi.py:36)
14/04/13 21:30:48 INFO DAGScheduler: Parents of final stage: List()
14/04/13 21:30:48 INFO DAGScheduler: Missing parents: List()
14/04/13 21:30:48 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at reduce 
at ./python/examples/pi.py:36), which has no missing parents
14/04/13 21:30:48 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 
(PythonRDD[1] at reduce at ./python/examples/pi.py:36)
14/04/13 21:30:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/04/13 21:31:03 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient memory
14/04/13 21:31:18 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient memory





So I have to cancel the execution of the script. When I do, I receive the 
following log entries on the master (at the moment the Python pi script is 
cancelled):



2014-04-13 21:30:46,965 INFO org.apache.spark.deploy.master.Master: Registering 
app PythonPi
2014-04-13 21:30:46,974 INFO org.apache.spark.deploy.master.Master: Registered 
app PythonPi with ID app-20140413213046-0000
2014-04-13 21:31:27,123 INFO org.apache.spark.deploy.master.Master: 
akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,125 INFO org.apache.spark.deploy.master.Master: Removing 
app app-20140413213046-0000
2014-04-13 21:31:27,143 INFO org.apache.spark.deploy.master.Master: 
akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,144 INFO akka.actor.LocalActorRef: Message 
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
Actor[akka://sparkMaster/deadLetters] to 
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.147.210.7%3A44207-2#-389971336]
 was not delivered. [1] dead letters encountered. This logging can be turned 
off or adjusted with configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
2014-04-13 21:31:27,194 ERROR akka.remote.EndpointWriter: AssociationError 
[akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077] -> 
[akka.tcp://spark@hadoop-pg-7.cluster:50601]: Error [Association failed with 
[akka.tcp://spark@hadoop-pg-7.cluster:50601]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://spark@hadoop-pg-7.cluster:50601]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: 
hadoop-pg-7.cluster/10.147.210.7:50601
]
2014-04-13 21:31:27,199 INFO org.apache.spark.deploy.master.Master: 
akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,215 ERROR akka.remote.EndpointWriter: AssociationError 
[akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077] -> 
[akka.tcp://spark@hadoop-pg-7.cluster:50601]: Error [Association failed with 
[akka.tcp://spark@hadoop-pg-7.cluster:50601]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://spark@hadoop-pg-7.cluster:50601]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: 
hadoop-pg-7.cluster/10.147.210.7:50601
]
2014-04-13 21:31:27,222 INFO org.apache.spark.deploy.master.Master: 
akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,234 ERROR akka.remote.EndpointWriter: AssociationError 
[akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077] -> 
[akka.tcp://spark@hadoop-pg-7.cluster:50601]: Error [Association failed with 
[akka.tcp://spark@hadoop-pg-7.cluster:50601]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://spark@hadoop-pg-7.cluster:50601]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: 
hadoop-pg-7.cluster/10.147.210.7:50601
]
2014-04-13 21:31:27,238 INFO org.apache.spark.deploy.master.Master: 
akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.





What is going wrong here?



I get the same behaviour if I start the spark-shell on the worker and try to 
execute e.g. sc.parallelize(1 to 100, 10).count
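
(One way to separate an install problem from a cluster resource problem, my suggestion rather than something from this thread: run the same example in local mode. If it completes there, the binaries work and the standalone scheduler is simply never receiving a usable resource offer from the worker:)

```shell
cd /opt/cloudera/parcels/CDH/lib/spark
# local[2] runs the driver and two executor threads in a single JVM,
# bypassing the standalone master/worker entirely
./bin/pyspark ./python/examples/pi.py local[2]
```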



Any help highly appreciated,
Gerd









