Re: Missing Spark URL after staring the master

Ognen Duzlevski Mon, 03 Mar 2014 13:08:32 -0800

I should add that in this setup you really do not need to look for the printout of the master node's IP - you set it yourself a priori. If anyone is interested, let me know, I can write it all up so that people can follow some set of instructions. Who knows, maybe I can come up with a set of scripts to automate it all...
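
To illustrate the point about setting the address a priori, here is a minimal sketch of how the master URL follows from the values you configure yourself. The helper name master_url is made up for this example; the spark://HOST:PORT form and the default port 7077 come from the messages quoted below.

```python
def master_url(host, port=7077):
    """Build a standalone-mode master URL of the form spark://HOST:PORT.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    if not 0 < port < 65536:
        raise ValueError("port out of range: %d" % port)
    return "spark://%s:%d" % (host, port)


if __name__ == "__main__":
    # 10.10.0.200 is the master address from the setup described in this thread.
    print(master_url("10.10.0.200"))
```

Since you chose the IP and port in the configuration, the same string can be handed to workers or to SparkContext without waiting for the master to print it.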


Ognen



On 3/3/14, 3:02 PM, Ognen Duzlevski wrote:

I have a standalone Spark cluster running in an Amazon VPC that I set up by hand. All I did was provision the machines from a common AMI image (my underlying distribution is Ubuntu), I created a "sparkuser" on each machine and I have a /home/sparkuser/spark folder where I downloaded Spark. I did this on the master only: I ran sbt/sbt assembly and I set up conf/spark-env.sh to point to the master, which is an IP address (in my case 10.10.0.200; the port is the default 7077). I also set up the slaves file in the same subdirectory to list all 16 IP addresses of the worker nodes (in my case 10.10.0.201-216). After sbt/sbt assembly was done on the master, I then did cd ~/; tar -czf spark.tgz spark/ and I copied the resulting tgz file to each worker using the same "sparkuser" account and unpacked the .tgz on each slave (this effectively replicates everything from the master to all slaves - you can script this so you don't do it by hand).
Your AMI should have the distribution's version of Java and git installed, by the way.
All you have to do then is sparkuser@spark-master> spark/sbin/start-all.sh (for 0.9; in 0.8.1 it is spark/bin/start-all.sh) and it will all automagically start :)
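
As a rough sketch of how the by-hand steps above could be scripted: the snippet below generates the worker addresses for the slaves file and prints the per-worker copy and unpack commands instead of running them. The function names are invented for this example, and it assumes passwordless SSH for the "sparkuser" account and a home-directory layout matching the one described above.

```python
import ipaddress


def worker_ips(first, count):
    """Return `count` consecutive IPv4 addresses starting at `first`."""
    start = ipaddress.IPv4Address(first)
    return [str(start + i) for i in range(count)]


def slaves_file(ips):
    """Contents for the slaves file: one worker address per line."""
    return "\n".join(ips) + "\n"


def copy_commands(ips, user="sparkuser", tarball="spark.tgz"):
    """Shell commands that copy the tarball to each worker and unpack it.

    Assumption: passwordless SSH works for `user` on every worker.
    The commands are only returned as strings, never executed here.
    """
    cmds = []
    for ip in ips:
        cmds.append("scp %s %s@%s:~/" % (tarball, user, ip))
        cmds.append("ssh %s@%s 'tar -xzf %s'" % (user, ip, tarball))
    return cmds


if __name__ == "__main__":
    ips = worker_ips("10.10.0.201", 16)  # 10.10.0.201-216, as in this thread
    print(slaves_file(ips), end="")
    for cmd in copy_commands(ips):
        print(cmd)
```

Printing the commands first makes it easy to review them before piping the output to a shell.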
All my Amazon nodes come with 4x400 GB of ephemeral space which I have set up into a 1.6 TB RAID0 array on each node and I am pooling this into an HDFS filesystem which is operated by a namenode outside the Spark cluster, while all the datanodes are the same nodes as the Spark workers. This enables replication and extremely fast access since ephemeral is much faster than EBS or anything else on Amazon (you can do even better with SSD drives on this setup but it will cost ya).
If anyone is interested I can document our pipeline setup - I came up with it myself and do not have a clue as to what the industry standards are, since I could not find any written instructions anywhere online about how to set up a whole data analytics pipeline from the point of ingestion to the point of analytics (people don't want to share their secrets? or am I just in the dark and incapable of using Google properly?). My requirement was that I wanted this to run within a VPC for added security and simplicity; the Amazon security groups get really old quickly. An added bonus is that you can use a VPN as an entry into the whole system and your cluster instantly becomes "local" to you in terms of IPs etc. I use OpenVPN since I like neither Cisco nor Juniper (the only two options Amazon provides for their VPN gateways).
Ognen


On 3/3/14, 1:00 PM, Bin Wang wrote:
Hi there,
I have a CDH cluster set up, and I tried using the Spark parcel that comes with Cloudera Manager, but it turned out they don't even have the run-example shell command in the bin folder. Then I removed it from the cluster, cloned incubator-spark onto the name node of my cluster, and built from source there successfully with everything as default.
I ran a few examples and everything seems to work fine in local mode. Now I am thinking about scaling it to my cluster, which is what the "DISTRIBUTE + ACTIVATE" command does in Cloudera Manager. I want to add all the datanodes to the slaves and think I should run Spark in standalone mode.
Say I am trying to set up Spark in standalone mode following these instructions:
https://spark.incubator.apache.org/docs/latest/spark-standalone.html
However, it says "Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the "master" argument to SparkContext. You can also find this URL on the master's web UI, which is http://localhost:8080 by default."
After I started the master, no URL was printed on the screen and the web UI is not running either.
Here is the output:
[root@box incubator-spark]# ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /root/bwang_spark_new/incubator-spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out
First question: am I even in the ballpark to run Spark in standalone mode if I want to fully utilize my cluster? I saw there are four ways to launch Spark on a cluster: AWS EC2, Spark standalone, Apache Mesos, Hadoop YARN... I guess standalone mode is the way to go?
Second question: how do I get the Spark URL of the cluster, and why is the output not like what the instructions say?
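
One possible way to look into the second question is to search the log file named in the start-master.sh output for anything shaped like spark://HOST:PORT. This is only a sketch: it assumes the master writes its URL into that log, which the output above does not confirm, and the function names are made up for this example.

```python
import re

# Matches URLs of the form spark://HOST:PORT, as quoted from the docs above.
SPARK_URL = re.compile(r"spark://[A-Za-z0-9_.\-]+:\d+")


def find_spark_urls(text):
    """Return the distinct spark:// URLs found in `text`, in order of appearance."""
    seen = []
    for url in SPARK_URL.findall(text):
        if url not in seen:
            seen.append(url)
    return seen


def find_spark_urls_in_file(path):
    """Read a log file and return any spark:// URLs it contains."""
    with open(path, errors="replace") as fh:
        return find_spark_urls(fh.read())
```

Calling find_spark_urls_in_file on the .out path printed by start-master.sh returns an empty list if no URL was logged, which would itself suggest the master did not come up cleanly.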
Best regards,

Bin
--
Some people, when confronted with a problem, think "I know, I'll use regular 
expressions." Now they have two problems.
-- Jamie Zawinski


