I've seen somewhere in the CDH documentation that they use MASTER; that's
why I'm asking...

On Sun, May 17, 2020 at 3:13 PM Patrik Iselind <patrik....@gmail.com> wrote:

> Thanks a lot for creating the issue. It seems I am not allowed to.
>
> As I understand it, the environment variable is supposed to be
> SPARK_MASTER.
>
> On Sun, May 17, 2020 at 11:56 AM Alex Ott <alex...@gmail.com> wrote:
>
>> OK, I've created a JIRA for it:
>> https://issues.apache.org/jira/browse/ZEPPELIN-4821 and am working on a patch.
>>
>> I'm not sure about the environment variable name - right now it's simply
>> MASTER. Should it be `SPARK_MASTER`, or is it a requirement of CDH and
>> other Hadoop distributions to have it as MASTER?
>>
>> On Sat, May 16, 2020 at 3:45 PM Patrik Iselind <patrik....@gmail.com>
>> wrote:
>>
>>> Hi Alex,
>>>
>>> Thanks a lot for helping out with this.
>>>
>>> You're correct, but it doesn't seem to be the interpreter-settings.json
>>> for the Spark interpreter that is being used; it's conf/interpreter.json.
>>> In that file, both 0.8.2 and 0.9.0 have
>>> ```partial-json
>>>     "spark": {
>>>       "id": "spark",
>>>       "name": "spark",
>>>       "group": "spark",
>>>       "properties": {
>>>         "SPARK_HOME": {
>>>           "name": "SPARK_HOME",
>>>           "value": "",
>>>           "type": "string",
>>>           "description": "Location of spark distribution"
>>>         },
>>>         "master": {
>>>           "name": "master",
>>>           "value": "local[*]",
>>>           "type": "string",
>>>           "description": "Spark master uri. local | yarn-client |
>>> yarn-cluster | spark master address of standalone mode, ex)
>>> spark://master_host:7077"
>>>         },
>>> ```
>>> That "master" should be "spark.master".
>>>
>>> By adding an explicit spark.master property with the value "local[*]" I
>>> can use all cores as expected. Without it, printing sc.master gives
>>> "local"; with spark.master set to "local[*]", printing sc.master gives
>>> "local[*]". My conclusion is that conf/interpreter.json isn't in sync
>>> with the interpreter-settings.json for the Spark interpreter.
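>>>
>>> For illustration, this is roughly what I'd expect the corrected entry in
>>> conf/interpreter.json to look like (a sketch only; the key "master" is
>>> renamed to "spark.master", everything else copied from the snippet above):
>>> ```partial-json
>>>         "spark.master": {
>>>           "name": "spark.master",
>>>           "value": "local[*]",
>>>           "type": "string",
>>>           "description": "Spark master uri. local | yarn-client |
>>> yarn-cluster | spark master address of standalone mode, ex)
>>> spark://master_host:7077"
>>>         },
>>> ```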
>>>
>>> Best regards,
>>> Patrik Iselind
>>>
>>>
>>> On Sat, May 16, 2020 at 11:22 AM Alex Ott <alex...@gmail.com> wrote:
>>>
>>>> Spark master is set to `local[*]` by default. Here is the corresponding
>>>> piece from interpreter-settings.json for the Spark interpreter:
>>>>
>>>>       "master": {
>>>>         "envName": "MASTER",
>>>>         "propertyName": "spark.master",
>>>>         "defaultValue": "local[*]",
>>>>         "description": "Spark master uri. local | yarn-client |
>>>> yarn-cluster | spark master address of standalone mode, ex)
>>>> spark://master_host:7077",
>>>>         "type": "string"
>>>>       },
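>>>>
>>>> So, as a sketch only (assuming the envName mapping above is applied when
>>>> the interpreter is launched), the default could in principle be
>>>> overridden via the environment, for example in conf/zeppelin-env.sh:
>>>>
>>>> ```sh
>>>> # Sets the MASTER environment variable that interpreter-settings.json
>>>> # maps onto the spark.master property (default: local[*]).
>>>> export MASTER='local[*]'
>>>> ```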
>>>>
>>>>
>>>> Patrik Iselind  at "Sun, 10 May 2020 20:31:08 +0200" wrote:
>>>>  PI> Hi Jeff,
>>>>
>>>>  PI> I've tried the release from
>>>> http://zeppelin.apache.org/download.html, both inside Docker and without
>>>> Docker. Both have the same issue as
>>>>  PI> previously described.
>>>>
>>>>  PI> Can I somehow set spark.master to "local[*]" in zeppelin, perhaps
>>>> using some environment variable?
>>>>
>>>>  PI> When is the next Zeppelin 0.9.0 Docker image planned to be
>>>> released?
>>>>
>>>>  PI> Best Regards,
>>>>  PI> Patrik Iselind
>>>>
>>>>  PI> On Sun, May 10, 2020 at 9:26 AM Jeff Zhang <zjf...@gmail.com>
>>>> wrote:
>>>>
>>>>  PI>     Hi Patrik,
>>>>  PI>
>>>>  PI>     Do you mind trying the 0.9.0-preview? It might be an issue with
>>>> the Docker container.
>>>>  PI>
>>>>  PI>     http://zeppelin.apache.org/download.html
>>>>
>>>>  PI>     Patrik Iselind <patrik....@gmail.com> wrote on Sun, May 10, 2020 at 2:30 AM:
>>>>  PI>
>>>>  PI>         Hello Jeff,
>>>>  PI>
>>>>  PI>         Thank you for looking into this for me.
>>>>  PI>
>>>>  PI>         Using the latest pushed Docker image for 0.9.0 (image
>>>> ID 92890adfadfb, built 6 weeks ago), I still see the same issue. My image
>>>> has the digest
>>>> "apache/zeppelin@sha256:0691909f6884319d366f5d3a5add8802738d6240a83b2e53e980caeb6c658092".
>>>>  PI>
>>>>  PI>         If it's not on the tip of master, could you guys please
>>>> release a newer 0.9.0 image?
>>>>  PI>
>>>>  PI>         Best Regards,
>>>>  PI>         Patrik Iselind
>>>>
>>>>  PI>         On Sat, May 9, 2020 at 4:03 PM Jeff Zhang <
>>>> zjf...@gmail.com> wrote:
>>>>  PI>
>>>>  PI>             This might be a bug in 0.8. I tried it in 0.9
>>>> (master branch) and it works for me.
>>>>  PI>
>>>>  PI>             print(sc.master)
>>>>  PI>             print(sc.defaultParallelism)
>>>>  PI>
>>>>  PI>             ---
>>>>  PI>             local[*] 8
>>>>
>>>>  PI>             Patrik Iselind <patrik....@gmail.com>
>>>> wrote on Sat, May 9, 2020 at 8:34 PM:
>>>>  PI>
>>>>  PI>                 Hi,
>>>>  PI>
>>>>  PI>                 First comes some background, then I have some
>>>> questions.
>>>>  PI>
>>>>  PI>                 Background
>>>>  PI>                 I'm trying out Zeppelin 0.8.2 based on the Docker
>>>> image. My Dockerfile looks like this:
>>>>  PI>
>>>>  PI>                 ```Dockerfile
>>>>  PI>                 FROM apache/zeppelin:0.8.2
>>>>
>>>>  PI>
>>>>  PI>                 # Install some extra tools (vim, python3-pip)
>>>>  PI>                 RUN apt-get -y update &&\
>>>>  PI>                     DEBIAN_FRONTEND=noninteractive \
>>>>  PI>                         apt -y install vim python3-pip
>>>>  PI>
>>>>  PI>                 RUN python3 -m pip install -U pyspark
>>>>  PI>
>>>>  PI>                 ENV PYSPARK_PYTHON python3
>>>>  PI>                 ENV PYSPARK_DRIVER_PYTHON python3
>>>>  PI>                 ```
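>>>>  PI>
>>>>  PI>                 I build and run it roughly like this (just a
>>>> sketch; the tag my-zeppelin is a placeholder name I picked, and 8080 is
>>>> Zeppelin's default web UI port):
>>>>  PI>
>>>>  PI>                 ```sh
>>>>  PI>                 # Build the image from the Dockerfile above and
>>>>  PI>                 # expose the Zeppelin web UI on port 8080.
>>>>  PI>                 docker build -t my-zeppelin .
>>>>  PI>                 docker run --rm -p 8080:8080 my-zeppelin
>>>>  PI>                 ```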
>>>>  PI>
>>>>  PI>                 When I run a paragraph like this:
>>>>  PI>
>>>>  PI>                 ```Zeppelin paragraph
>>>>  PI>                 %pyspark
>>>>  PI>
>>>>  PI>                 print(sc)
>>>>  PI>                 print()
>>>>  PI>                 print(dir(sc))
>>>>  PI>                 print()
>>>>  PI>                 print(sc.master)
>>>>  PI>                 print()
>>>>  PI>                 print(sc.defaultParallelism)
>>>>  PI>                 ```
>>>>  PI>
>>>>  PI>                 I get the following output
>>>>  PI>
>>>>  PI>                 ```output
>>>>  PI>                 <SparkContext master=local appName=Zeppelin>
>>>> ['PACKAGE_EXTENSIONS', '__class__', '__delattr__', '__dict__', '__dir__',
>>>>  PI>                 '__doc__', '__enter__', '__eq__', '__exit__',
>>>> '__format__', '__ge__', '__getattribute__', '__getnewargs__', '__gt__',
>>>>  PI>                 '__hash__', '__init__', '__le__', '__lt__',
>>>> '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
>>>> '__repr__',
>>>>  PI>                 '__setattr__', '__sizeof__', '__str__',
>>>> '__subclasshook__', '__weakref__', '_accumulatorServer',
>>>> '_active_spark_context',
>>>>  PI>                 '_batchSize', '_callsite', '_checkpointFile',
>>>> '_conf', '_dictToJavaMap', '_do_init', '_ensure_initialized', '_gateway',
>>>>  PI>                 '_getJavaStorageLevel', '_initialize_context',
>>>> '_javaAccumulator', '_jsc', '_jvm', '_lock', '_next_accum_id',
>>>>  PI>                 '_pickled_broadcast_vars', '_python_includes',
>>>> '_repr_html_', '_temp_dir', '_unbatched_serializer', 'accumulator',
>>>> 'addFile',
>>>>  PI>                 'addPyFile', 'appName', 'applicationId',
>>>> 'binaryFiles', 'binaryRecords', 'broadcast', 'cancelAllJobs',
>>>> 'cancelJobGroup',
>>>>  PI>                 'defaultMinPartitions', 'defaultParallelism',
>>>> 'dump_profiles', 'emptyRDD', 'environment', 'getConf', 'getLocalProperty',
>>>>  PI>                 'getOrCreate', 'hadoopFile', 'hadoopRDD',
>>>> 'master', 'newAPIHadoopFile', 'newAPIHadoopRDD', 'parallelize',
>>>> 'pickleFile',
>>>>  PI>                 'profiler_collector', 'pythonExec', 'pythonVer',
>>>> 'range', 'runJob', 'sequenceFile', 'serializer', 'setCheckpointDir',
>>>>  PI>                 'setJobGroup', 'setLocalProperty', 'setLogLevel',
>>>> 'setSystemProperty', 'show_profiles', 'sparkHome', 'sparkUser', 
>>>> 'startTime',
>>>>  PI>                 'statusTracker', 'stop', 'textFile', 'uiWebUrl',
>>>> 'union', 'version', 'wholeTextFiles'] local 1
>>>>  PI>                 ```
>>>>  PI>
>>>>  PI>                 This happens even though the "master" property in the
>>>> interpreter is set to "local[*]". I'd like to use all cores on my machine.
>>>>  PI>                 To do that I have to explicitly create the
>>>> "spark.master" property in the Spark interpreter with the value
>>>> "local[*]"; then I get
>>>>  PI>
>>>>  PI>                 ```new output
>>>>  PI>                 <SparkContext master=local[*] appName=Zeppelin>
>>>> ['PACKAGE_EXTENSIONS', '__class__', '__delattr__', '__dict__', '__dir__',
>>>>  PI>                 '__doc__', '__enter__', '__eq__', '__exit__',
>>>> '__format__', '__ge__', '__getattribute__', '__getnewargs__', '__gt__',
>>>>  PI>                 '__hash__', '__init__', '__le__', '__lt__',
>>>> '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
>>>> '__repr__',
>>>>  PI>                 '__setattr__', '__sizeof__', '__str__',
>>>> '__subclasshook__', '__weakref__', '_accumulatorServer',
>>>> '_active_spark_context',
>>>>  PI>                 '_batchSize', '_callsite', '_checkpointFile',
>>>> '_conf', '_dictToJavaMap', '_do_init', '_ensure_initialized', '_gateway',
>>>>  PI>                 '_getJavaStorageLevel', '_initialize_context',
>>>> '_javaAccumulator', '_jsc', '_jvm', '_lock', '_next_accum_id',
>>>>  PI>                 '_pickled_broadcast_vars', '_python_includes',
>>>> '_repr_html_', '_temp_dir', '_unbatched_serializer', 'accumulator',
>>>> 'addFile',
>>>>  PI>                 'addPyFile', 'appName', 'applicationId',
>>>> 'binaryFiles', 'binaryRecords', 'broadcast', 'cancelAllJobs',
>>>> 'cancelJobGroup',
>>>>  PI>                 'defaultMinPartitions', 'defaultParallelism',
>>>> 'dump_profiles', 'emptyRDD', 'environment', 'getConf', 'getLocalProperty',
>>>>  PI>                 'getOrCreate', 'hadoopFile', 'hadoopRDD',
>>>> 'master', 'newAPIHadoopFile', 'newAPIHadoopRDD', 'parallelize',
>>>> 'pickleFile',
>>>>  PI>                 'profiler_collector', 'pythonExec', 'pythonVer',
>>>> 'range', 'runJob', 'sequenceFile', 'serializer', 'setCheckpointDir',
>>>>  PI>                 'setJobGroup', 'setLocalProperty', 'setLogLevel',
>>>> 'setSystemProperty', 'show_profiles', 'sparkHome', 'sparkUser', 
>>>> 'startTime',
>>>>  PI>                 'statusTracker', 'stop', 'textFile', 'uiWebUrl',
>>>> 'union', 'version', 'wholeTextFiles'] local[*] 8
>>>>  PI>                 ```
>>>>  PI>                 This is what I want.
>>>>  PI>
>>>>  PI>                 The Questions
>>>>  PI>                   - Why is the "master" property not used in the
>>>> created SparkContext?
>>>>  PI>                   - How do I add the spark.master property to the
>>>> Docker image?
>>>>  PI>
>>>>  PI>                 Any hint or support you can provide would be
>>>> greatly appreciated.
>>>>  PI>
>>>>  PI>                 Yours Sincerely,
>>>>  PI>                 Patrik Iselind
>>>>
>>>>  PI>             --
>>>>  PI>             Best Regards
>>>>  PI>
>>>>  PI>             Jeff Zhang
>>>>
>>>>  PI>     --
>>>>  PI>     Best Regards
>>>>  PI>
>>>>  PI>     Jeff Zhang
>>>>
>>>>
>>>>
>>>> --
>>>> With best wishes,                    Alex Ott
>>>> http://alexott.net/
>>>> Twitter: alexott_en (English), alexott (Russian)
>>>>
>>>
>>
>> --
>> With best wishes,                    Alex Ott
>> http://alexott.net/
>> Twitter: alexott_en (English), alexott (Russian)
>>
>

-- 
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
