Hi Jeff,

Ok, I'll update the PR this evening to remove the environment variable... Although in some cases (for example for Docker) an environment variable could be handier - I need to look into it; maybe I'll rework that part so that an environment variable can also be used.

SparkLauncher picks up the Spark master from `master` first, then from `spark.master`, and otherwise just uses `local[*]`. But in several places in the Spark interpreter I have seen that only `spark.master` is read directly - I think that is what causes the problem.
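To make that lookup order concrete, here is a tiny Python sketch of the precedence described above (illustrative only - `resolve_master` is a hypothetical helper, not Zeppelin's actual Java code). A code path that reads `spark.master` directly skips the first step and never sees a value stored only under `master`:

```python
# Illustrative sketch of the precedence described above; not Zeppelin's real code.
def resolve_master(properties):
    """Return the effective Spark master from an interpreter property map."""
    return (properties.get("master")           # legacy property name
            or properties.get("spark.master")  # preferred property name
            or "local[*]")                     # default

# A value stored only under the legacy "master" key is honored here...
print(resolve_master({"master": "local[*]"}))      # -> local[*]
# ...but a lookup of "spark.master" alone would miss it entirely.
print({"master": "local[*]"}.get("spark.master"))  # -> None
```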
On Mon, May 18, 2020 at 8:56 AM Jeff Zhang <zjf...@gmail.com> wrote:

> The env name in interpreter.json and interpreter-setting.json is not used.
> We should remove them.
>
> I still don't understand how master & spark.master would affect the
> behavior. `master` is legacy stuff that we introduced a very long time ago;
> we definitely should use spark.master instead. But internally we do
> translate master to spark.master, so I'm not sure why it would cause this
> issue - maybe there is some bug.
>
> Alex Ott <alex...@gmail.com> wrote on Sun, May 17, 2020 at 9:36 PM:
>
>> I've seen somewhere in the CDH documentation that they use MASTER, that's
>> why I'm asking...
>>
>> On Sun, May 17, 2020 at 3:13 PM Patrik Iselind <patrik....@gmail.com>
>> wrote:
>>
>>> Thanks a lot for creating the issue. It seems I am not allowed to.
>>>
>>> As I understand it, the environment variable is supposed to be
>>> SPARK_MASTER.
>>>
>>> On Sun, May 17, 2020 at 11:56 AM Alex Ott <alex...@gmail.com> wrote:
>>>
>>>> Ok, I've created a JIRA for it:
>>>> https://issues.apache.org/jira/browse/ZEPPELIN-4821 and am working on a
>>>> patch.
>>>>
>>>> I'm not sure about the environment variable name - it's simply MASTER.
>>>> Should it be `SPARK_MASTER`, or is it a requirement of CDH and other
>>>> Hadoop distributions to have it as MASTER?
>>>>
>>>> On Sat, May 16, 2020 at 3:45 PM Patrik Iselind <patrik....@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Alex,
>>>>>
>>>>> Thanks a lot for helping out with this.
>>>>>
>>>>> You're correct, but it doesn't seem that it's the
>>>>> interpreter-setting.json for the Spark interpreter that is being used;
>>>>> it's conf/interpreter.json. In this file, both 0.8.2 and 0.9.0 have
>>>>>
>>>>> ```partial-json
>>>>> "spark": {
>>>>>   "id": "spark",
>>>>>   "name": "spark",
>>>>>   "group": "spark",
>>>>>   "properties": {
>>>>>     "SPARK_HOME": {
>>>>>       "name": "SPARK_HOME",
>>>>>       "value": "",
>>>>>       "type": "string",
>>>>>       "description": "Location of spark distribution"
>>>>>     },
>>>>>     "master": {
>>>>>       "name": "master",
>>>>>       "value": "local[*]",
>>>>>       "type": "string",
>>>>>       "description": "Spark master uri. local | yarn-client | yarn-cluster | spark master address of standalone mode, ex) spark://master_host:7077"
>>>>>     },
>>>>> ```
>>>>>
>>>>> That "master" should be "spark.master".
>>>>>
>>>>> By adding an explicit spark.master property with the value "local[*]" I
>>>>> can use all cores as expected. Without it, printing sc.master gives
>>>>> "local"; with spark.master set to "local[*]", printing sc.master gives
>>>>> "local[*]". My conclusion is that conf/interpreter.json isn't in sync
>>>>> with the interpreter-setting.json for the Spark interpreter.
>>>>>
>>>>> Best regards,
>>>>> Patrik Iselind
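(For anyone hitting the same thing: a quick way to double-check which master actually took effect is a short %pyspark paragraph like the sketch below - Zeppelin provides sc, and the exact numbers depend on your machine.)

```python
# Minimal sanity check for a %pyspark paragraph; `sc` is the SparkContext
# that Zeppelin creates for the notebook.
print(sc.master)              # expect "local[*]" once spark.master is set
print(sc.defaultParallelism)  # expect the number of local cores, not 1

# parallelize() without an explicit numSlices follows defaultParallelism,
# so the partition count is another quick way to see how many cores are used.
print(sc.parallelize(range(100)).getNumPartitions())
```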
>>>>> On Sat, May 16, 2020 at 11:22 AM Alex Ott <alex...@gmail.com> wrote:
>>>>>
>>>>>> Spark master is set to `local[*]` by default. Here is the corresponding
>>>>>> piece from interpreter-setting.json for the Spark interpreter:
>>>>>>
>>>>>>   "master": {
>>>>>>     "envName": "MASTER",
>>>>>>     "propertyName": "spark.master",
>>>>>>     "defaultValue": "local[*]",
>>>>>>     "description": "Spark master uri. local | yarn-client | yarn-cluster | spark master address of standalone mode, ex) spark://master_host:7077",
>>>>>>     "type": "string"
>>>>>>   },
>>>>>>
>>>>>> Patrik Iselind at "Sun, 10 May 2020 20:31:08 +0200" wrote:
>>>>>>
>>>>>> PI> Hi Jeff,
>>>>>> PI>
>>>>>> PI> I've tried the release from http://zeppelin.apache.org/download.html,
>>>>>> PI> both in a docker and without a docker. They both have the same issue
>>>>>> PI> as previously described.
>>>>>> PI>
>>>>>> PI> Can I somehow set spark.master to "local[*]" in Zeppelin, perhaps
>>>>>> PI> using some environment variable?
>>>>>> PI>
>>>>>> PI> When is the next Zeppelin 0.9.0 docker image planned to be released?
>>>>>> PI>
>>>>>> PI> Best Regards,
>>>>>> PI> Patrik Iselind
>>>>>> PI>
>>>>>> PI> On Sun, May 10, 2020 at 9:26 AM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>> PI>
>>>>>> PI>     Hi Patrik,
>>>>>> PI>
>>>>>> PI>     Do you mind trying the 0.9.0-preview? It might be an issue of the
>>>>>> PI>     docker container.
>>>>>> PI>
>>>>>> PI>     http://zeppelin.apache.org/download.html
>>>>>> PI>
>>>>>> PI>     Patrik Iselind <patrik....@gmail.com> wrote on Sun, May 10, 2020 at 2:30 AM:
>>>>>> PI>
>>>>>> PI>         Hello Jeff,
>>>>>> PI>
>>>>>> PI>         Thank you for looking into this for me.
>>>>>> PI>
>>>>>> PI>         Using the latest pushed docker image for 0.9.0 (image ID
>>>>>> PI>         92890adfadfb, built 6 weeks ago), I still see the same issue.
>>>>>> PI>         My image has the digest
>>>>>> PI>         "apache/zeppelin@sha256:0691909f6884319d366f5d3a5add8802738d6240a83b2e53e980caeb6c658092".
>>>>>> PI>
>>>>>> PI>         If it's not on the tip of master, could you guys please
>>>>>> PI>         release a newer 0.9.0 image?
>>>>>> PI>
>>>>>> PI>         Best Regards,
>>>>>> PI>         Patrik Iselind
>>>>>> PI>
>>>>>> PI>         On Sat, May 9, 2020 at 4:03 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>> PI>
>>>>>> PI>             This might be a bug in 0.8; I tried it in 0.9 (master
>>>>>> PI>             branch) and it works for me.
>>>>>> PI>
>>>>>> PI>             print(sc.master)
>>>>>> PI>             print(sc.defaultParallelism)
>>>>>> PI>
>>>>>> PI>             ---
>>>>>> PI>             local[*]
>>>>>> PI>             8
>>>>>> PI>
>>>>>> PI>             Patrik Iselind <patrik....@gmail.com> wrote on Sat, May 9, 2020 at 8:34 PM:
>>>>>> PI>
>>>>>> PI>                 Hi,
>>>>>> PI>
>>>>>> PI>                 First comes some background, then I have some
>>>>>> PI>                 questions.
>>>>>> PI>
>>>>>> PI>                 Background
>>>>>> PI>                 I'm trying out Zeppelin 0.8.2 based on the Docker
>>>>>> PI>                 image.
>>>>>> PI>                 My Dockerfile looks like this:
>>>>>> PI>
>>>>>> PI>                 ```Dockerfile
>>>>>> PI>                 FROM apache/zeppelin:0.8.2
>>>>>> PI>
>>>>>> PI>                 # Install some extra tools
>>>>>> PI>                 RUN apt-get -y update &&\
>>>>>> PI>                     DEBIAN_FRONTEND=noninteractive \
>>>>>> PI>                     apt -y install vim python3-pip
>>>>>> PI>
>>>>>> PI>                 RUN python3 -m pip install -U pyspark
>>>>>> PI>
>>>>>> PI>                 ENV PYSPARK_PYTHON python3
>>>>>> PI>                 ENV PYSPARK_DRIVER_PYTHON python3
>>>>>> PI>                 ```
>>>>>> PI>
>>>>>> PI>                 When I run a paragraph like this
>>>>>> PI>
>>>>>> PI>                 ```Zeppelin paragraph
>>>>>> PI>                 %pyspark
>>>>>> PI>
>>>>>> PI>                 print(sc)
>>>>>> PI>                 print()
>>>>>> PI>                 print(dir(sc))
>>>>>> PI>                 print()
>>>>>> PI>                 print(sc.master)
>>>>>> PI>                 print()
>>>>>> PI>                 print(sc.defaultParallelism)
>>>>>> PI>                 ```
>>>>>> PI>
>>>>>> PI>                 I get the following output:
>>>>>> PI>
>>>>>> PI>                 ```output
>>>>>> PI>                 <SparkContext master=local appName=Zeppelin>
>>>>>> PI>
>>>>>> PI>                 ['PACKAGE_EXTENSIONS', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_accumulatorServer', '_active_spark_context', '_batchSize', '_callsite', '_checkpointFile', '_conf', '_dictToJavaMap', '_do_init', '_ensure_initialized', '_gateway', '_getJavaStorageLevel', '_initialize_context', '_javaAccumulator', '_jsc', '_jvm', '_lock', '_next_accum_id', '_pickled_broadcast_vars', '_python_includes', '_repr_html_', '_temp_dir', '_unbatched_serializer', 'accumulator', 'addFile', 'addPyFile', 'appName', 'applicationId', 'binaryFiles', 'binaryRecords', 'broadcast', 'cancelAllJobs', 'cancelJobGroup', 'defaultMinPartitions', 'defaultParallelism', 'dump_profiles', 'emptyRDD', 'environment', 'getConf', 'getLocalProperty', 'getOrCreate', 'hadoopFile', 'hadoopRDD', 'master', 'newAPIHadoopFile', 'newAPIHadoopRDD', 'parallelize', 'pickleFile', 'profiler_collector', 'pythonExec', 'pythonVer', 'range', 'runJob', 'sequenceFile', 'serializer', 'setCheckpointDir', 'setJobGroup', 'setLocalProperty', 'setLogLevel', 'setSystemProperty', 'show_profiles', 'sparkHome', 'sparkUser', 'startTime', 'statusTracker', 'stop', 'textFile', 'uiWebUrl', 'union', 'version', 'wholeTextFiles']
>>>>>> PI>
>>>>>> PI>                 local
>>>>>> PI>
>>>>>> PI>                 1
>>>>>> PI>                 ```
>>>>>> PI>
>>>>>> PI>                 This is despite the "master" property in the
>>>>>> PI>                 interpreter being set to "local[*]". I'd like to use
>>>>>> PI>                 all cores on my machine.
>>>>>> PI>                 To do that I have to explicitly create a
>>>>>> PI>                 "spark.master" property in the Spark interpreter with
>>>>>> PI>                 the value "local[*]"; then I get
>>>>>> PI>
>>>>>> PI>                 ```new output
>>>>>> PI>                 <SparkContext master=local[*] appName=Zeppelin>
>>>>>> PI>
>>>>>> PI>                 ['PACKAGE_EXTENSIONS', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_accumulatorServer', '_active_spark_context', '_batchSize', '_callsite', '_checkpointFile', '_conf', '_dictToJavaMap', '_do_init', '_ensure_initialized', '_gateway', '_getJavaStorageLevel', '_initialize_context', '_javaAccumulator', '_jsc', '_jvm', '_lock', '_next_accum_id', '_pickled_broadcast_vars', '_python_includes', '_repr_html_', '_temp_dir', '_unbatched_serializer', 'accumulator', 'addFile', 'addPyFile', 'appName', 'applicationId', 'binaryFiles', 'binaryRecords', 'broadcast', 'cancelAllJobs', 'cancelJobGroup', 'defaultMinPartitions', 'defaultParallelism', 'dump_profiles', 'emptyRDD', 'environment', 'getConf', 'getLocalProperty', 'getOrCreate', 'hadoopFile', 'hadoopRDD', 'master', 'newAPIHadoopFile', 'newAPIHadoopRDD', 'parallelize', 'pickleFile', 'profiler_collector', 'pythonExec', 'pythonVer', 'range', 'runJob', 'sequenceFile', 'serializer', 'setCheckpointDir', 'setJobGroup', 'setLocalProperty', 'setLogLevel', 'setSystemProperty', 'show_profiles', 'sparkHome', 'sparkUser', 'startTime', 'statusTracker', 'stop', 'textFile', 'uiWebUrl', 'union', 'version', 'wholeTextFiles']
>>>>>> PI>
>>>>>> PI>                 local[*]
>>>>>> PI>
>>>>>> PI>                 8
>>>>>> PI>                 ```
>>>>>> PI>
>>>>>> PI>                 This is what I want.
>>>>>> PI>
>>>>>> PI>                 The Questions
>>>>>> PI>                 - Why is the "master" property not used in the created SparkContext?
>>>>>> PI>                 - How do I add the spark.master property to the docker image?
>>>>>> PI>
>>>>>> PI>                 Any hint or support you can provide would be greatly
>>>>>> PI>                 appreciated.
>>>>>> PI>
>>>>>> PI>                 Yours Sincerely,
>>>>>> PI>                 Patrik Iselind
>
> --
> Best Regards
>
> Jeff Zhang

--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)