I don't know why I don't see my last message in the thread here: https://lists.apache.org/thread/5wgdqp746nj4f6ovdl42rt82wc8ltkcn. I also don't get Artemis's messages in my mail - I can only see them in the thread web UI, which is very confusing. On top of that, when I click on "reply via your own email client" in the web UI, I get: "Bad Request, Error 400".
Anyway, to answer your last comment, Artemis:

> I guess there are several misconceptions here:

There's no confusion on my side - all of that makes sense. When I said "worker" in that comment I meant the scheduler worker, not the Spark worker - in the Spark realm that would be the client. Everything else you said is undoubtedly correct, but unrelated to the issue/problem at hand.

Sean, Artemis - I appreciate your feedback about the infra setup, but it's beside the problem behind this issue. Let me describe a simpler setup/example with the same problem, say:

1. I have a jupyter notebook
2. I use local/driver spark mode only
3. I start the driver, process some data, and store it in a pandas dataframe
4. now say I want to add a package to the spark driver (or increase the JVM memory etc.)

There's currently no way to do step 4 without restarting the notebook process, which holds the "reference" to the Spark driver/JVM. If I restart the Jupyter notebook I would lose all the data in memory (e.g. the pandas data). Of course I can save that data to e.g. disk, but that's beside the point.
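To make step 4 concrete, here is a minimal sketch of the silent no-op I mean (the memory values below are just examples):

```python
from pyspark.sql import SparkSession

# Start the driver with a 2g heap.
spark = SparkSession.builder.config("spark.driver.memory", "2g").getOrCreate()

# ... process some data, keep the results in a pandas dataframe ...

# Later, try to "reconfigure" the driver: getOrCreate() simply returns the
# already-running session. spark.driver.memory is only read when the JVM is
# launched, so the driver keeps its 2g heap - and no warning or error is
# raised.
spark = SparkSession.builder.config("spark.driver.memory", "8g").getOrCreate()
```

The same applies to `spark.jars.packages`: packages are resolved when the driver JVM starts, so adding one to the builder of a running session does nothing.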
I understand you don't want to provide this functionality in Spark, nor warn users about changes to the Spark configuration that won't actually take effect. As a user I wish I could get at least a warning in that case, but I respect your decision.

It seems like the workaround of shutting down the JVM works in this case - I would much appreciate your feedback about **that specific workaround** please. Any reason not to use it?
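For reference, applied to the notebook example above, the workaround looks roughly like this - a sketch built around the snippet from the issue (it pokes at pyspark-private attributes, and the package coordinates/memory value below are just examples):

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ... steps 1-3: process data, keep the results in a pandas dataframe ...

# Step 4 via the hard reset: tear down the driver JVM behind the session.
spark.stop()
spark._sc._gateway.shutdown()
spark._sc._gateway.proc.stdin.close()
SparkContext._gateway = None
SparkContext._jvm = None

# With the gateway cleared, the next getOrCreate() launches a fresh driver
# JVM, so the new package/memory settings actually take effect - while the
# python process (and the pandas data in it) stays alive.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.2.1")
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)
```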
Cheers - Rafal

On Thu, 10 Mar 2022 at 18:50, Rafał Wojdyła <ravwojd...@gmail.com> wrote:

> If you have a long running python orchestrator worker (e.g. a Luigi worker), and say it gets a DAG of A -> B -> C, and say the worker first creates a spark driver for A (which doesn't need extra jars/packages), then it gets B, which is also a spark job but needs an extra package - it won't be able to create a new spark driver with the extra packages, since it's "not possible" to create a new driver JVM. I would argue it's the same scenario if you have multiple spark jobs that need different amounts of memory, or anything else that requires a JVM restart. Of course I can use the workaround to shut down the driver/JVM - do you have any feedback about that workaround (see my previous comment or the issue)?
>
> On Thu, 10 Mar 2022 at 18:12, Sean Owen <sro...@gmail.com> wrote:
>
>> Wouldn't these be separately submitted jobs for separate workloads? You can of course dynamically change each job submitted to have whatever packages you like, from whatever is orchestrating. A single job doing everything doesn't sound right.
>>
>> On Thu, Mar 10, 2022, 12:05 PM Rafał Wojdyła <ravwojd...@gmail.com> wrote:
>>
>>> Because I can't (and should not) know ahead of time which jobs will be executed - that's the job of the orchestration layer (and it can be dynamic). I know I can specify multiple packages. Also not worried about memory.
>>>
>>> On Thu, 10 Mar 2022 at 13:54, Artemis User <arte...@dtechspace.com> wrote:
>>>
>>>> If changing packages or jars isn't your concern, why not just specify ALL the packages that you would need for the Spark environment? You know you can define multiple packages under the packages option. This shouldn't cause memory issues since the JVM uses dynamic class loading...
>>>>
>>>> On 3/9/22 10:03 PM, Rafał Wojdyła wrote:
>>>>
>>>> Hi Artemis,
>>>> Thanks for your input, to answer your questions:
>>>>
>>>> > You may want to ask yourself why it is necessary to change the jar packages during runtime.
>>>>
>>>> I have a long running orchestrator process which executes multiple spark jobs, currently on a single VM/driver; some of those jobs might require extra packages/jars (please see the example in the issue).
>>>>
>>>> > Changing package doesn't mean to reload the classes.
>>>>
>>>> AFAIU this is unrelated.
>>>>
>>>> > There is no way to reload the same class unless you customize the classloader of Spark.
>>>>
>>>> AFAIU this is an implementation detail.
>>>>
>>>> > I also don't think it is necessary to implement a warning or error message when changing the configuration since it doesn't do any harm
>>>>
>>>> To reiterate: right now the API allows changing the configuration of the context without that configuration actually taking effect. See these examples of confused users:
>>>> * https://stackoverflow.com/questions/41886346/spark-2-1-0-session-config-settings-pyspark
>>>> * https://stackoverflow.com/questions/53606756/how-to-set-spark-driver-memory-in-client-mode-pyspark-version-2-3-1
>>>>
>>>> I'm curious if you have any opinion about the "hard-reset" workaround, copy-pasting from the issue:
>>>>
>>>> ```
>>>> s: SparkSession = ...
>>>>
>>>> # Hard reset:
>>>> s.stop()
>>>> s._sc._gateway.shutdown()
>>>> s._sc._gateway.proc.stdin.close()
>>>> SparkContext._gateway = None
>>>> SparkContext._jvm = None
>>>> ```
>>>>
>>>> Cheers - Rafal
>>>>
>>>> On 2022/03/09 15:39:58 Artemis User wrote:
>>>> > This is indeed a JVM issue, not a Spark issue. You may want to ask yourself why it is necessary to change the jar packages during runtime. Changing package doesn't mean to reload the classes. There is no way to reload the same class unless you customize the classloader of Spark. I also don't think it is necessary to implement a warning or error message when changing the configuration since it doesn't do any harm. Spark uses lazy binding so you can do a lot of such "unharmful" things. Developers will have to understand the behaviors of each API before using them.
>>>> >
>>>> > On 3/9/22 9:31 AM, Rafał Wojdyła wrote:
>>>> > > Sean,
>>>> > > I understand you might be sceptical about adding this functionality into (py)spark, I'm curious:
>>>> > > * would an error/warning on updates to configuration that currently can't take effect (they require a restart of the JVM) be reasonable?
>>>> > > * what do you think about the workaround in the issue?
>>>> > > Cheers - Rafal
>>>> > >
>>>> > > On Wed, 9 Mar 2022 at 14:24, Sean Owen <sr...@gmail.com> wrote:
>>>> > >
>>>> > > Unfortunately this opens a lot more questions and problems than it solves. What if you take something off the classpath, for example? Change a class?
>>>> > >
>>>> > > On Wed, Mar 9, 2022 at 8:22 AM Rafał Wojdyła <ra...@gmail.com> wrote:
>>>> > >
>>>> > > Thanks Sean,
>>>> > > To be clear, if you prefer to change the label on this issue from bug to something else, feel free to do so, no strong opinions on my end. What happens to the classpath, and whether spark uses some classloader magic, is probably an implementation detail. That said, it's definitely not intuitive that you can change the configuration and get the context back (without the updated config taking effect) without any warnings/errors. Also, what would you recommend as a workaround or solution to this problem? Any comments about the workaround in the issue? Keep in mind that I can't restart the long running orchestration process (a python process, if that matters).
>>>> > > Cheers - Rafal
>>>> > >
>>>> > > On Wed, 9 Mar 2022 at 13:15, Sean Owen <sr...@gmail.com> wrote:
>>>> > >
>>>> > > That isn't a bug - you can't change the classpath once the JVM is executing.
>>>> > >
>>>> > > On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła <ra...@gmail.com> wrote:
>>>> > >
>>>> > > Hi,
>>>> > > My use case is that I have a long running process (orchestrator) with multiple tasks; some tasks might require extra spark dependencies. It seems that once the spark context is started it's not possible to update `spark.jars.packages`? I have reported an issue at https://issues.apache.org/jira/browse/SPARK-38438, together with a workaround ("hard reset of the cluster"). I wonder if anyone has a solution for this?
>>>> > > Cheers - Rafal