Github user mmalohlava commented on the pull request:
https://github.com/apache/spark/pull/2691#issuecomment-61567145
Sorry for the delayed answer. I was trying to come up with a better solution
that does not require modifying Spark.
However, regarding Sean's question:
* In our case we need to collect the actual distributed state of the cluster
(the approx. number of executors) in order to properly initialize services on
all available executors. The big picture of our use case: the proposed
solution starts a defined service on each executor, the services exchange
info with the master (collecting the number of available executors and the
executor ids), and based on that we reconfigure the services in the cluster
(they need to know the number of available Spark executors).
* I do not see a major security problem in the class loading, since Spark
already loads classes in the executor from the class path specified via the
`--jars` and `--files` parameters. The proposed solution uses the same
mechanism.
Nevertheless, in the meantime I was experimenting with a solution based on
Patrick's idea. It works in the following way:
* create a dummy RDD with a lot of partitions (i.e., try to force the
scheduler to plan execution on all available executors)
* run a `map` op on the RDD to collect the unique executor ids and the
approx. number of executors
* run another `map` which starts our service only on the collected
executors
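For illustration, the three steps above could look roughly like the sketch
below. This is not the code from the PR or from our project: `DemoService`,
`discoverExecutors`, `startServices` and the partition count are hypothetical
names made up for this example, and I am assuming that
`SparkEnv.get.executorId` (a developer API) is an acceptable way to read the
executor id inside a task. Since a task cannot be pinned to a given executor,
the second pass simply runs many tasks again and starts the service at most
once per executor JVM.

```scala
import java.util.concurrent.atomic.AtomicBoolean

import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

object ExecutorDiscoverySketch {

  // Hypothetical stand-in for the real service. The flag exists once per JVM,
  // so at most one task per executor wins the compareAndSet.
  object DemoService {
    private val started = new AtomicBoolean(false)

    // Returns true only for the first call in this JVM. A real service would
    // use expectedPeers to configure itself; here it is ignored.
    def startOnce(expectedPeers: Int): Boolean =
      started.compareAndSet(false, true)
  }

  // Steps 1 and 2: a dummy RDD with many partitions, then a map that reports
  // the id of whichever executor ran each task. The result is only an
  // approximation: executors that received no task are not seen.
  def discoverExecutors(sc: SparkContext, numPartitions: Int): Set[String] =
    sc.parallelize(1 to numPartitions, numPartitions)
      .map(_ => SparkEnv.get.executorId)
      .distinct()
      .collect()
      .toSet

  // Step 3: a second pass that starts the service on executors from the
  // collected set. Returns the ids of executors where the service was
  // actually started by this call.
  def startServices(sc: SparkContext,
                    executors: Set[String],
                    numPartitions: Int): Set[String] =
    sc.parallelize(1 to numPartitions, numPartitions)
      .flatMap { _ =>
        val id = SparkEnv.get.executorId
        if (executors.contains(id) && DemoService.startOnce(executors.size)) Some(id)
        else None
      }
      .collect()
      .toSet

  def main(args: Array[String]): Unit = {
    // Falls back to local mode when no master is given via spark-submit.
    val conf = new SparkConf()
      .setAppName("executor-discovery-sketch")
      .setIfMissing("spark.master", "local[2]")
    val sc = new SparkContext(conf)
    try {
      val numPartitions = 200 // arbitrary; should be well above the executor count
      val executors = discoverExecutors(sc, numPartitions)
      val started = startServices(sc, executors, numPartitions)
      println(s"Discovered ${executors.size} executor(s), started service on ${started.size}")
    } finally {
      sc.stop()
    }
  }
}
```

In local mode all tasks run inside the driver JVM, so only a single executor
id shows up. On a real cluster the number of ids found depends on how the
scheduler spreads the tasks.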
*The advantage of this solution:*
* it does not need any modification of the Spark infrastructure
*The major disadvantages of this solution:*
* it directly depends on task scheduling; in the worst case the scheduler
will plan execution of the initialization on only 1 of the available executors
* it is a hidden solution which does not expose the running services, and it
collects only an approximation of the state
* the overhead of creating a dummy RDD with many partitions and running two
map operations
From my point of view, it would be much cleaner and more beneficial to have a
solution which explicitly allows for interception of the executor lifecycle.