[ https://issues.apache.org/jira/browse/SPARK-53364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18015941#comment-18015941 ]
Antony commented on SPARK-53364:
--------------------------------

I will do it :)

> spark spark.local.dir overwrite for yarn cluster execution
> ----------------------------------------------------------
>
>                 Key: SPARK-53364
>                 URL: https://issues.apache.org/jira/browse/SPARK-53364
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.5.6, 4.0.0
>            Reporter: Antony
>            Priority: Major
>              Labels: features, hadoop, yarn
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h3. Context:
> I have a Hadoop cluster with the following configuration:
> * One 200 GB OS disk
> * Two SSD/NVMe disks (1-2 TB each)
> * Eight 16 TB HDDs
>
> In Hadoop/YARN, I cannot set different yarn.nodemanager.local-dirs
> locations for the file cache and the application cache.
> h3. Problem:
> I need the SSDs when starting a Python job (for the Python env file cache),
> and I need the HDDs when starting an extra-large SQL job.
> I have looked at the spark.local.dir property, but it only applies to Spark
> and does not solve the YARN-level configuration issue.
> h3. Resolution:
> I want to configure Spark to use a specific directory, instead of always
> using the directories provided by the YARN NodeManager, as happens here:
>
> {code:java}
> def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
>   val shuffleServiceEnabled = conf.get(config.SHUFFLE_SERVICE_ENABLED)
>   if (isRunningInYarnContainer(conf)) {
>     // If we are in yarn mode, systems can have different disk layouts so we must set it
>     // to what Yarn on this system said was available. Note this assumes that Yarn has
>     // created the directories already, and that they are secured so that only the
>     // user has access to them.
>     randomizeInPlace(getYarnLocalDirs(conf).split(","))
>   } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>     conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
>   } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>     conf.getenv("SPARK_LOCAL_DIRS").split(",")
>   } else if (conf.getenv("MESOS_SANDBOX") != null && !shuffleServiceEnabled) {
>     // Mesos already creates a directory per Mesos task. Spark should use that directory
>     // instead so all temporary files are automatically cleaned up when the Mesos task ends.
>     // Note that we don't want this if the shuffle service is enabled because we want to
>     // continue to serve shuffle files after the executors that wrote them have already exited.
>     Array(conf.getenv("MESOS_SANDBOX"))
>   } else {
>     if (conf.getenv("MESOS_SANDBOX") != null && shuffleServiceEnabled) {
>       logInfo("MESOS_SANDBOX available but not using provided Mesos sandbox because " +
>         s"${config.SHUFFLE_SERVICE_ENABLED.key} is enabled.")
>     }
>     // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
>     // configuration to point to a secure directory. So create a subdirectory with restricted
>     // permissions under each listed directory.
>     conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
>   }
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
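The precedence change being requested could be sketched as follows. This is a minimal, self-contained illustration (not a patch to Spark itself): the key `spark.yarn.local.dir.override` and the helper `chooseLocalDirs` are hypothetical names invented here to show the idea of letting an explicit user setting win over the YARN-provided directories, with the existing `spark.local.dir` / `java.io.tmpdir` behavior as the final fallback.

```scala
object LocalDirsSketch {
  // conf: a flat key/value view of the Spark configuration (simplified stand-in
  // for SparkConf); yarnDirs: the comma-separated dirs YARN would hand us, if any.
  // "spark.yarn.local.dir.override" is a HYPOTHETICAL key, not an existing config.
  def chooseLocalDirs(conf: Map[String, String],
                      yarnDirs: Option[String]): Array[String] = {
    conf.get("spark.yarn.local.dir.override") match {
      case Some(dirs) =>
        dirs.split(",")                     // explicit per-job override wins
      case None =>
        yarnDirs match {
          case Some(dirs) => dirs.split(",") // current behavior: trust YARN's dirs
          case None =>
            conf.getOrElse("spark.local.dir",
              System.getProperty("java.io.tmpdir")).split(",")
        }
    }
  }
}
```

With this precedence, a Python job could point the override at the SSDs while a large SQL job points it at the HDDs, without touching yarn.nodemanager.local-dirs.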