The logs are not the problem; it's the shuffle files that are not being cleaned up. We do have the configs for log rolling in place, and that is working just fine.
ex: /mnt/blockmgr-d65d4a74-d59a-4a06-af93-ba29232f7c5b/31/shuffle_1_46_0.data

> On May 30, 2018, at 9:54 AM, Ajay <ajay.ku...@gmail.com> wrote:
> 
> I have used these configs in the past to clean up the executor logs.
> 
> .set("spark.executor.logs.rolling.time.interval", "minutely")
> .set("spark.executor.logs.rolling.strategy", "time")
> .set("spark.executor.logs.rolling.maxRetainedFiles", "1")
> 
> On Wed, May 30, 2018 at 8:49 AM Jeff Frylings <jeff.fryli...@oracle.com> wrote:
> 
> Intermittently on spark executors we are seeing blockmgr directories not
> being cleaned up after execution, which is filling up disk. These executors
> use Mesos dynamic resource allocation, and no single app using an executor
> seems to be the culprit. Sometimes an app will run and be cleaned up, and
> then on a subsequent run that same AppExecId will run and not be cleaned up.
> The runs that left behind folders did not have any obvious task failures in
> the Spark UI during that time frame.
> 
> The Spark shuffle service in the AMI is version 2.1.1.
> The code is running on Spark 2.0.2 in the Mesos sandbox.
> 
> In a case where files are cleaned up, the spark.log looks like the following:
> 
> 18/05/28 14:47:24 INFO ExternalShuffleBlockResolver: Registered executor
> AppExecId{appId=33d8fe79-a670-4277-b6f3-ee1049724204-8310, execId=95} with
> ExecutorShuffleInfo{localDirs=[/mnt/blockmgr-b2c7ff97-481e-4482-b9ca-92a5f8d4b25e],
> subDirsPerLocalDir=64,
> shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
> ...
> 18/05/29 02:54:09 INFO MesosExternalShuffleBlockHandler: Application
> 33d8fe79-a670-4277-b6f3-ee1049724204-8310 timed out. Removing shuffle files.
> 18/05/29 02:54:09 INFO ExternalShuffleBlockResolver: Application
> 33d8fe79-a670-4277-b6f3-ee1049724204-8310 removed, cleanupLocalDirs = true
> 
> In a case where files are not cleaned up, we do not see the
> "MesosExternalShuffleBlockHandler: Application <appId> timed out. Removing
> shuffle files." message.
> 
> We are using the config "--conf spark.worker.cleanup.enabled=true" when
> starting the job, but I believe this only pertains to standalone mode, and
> we are using the Mesos deployment mode, so I don't think this flag actually
> does anything.
> 
> Thanks,
> Jeff
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-- 
Thanks,
Ajay
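Until the root cause of the missed shuffle-service cleanup is found, one interim workaround is an agent-side cron job that removes stale blockmgr-* directories. The sketch below is an assumption, not part of the thread: the /mnt root and the 2-day age threshold are placeholders you would tune to your Mesos agent layout and longest expected job runtime, so that directories still in use are never touched.

```shell
#!/bin/sh
# Hypothetical cleanup sketch: delete blockmgr-* directories whose mtime is
# older than MAX_AGE_DAYS. SPARK_LOCAL_ROOT and MAX_AGE_DAYS are assumptions;
# adjust them for your environment before use.
SPARK_LOCAL_ROOT="${SPARK_LOCAL_ROOT:-/mnt}"
MAX_AGE_DAYS="${MAX_AGE_DAYS:-2}"

# Nothing to do if the local-dir root is absent on this host.
[ -d "$SPARK_LOCAL_ROOT" ] || exit 0

# -maxdepth 1 keeps find from descending into directories it is deleting;
# -mtime +N matches directories not modified in more than N days.
find "$SPARK_LOCAL_ROOT" -maxdepth 1 -type d -name 'blockmgr-*' \
    -mtime +"$MAX_AGE_DAYS" -exec rm -rf {} +
```

A conservative threshold matters here: since the report says some directories belong to still-registered executors, age-based deletion should only reclaim directories no running application could plausibly own.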