The logs are not the problem; it's the shuffle files that are not being cleaned up. We do have the configs for log rolling in place, and that is working just fine.
ex: /mnt/blockmgr-d65d4a74-d59a-4a06-af93-ba29232f7c5b/31/shuffle_1_46_0.data

> On May 30, 2018, at 9:54 AM, Ajay <ajay.ku...@gmail.com> wrote:
> 
> I have used these configs in the past to clean up the executor logs.
> 
> .set("spark.executor.logs.rolling.time.interval", "minutely")
> .set("spark.executor.logs.rolling.strategy", "time")
> .set("spark.executor.logs.rolling.maxRetainedFiles", "1")
> 
> On Wed, May 30, 2018 at 8:49 AM Jeff Frylings <jeff.fryli...@oracle.com> wrote:
> 
> Intermittently on spark executors we are seeing blockmgr directories not
> being cleaned up after execution, which is filling up disk. These executors
> use Mesos dynamic resource allocation, and no single app using an executor
> seems to be the culprit. Sometimes an app will run and be cleaned up, and
> then on a subsequent run that same AppExecId will run and not be cleaned up.
> The runs that left behind folders did not have any obvious task failures in
> the Spark UI during that time frame.
> 
> The Spark shuffle service in the AMI is version 2.1.1.
> The code is running on Spark 2.0.2 in the Mesos sandbox.
> 
> In a case where files are cleaned up, the spark.log looks like the following:
> 
> 18/05/28 14:47:24 INFO ExternalShuffleBlockResolver: Registered executor
> AppExecId{appId=33d8fe79-a670-4277-b6f3-ee1049724204-8310, execId=95} with
> ExecutorShuffleInfo{localDirs=[/mnt/blockmgr-b2c7ff97-481e-4482-b9ca-92a5f8d4b25e],
> subDirsPerLocalDir=64,
> shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
> ...
> 18/05/29 02:54:09 INFO MesosExternalShuffleBlockHandler: Application
> 33d8fe79-a670-4277-b6f3-ee1049724204-8310 timed out. Removing shuffle files.
> 18/05/29 02:54:09 INFO ExternalShuffleBlockResolver: Application
> 33d8fe79-a670-4277-b6f3-ee1049724204-8310 removed, cleanupLocalDirs = true
> 
> In a case where files are not cleaned up, we do not see the
> "MesosExternalShuffleBlockHandler: Application <appId> timed out. Removing
> shuffle files." message.
> 
> We are using the config "--conf spark.worker.cleanup.enabled=true" when
> starting the job, but I believe this only pertains to standalone mode, and
> we are using the Mesos deployment mode, so I don't think this flag actually
> does anything.
> 
> Thanks,
> Jeff
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-- 
Thanks,
Ajay
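Until the root cause of the missed shuffle-service cleanup is found, one interim workaround is an agent-side cron job that removes stale blockmgr-* directories. The sketch below is an assumption, not part of the thread: the /mnt root and the 2-day age threshold are placeholders you would tune to your Mesos agent layout and longest expected job runtime, so that directories still in use are never touched.

```shell
#!/bin/sh
# Hypothetical cleanup sketch: delete blockmgr-* directories whose mtime is
# older than MAX_AGE_DAYS. SPARK_LOCAL_ROOT and MAX_AGE_DAYS are assumptions;
# adjust them for your environment before use.
SPARK_LOCAL_ROOT="${SPARK_LOCAL_ROOT:-/mnt}"
MAX_AGE_DAYS="${MAX_AGE_DAYS:-2}"

# Nothing to do if the local-dir root is absent on this host.
[ -d "$SPARK_LOCAL_ROOT" ] || exit 0

# -maxdepth 1 keeps find from descending into directories it is deleting;
# -mtime +N matches directories not modified in more than N days.
find "$SPARK_LOCAL_ROOT" -maxdepth 1 -type d -name 'blockmgr-*' \
    -mtime +"$MAX_AGE_DAYS" -exec rm -rf {} +
```

A conservative threshold matters here: since the report says some directories belong to still-registered executors, age-based deletion should only reclaim directories no running application could plausibly own.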