[ https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156708#comment-16156708 ]

Ruslan Shestopalyuk edited comment on SPARK-21942 at 9/7/17 11:49 AM:
----------------------------------------------------------------------

Yes, configuring a different scratch directory is a workaround that we are 
using right now.

However, it took us some time to realize that this is what we needed to do - 
it was not at all obvious from the error message/exception.
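
For anyone else hitting this, the workaround boils down to pointing 
_spark.local.dir_ away from _/tmp_. A minimal sketch (the directory name below 
is just an example, pick whatever suits your deployment):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Point the scratch space at a directory the OS will not clean up automatically.
// "/var/spark-scratch" is only an example path, not a recommendation.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.local.dir", "/var/spark-scratch")
val sc = new SparkContext(conf)
{code}

The same setting can also be passed as _--conf spark.local.dir=..._ on 
_spark-submit_.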

Edit: just to elaborate on why I think both of the suggestions, while they 
would fix the problem, are not ideal:

* configuring a different scratch directory is of course an option, but one has 
to explicitly take care of it: know that the setting exists, know what the 
default value is and how it can be potentially dangerous. If the top-level 
scratch folder _does_ get deleted by the system, the thrown exception does not 
really point to what to do (it just says that _DiskBlockManager_ failed to 
create some folder, which can happen for plenty of different reasons - see the 
sketch after this list)
* changing the system settings may not be a good idea, because in some cases 
it can lead to bloating the _/tmp_ folder and running out of disk space. Also, 
it is a bit of a "tail wagging the dog" situation, since the Spark processes 
may not necessarily be the most important ones on the system.
* it may still be preferable to use _/tmp_, because it is often mapped to a 
RAM disk, which is faster (of course, one could argue that any other mount 
point can be set up the same way, but for that, once again, one has to be 
aware of the whole issue)
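
To make the first point concrete, here is a minimal sketch of the kind of 
defensive behavior I have in mind for the two-level lookup - this is not the 
actual Spark code or patch, just an illustration with made-up names:

{code}
import java.io.{File, IOException}

// Sketch only: re-create a missing hashed subdirectory instead of failing
// outright, and make the error actionable when the scratch root itself is gone.
def getFileDefensively(localDir: File, subDirId: Int, filename: String): File = {
  val subDir = new File(localDir, "%02x".format(subDirId))
  if (!subDir.exists() && !subDir.mkdirs() && !subDir.exists()) {
    throw new IOException(
      s"Failed to create local dir $subDir. The scratch root $localDir may have " +
      "been removed by the OS (e.g. a periodic /tmp cleanup); consider setting " +
      "spark.local.dir to a location outside /tmp.")
  }
  new File(subDir, filename)
}
{code}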




> Fix DiskBlockManager crashing when a root local folder has been externally 
> deleted by OS
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-21942
>                 URL: https://issues.apache.org/jira/browse/SPARK-21942
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 
> 2.2.0, 2.2.1, 2.3.0, 3.0.0
>            Reporter: Ruslan Shestopalyuk
>            Priority: Minor
>              Labels: storage
>             Fix For: 2.3.0
>
>
> _DiskBlockManager_ has a notion of "scratch" local folder(s), which can be 
> configured via the _spark.local.dir_ option, and which defaults to the 
> system's _/tmp_. The hierarchy is two levels deep, e.g. 
> _/blockmgr-XXX.../YY_, where the _YY_ part is derived from a hash, to spread 
> files evenly.
> The function _DiskBlockManager.getFile_ expects the top-level directories 
> (_blockmgr-XXX..._) to always exist (they get created once, when the Spark 
> context is first created); otherwise it fails with a message like:
> {code}
> ... java.io.IOException: Failed to create local dir in /tmp/blockmgr-XXX.../YY
> {code}
> However, this may not always be the case.
> In particular, *if it's the default _/tmp_ folder*, there can be different 
> strategies for automatically removing files from it, depending on the OS:
> * at boot time
> * on a regular basis (e.g. once per day via a system cron job)
> * based on the file age
> The symptom is that after the process (in our case, a service) using Spark 
> has been running for a while (a few days), it may no longer be able to load 
> files, since the top-level scratch directories are not there anymore and 
> _DiskBlockManager.getFile_ crashes.
> Please note that this is different from people arbitrarily removing files 
> manually.
> Here we have both facts: _/tmp_ is the default in the Spark config, and the 
> system has the right to tamper with its contents - and it will very likely 
> do so after some period of time.
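
For reference, a rough sketch of the two-level placement described above - the 
names and hashing arithmetic here are illustrative, not necessarily Spark's 
exact implementation:

{code}
import java.io.File

// Illustrative only: map a block file name to one of the scratch roots
// (blockmgr-XXX...) and then to a hashed subdirectory (YY) inside it.
def locateBlockFile(localDirs: Array[File], subDirsPerLocalDir: Int,
                    filename: String): File = {
  val hash = filename.hashCode & Integer.MAX_VALUE               // non-negative hash
  val dirId = hash % localDirs.length                            // pick a scratch root
  val subDirId = (hash / localDirs.length) % subDirsPerLocalDir  // pick the "YY" subdir
  new File(new File(localDirs(dirId), "%02x".format(subDirId)), filename)
}
{code}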


