Yep, this is only a problem if you are running in an environment where your paths may change because the authority or other parameters change. Basically, if any part of the URI other than the "this is where the file is" information is mutable in your system and does change, you can lose data with this bug. I'll write up a doc PR while we think about a full fix.
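
To make the mismatch concrete, here is a minimal Python sketch. It is not the Iceberg or Spark code, just an illustration of why a plain string comparison flags every file once the authority differs. The host names, paths, and helper functions are all hypothetical, and comparing on scheme and path alone is only one possible approach, not necessarily the fix we will end up with.

```python
from typing import List, Tuple
from urllib.parse import urlsplit

# Hypothetical locations. The metadata was written under one authority,
# and the later file listing returns a different one.
metadata_files = [
    "hdfs://namenode-old:8020/warehouse/db/tbl/data/part-0.parquet",
    "hdfs://namenode-old:8020/warehouse/db/tbl/data/part-1.parquet",
]
listed_files = [
    "hdfs://namenode-new:8020/warehouse/db/tbl/data/part-0.parquet",
    "hdfs://namenode-new:8020/warehouse/db/tbl/data/part-1.parquet",
    "hdfs://namenode-new:8020/warehouse/db/tbl/data/stray.parquet",
]


def location_key(uri: str) -> Tuple[str, str]:
    """Return (scheme, path) for a URI, ignoring the authority."""
    parts = urlsplit(uri)
    return (parts.scheme, parts.path)


def orphans_by_string(listed: List[str], referenced: List[str]) -> List[str]:
    """Exact string comparison, analogous to a string-matching join."""
    known = set(referenced)
    return [f for f in listed if f not in known]


def orphans_by_location(listed: List[str], referenced: List[str]) -> List[str]:
    """Compare on scheme and path only (an assumption, not the real fix)."""
    known = {location_key(f) for f in referenced}
    return [f for f in listed if location_key(f) not in known]


# String matching treats every listed file as an orphan, including live data.
print(orphans_by_string(listed_files, metadata_files))
# Ignoring the authority leaves only the file that is truly unreferenced.
print(orphans_by_location(listed_files, metadata_files))
```

Dropping the authority entirely would be unsafe if two different file systems could hold the same scheme and path, so treat this purely as a demonstration of the failure mode.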

On Mon, Sep 14, 2020 at 8:02 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Thanks for the heads up on this. It sounds like this is not a concern for
> most people, but we should definitely add it to our maintenance docs to
> call it out in a warning. Would you like to open a PR for that?
>
> On Fri, Sep 11, 2020 at 3:45 PM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> Because the RemoveOrphanFilesAction uses Filesystem.list, the paths of
>> files found in the file system can have an authority included in them based
>> on the core-site.xml. This is determined
>> when listing the files so the entries stored in the metadata tables do
>> not necessarily have to match. URIs will have the same scheme and path but
>> can have a potentially
>> different authority. This means when doing a string matching join in
>> Spark, the files found on the system will not match those listed in the
>> metadata table and the
>> action will determine that the files are no longer required and delete
>> them. This leads to removing all the files that are listed with a different
>> authority.
>>
>> This will only affect you if you have changed authorities between writing
>> and running RemoveOrphanFilesAction I believe.
>> We are doing more investigation but because of the potential for data
>> loss I thought it important to share with the dev-list.
>>
>> If your authority has not changed, or will not change there should be no
>> issues.
>>
>> Thanks for your time,
>> Russ
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>