On Thu, 18 Jul 2024 at 00:02, Ryan Blue <b...@apache.org> wrote:

> Hey everyone,
>
> There has been some recent discussion about improving HadoopTableOperations and the catalog based on those tables, but we've discouraged using file system only table (or "hadoop" tables) for years now because of major problems:
> * It is only safe to use hadoop tables with HDFS; most local file systems, S3, and other common object stores are unsafe
>
Azure storage and Linux local filesystems all support atomic file and dir rename and delete; Google GCS does it for files and dirs only. Windows, well, anybody who claims to understand the semantics of MoveFile is probably wrong (https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-movefilewithprogressw).

> * Despite not providing atomicity guarantees outside of HDFS, people use the tables in unsafe situations
>

which means "s3", unless something needs directory rename

> * HadoopCatalog cannot implement atomic operations for rename and drop table, which are commonly used in data engineering
> * Alternative file names (for instance when using metadata file compression) also break guarantees
>
> While these tables are useful for testing in non-production scenarios, I think it's misleading to have them in the core module because there's an appearance that they are a reasonable choice. I propose we deprecate the HadoopTableOperations and HadoopCatalog implementations and move them to tests the next time we can make breaking API changes (2.0).
>
> I think we should also consider similar fixes to the table spec. It currently describes how HadoopTableOperations works, which does not work in object stores or local file systems. HDFS is becoming much less common and I propose that we note that the strategy in the spec should ONLY be used with HDFS.
>
> What do other people think?
>
> Ryan
>
> --
> Ryan Blue
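
For anyone who hasn't looked at the code, the rename-as-commit pattern this thread is about looks roughly like the sketch below. This is a simplified illustration, not the actual Iceberg implementation; the class and method names are made up and only the Hadoop FileSystem calls are real. The whole commit protocol rests on rename() being atomic and refusing to replace an existing destination, which HDFS guarantees and object stores like S3 do not.

// Simplified sketch of the rename-as-commit pattern used by filesystem-only
// ("hadoop") tables. NOT the actual Iceberg implementation; class and method
// names here are made up, only the Hadoop FileSystem calls are real.
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameCommitSketch {

  // Try to commit metadata version `nextVersion` by writing a temp file and
  // renaming it into place. Returns true only if this writer won the race.
  public static boolean commit(FileSystem fs, Path tableDir, int nextVersion, String metadataJson)
      throws IOException {
    Path temp = new Path(tableDir, "metadata/.tmp-v" + nextVersion + ".metadata.json");
    Path finalPath = new Path(tableDir, "metadata/v" + nextVersion + ".metadata.json");

    // Write the new metadata somewhere no reader will look at it yet.
    try (OutputStream out = fs.create(temp, false /* don't overwrite */)) {
      out.write(metadataJson.getBytes(StandardCharsets.UTF_8));
    }

    // The step everything hinges on: on HDFS this rename is atomic and
    // returns false if finalPath already exists, so at most one concurrent
    // committer wins. On stores without atomic rename (e.g. S3, where rename
    // is emulated with copy + delete), two committers can both "succeed" and
    // silently lose a commit.
    boolean committed = fs.rename(temp, finalPath);
    if (!committed) {
      fs.delete(temp, false); // lost the race; clean up and let the caller retry
    }
    return committed;
  }
}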