On Thu, 18 Jul 2024 at 00:02, Ryan Blue <b...@apache.org> wrote:

> Hey everyone,
>
> There has been some recent discussion about improving HadoopTableOperations and the catalog based on those tables, but we've discouraged using file system only table (or "hadoop" tables) for years now because of major problems:
> * It is only safe to use hadoop tables with HDFS; most local file systems, S3, and other common object stores are unsafe
>
Azure storage and Linux local filesystems all support atomic file and dir rename and delete; Google GCS does it for files and dirs only. Windows, well, anybody who claims to understand the semantics of MoveFile is probably wrong (https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-movefilewithprogressw).

> * Despite not providing atomicity guarantees outside of HDFS, people use the tables in unsafe situations
>

which means "s3", unless something needs directory rename

> * HadoopCatalog cannot implement atomic operations for rename and drop table, which are commonly used in data engineering
> * Alternative file names (for instance when using metadata file compression) also break guarantees
>
> While these tables are useful for testing in non-production scenarios, I think it's misleading to have them in the core module because there's an appearance that they are a reasonable choice. I propose we deprecate the HadoopTableOperations and HadoopCatalog implementations and move them to tests the next time we can make breaking API changes (2.0).
>
> I think we should also consider similar fixes to the table spec. It currently describes how HadoopTableOperations works, which does not work in object stores or local file systems. HDFS is becoming much less common and I propose that we note that the strategy in the spec should ONLY be used with HDFS.
>
> What do other people think?
>
> Ryan
>
> --
> Ryan Blue
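
For anyone who hasn't looked at the code, the rename-as-commit pattern this thread is about looks roughly like the sketch below. This is a simplified illustration, not the actual Iceberg implementation; the class and method names are made up and only the Hadoop FileSystem calls are real. The whole commit protocol rests on rename() being atomic and refusing to replace an existing destination, which HDFS guarantees and object stores like S3 do not.

// Simplified sketch of the rename-as-commit pattern used by filesystem-only
// ("hadoop") tables. NOT the actual Iceberg implementation; class and method
// names here are made up, only the Hadoop FileSystem calls are real.
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameCommitSketch {

  // Try to commit metadata version `nextVersion` by writing a temp file and
  // renaming it into place. Returns true only if this writer won the race.
  public static boolean commit(FileSystem fs, Path tableDir, int nextVersion, String metadataJson)
      throws IOException {
    Path temp = new Path(tableDir, "metadata/.tmp-v" + nextVersion + ".metadata.json");
    Path finalPath = new Path(tableDir, "metadata/v" + nextVersion + ".metadata.json");

    // Write the new metadata somewhere no reader will look at it yet.
    try (OutputStream out = fs.create(temp, false /* don't overwrite */)) {
      out.write(metadataJson.getBytes(StandardCharsets.UTF_8));
    }

    // The step everything hinges on: on HDFS this rename is atomic and
    // returns false if finalPath already exists, so at most one concurrent
    // committer wins. On stores without atomic rename (e.g. S3, where rename
    // is emulated with copy + delete), two committers can both "succeed" and
    // silently lose a commit.
    boolean committed = fs.rename(temp, finalPath);
    if (!committed) {
      fs.delete(temp, false); // lost the race; clean up and let the caller retry
    }
    return committed;
  }
}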