Hey everyone,

There has been some recent discussion about improving HadoopTableOperations and the catalog based on those tables, but we've discouraged using file-system-only tables (or "hadoop" tables) for years now because of major problems:

* It is only safe to use hadoop tables with HDFS; most local file systems, S3, and other common object stores are unsafe (see the sketch below)
* Despite not providing atomicity guarantees outside of HDFS, people use the tables in unsafe situations
* HadoopCatalog cannot implement atomic operations for rename and drop table, which are commonly used in data engineering
* Alternative file names (for instance when using metadata file compression) also break guarantees
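To make the rename problem concrete, here is a minimal sketch of the commit step these tables depend on. This is not the actual HadoopTableOperations code; the class and method names are illustrative, but the pattern is the same: write the new metadata to a temporary file, then rename it to the next versioned file name. That is only a safe commit when rename is atomic and fails if the destination already exists, which HDFS provides but S3 and most local file systems do not.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch only, not the real HadoopTableOperations implementation.
public class RenameCommitSketch {
  /**
   * Commit a new metadata version by renaming a temp file to the next
   * versioned metadata file name. Correctness depends on rename being atomic
   * and failing when the destination already exists -- semantics HDFS
   * provides, but object stores like S3 and most local file systems do not.
   */
  public static void commitNewVersion(
      Configuration conf, Path tempMetadata, Path versionedMetadata) throws IOException {
    FileSystem fs = versionedMetadata.getFileSystem(conf);
    // On S3A, rename is a non-atomic copy-and-delete, so two concurrent
    // committers can both appear to succeed and one commit is silently lost.
    if (!fs.rename(tempMetadata, versionedMetadata)) {
      throw new IOException("Cannot commit, rename failed: " + versionedMetadata);
    }
  }
}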
While these tables are useful for testing in non-production scenarios, I think it's misleading to keep them in the core module because it gives the appearance that they are a reasonable choice. I propose we deprecate the HadoopTableOperations and HadoopCatalog implementations and move them to tests the next time we can make breaking API changes (2.0).

I think we should also consider similar fixes to the table spec. It currently describes how HadoopTableOperations works, which does not work in object stores or local file systems. HDFS is becoming much less common, and I propose that we note that the strategy in the spec should ONLY be used with HDFS.

What do other people think?

Ryan

--
Ryan Blue