According to our requirements, this feature is for users who want to read Iceberg tables without relying on any catalog, so I think StaticTable may be more flexible and clearer in semantics. With StaticTable, it is the user's responsibility to decide which table metadata file to read. With a read-only HadoopCatalog, the metadata file is resolved by the catalog instead; is there a chance that the resolution strategy HadoopCatalog uses is incompatible with tables managed by other catalogs?
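To make the semantics concrete, this is roughly how StaticTable is used in pyiceberg today: the caller names one specific metadata file and gets a read-only table back, with no catalog consulted. A minimal sketch (the bucket path is made up for illustration):

    from pyiceberg.table import StaticTable

    # The user picks the exact metadata file to read; nothing resolves it
    # on their behalf, so the choice of metadata version is explicit.
    table = StaticTable.from_metadata(
        "s3://bucket/warehouse/db/events/metadata/v3.metadata.json"  # illustrative path
    )
    print(table.schema())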
Renjie Liu <liurenjie2...@gmail.com> wrote on Thu, Jul 18, 2024 at 11:39:

> I think there are two ways to do this:
> 1. As Xuanwo said, we refactor HadoopCatalog to be read-only, and throw
> UnsupportedOperationException for the operations that manipulate tables.
> 2. Totally deprecate HadoopCatalog, and add StaticTable as we did in
> pyiceberg or iceberg-rust.
>
> On Thu, Jul 18, 2024 at 11:26 AM Xuanwo <xua...@apache.org> wrote:
>
>> Hi, Renjie
>>
>> Are you suggesting that we refactor HadoopCatalog as a FileSystemCatalog
>> to enable direct reading from file systems like HDFS, S3, and Azure Blob
>> Storage? This catalog would be read-only and would not support write
>> operations.
>>
>> On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:
>>
>> Hi, Ryan:
>>
>> Thanks for raising this. I agree that HadoopCatalog is dangerous for
>> manipulating tables/catalogs given the limitations of different file
>> systems. But I see that there are some users who want to read Iceberg
>> tables without relying on any catalogs; this is also the motivating use
>> case for StaticTable in pyiceberg and iceberg-rust. Is there anything
>> similar in the Java implementation?
>>
>> On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <b...@apache.org> wrote:
>>
>> Hey everyone,
>>
>> There has been some recent discussion about improving
>> HadoopTableOperations and the catalog based on those tables, but we've
>> discouraged using file-system-only tables (or "hadoop" tables) for years
>> now because of major problems:
>> * It is only safe to use hadoop tables with HDFS; most local file
>> systems, S3, and other common object stores are unsafe
>> * Despite not providing atomicity guarantees outside of HDFS, people use
>> the tables in unsafe situations
>> * HadoopCatalog cannot implement atomic operations for rename and drop
>> table, which are commonly used in data engineering
>> * Alternative file names (for instance when using metadata file
>> compression) also break guarantees
>>
>> While these tables are useful for testing in non-production scenarios, I
>> think it's misleading to have them in the core module because there's an
>> appearance that they are a reasonable choice. I propose we deprecate the
>> HadoopTableOperations and HadoopCatalog implementations and move them to
>> tests the next time we can make breaking API changes (2.0).
>>
>> I think we should also consider similar fixes to the table spec. It
>> currently describes how HadoopTableOperations works, which does not work
>> in object stores or local file systems. HDFS is becoming much less common,
>> and I propose that we note that the strategy in the spec should ONLY be
>> used with HDFS.
>>
>> What do other people think?
>>
>> Ryan
>>
>> --
>> Ryan Blue
>>
>> Xuanwo
>>
>> https://xuanwo.io/
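For reference, option 1 from Renjie's list could look roughly like the sketch below: serve loads through StaticTable and reject every mutating call. The class name, method set, and the hard-coded v1 metadata path are all hypothetical, not an existing pyiceberg or iceberg-java API:

    from pyiceberg.table import StaticTable

    class ReadOnlyFileSystemCatalog:
        """Hypothetical sketch: resolve tables under a warehouse path, reads only."""

        def __init__(self, warehouse: str):
            self.warehouse = warehouse

        def load_table(self, namespace: str, name: str) -> StaticTable:
            # A real implementation would list the metadata directory and pick
            # the latest vN.metadata.json; "v1" is hard-coded to keep this short.
            location = f"{self.warehouse}/{namespace}/{name}/metadata/v1.metadata.json"
            return StaticTable.from_metadata(location)

        def create_table(self, *args, **kwargs):
            raise NotImplementedError("this catalog is read-only")

        def drop_table(self, *args, **kwargs):
            raise NotImplementedError("this catalog is read-only")

        def rename_table(self, *args, **kwargs):
            raise NotImplementedError("this catalog is read-only")

This is the Python analog of throwing UnsupportedOperationException on the Java side; it sidesteps the atomic rename/drop problems Ryan listed because no write path exists at all.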