Hey everyone,

Lisoda,

> In recent days, I've attempted to migrate hadoop_catalog to jdbc-catalog,
> but I failed.


Was this because the JDBC (or SQL) catalog didn't work, or because the
migration was not feasible? If the former, I'd invite you to raise an issue
on GitHub so we can see what's happening.

Besides the HadoopCatalog, there is also the SQL-Catalog, as mentioned
above. This is available in Java and PyIceberg, and is in flight for Rust.
While for the HadoopCatalog correctness depends on the guarantees of the
underlying storage, with the SQLCatalog we can also move forward and
implement features like multi-table transactions. PyIceberg relies heavily
on the SQLCatalog with an in-memory database (SQLite) for integration
tests.
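
If it helps with retrying the migration, here is a minimal sketch of
standing up the JDBC catalog in Java against an in-memory SQLite database
(assuming the sqlite-jdbc driver is on the classpath; the catalog name and
warehouse path are placeholders):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.CatalogProperties;
    import org.apache.iceberg.jdbc.JdbcCatalog;

    Map<String, String> props = new HashMap<>();
    // placeholder in-memory database; any JDBC URI works here
    props.put(CatalogProperties.URI, "jdbc:sqlite:file::memory:?cache=shared");
    props.put(CatalogProperties.WAREHOUSE_LOCATION, "file:///tmp/warehouse");

    JdbcCatalog catalog = new JdbcCatalog();
    catalog.setConf(new Configuration()); // used by the default HadoopFileIO
    catalog.initialize("demo", props);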

Since there is no consensus, I believe clarifying the spec and moving the
HadoopCatalog to a separate package are the first two steps.

Kind regards,
Fokko

On Tue, Jul 30, 2024 at 09:43, Gabor Kaszab <gaborkas...@apache.org> wrote:

> Hey Iceberg Community,
>
> Sorry for being late to this conversation. I just wanted to share that
> I'm against deprecating HadoopCatalog or moving it to tests. Currently,
> Impala relies heavily on HadoopCatalog for its own tests. I also
> personally find HadoopCatalog pretty handy when I just want to do some
> cross-engine experiments where my data is already on HDFS: I write a
> table with engineA, see if engineB can read it, and don't want to bother
> with setting up any services to serve as an Iceberg catalog (HMS, for
> instance).
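>
> For illustration, that kind of experiment needs nothing more than a
> warehouse path (a sketch; the catalog name and HDFS location below are
> placeholders, and the iceberg-spark-runtime jar is assumed to be on the
> classpath):
>
>     import org.apache.spark.sql.SparkSession;
>
>     SparkSession spark = SparkSession.builder()
>         .config("spark.sql.catalog.hadoop_cat", "org.apache.iceberg.spark.SparkCatalog")
>         .config("spark.sql.catalog.hadoop_cat.type", "hadoop")
>         .config("spark.sql.catalog.hadoop_cat.warehouse", "hdfs://namenode:8020/warehouse")
>         .getOrCreate();
>     // engineA writes; engineB only needs the same warehouse path to read
>     spark.sql("CREATE TABLE hadoop_cat.db.t (id BIGINT) USING iceberg");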
>
> I believe that even though we don't consider HadoopCatalog a production
> grade solution as it is now, it has its benefits for lightweight
> experimentation.
>
>    - I'm +1 for keeping HadoopCatalog
>    - We should emphasize that HDFS is the desired storage for
>    HadoopCatalog (can we enforce this in the code?)
>    - Apparently, part of this community is open to adding enhancements
>    to HadoopCatalog to bring it closer to production grade (lisoda). I
>    don't think we should block these contributions.
>    - If we say that the REST catalog is preferred over HadoopCatalog, I
>    think the Iceberg project should offer its own open-source solution
>    available to everyone.
>
> Regards,
> Gabor
>
> On Thu, Jul 25, 2024 at 9:04 PM Ryan Blue <b...@databricks.com.invalid>
> wrote:
>
>> There are ways to use object store or file system features to do this,
>> but there are a lot of variations. Building implementations and trying to
>> standardize each one is a lot of work. And then you still get a catalog
>> that doesn't support important features.
>>
>> I don't think that this is a good direction to build for the Iceberg
>> project. But I also have no objection to someone doing it in a different
>> project that uses the Iceberg metadata format.
>>
>> On Tue, Jul 23, 2024 at 5:57 PM lisoda <lis...@yeah.net> wrote:
>>
>>>
>>> Sir, regarding this point, we have some experience. In my view, as long
>>> as the file system supports atomic single-file writes, where the file
>>> becomes immediately visible upon the client's successful write operation,
>>> that is sufficient: we can do without the rename operation as long as the
>>> file system guarantees this property. Of course, if the object storage
>>> system supports mutual-exclusion operations, we can also uniformly use the
>>> rename operation for committing. In theory, this avoids having to provide
>>> a large number of commit strategies for different file systems.
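>>>
>>> (To illustrate the assumption, a minimal sketch of a commit via atomic
>>> create-if-absent on HDFS; fs, metadataDir, version, and newMetadataJson
>>> are hypothetical:)
>>>
>>>     import java.nio.charset.StandardCharsets;
>>>     import org.apache.hadoop.fs.FSDataOutputStream;
>>>     import org.apache.hadoop.fs.FileSystem;
>>>     import org.apache.hadoop.fs.Path;
>>>
>>>     Path next = new Path(metadataDir, "v" + (version + 1) + ".metadata.json");
>>>     // create(path, overwrite=false): exactly one concurrent writer succeeds
>>>     try (FSDataOutputStream out = fs.create(next, false)) {
>>>       out.write(newMetadataJson.getBytes(StandardCharsets.UTF_8));
>>>     }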
>>> ---- Replied Message ----
>>> From: Jack Ye <yezhao...@gmail.com>
>>> Date: 07/24/2024 02:52
>>> To: dev@iceberg.apache.org
>>> Subject: Re: Re: [DISCUSS] Deprecate HadoopTableOperations, move to
>>> tests in 2.0
>>> If we come up with a new storage-only catalog implementation that could
>>> solve those limitations and also leverage the new features being developed
>>> in object storage, would that be a potential alternative strategy? That
>>> way, HadoopCatalog users would have a way to move forward with a
>>> storage-only catalog that can still run on HDFS, and we could fully
>>> deprecate HadoopCatalog.
>>>
>>> -Jack
>>>
>>> On Tue, Jul 23, 2024 at 10:00 AM Ryan Blue <b...@databricks.com.invalid>
>>> wrote:
>>>
>>>> I don't think we would want to put this in a module with other catalog
>>>> implementations. It has serious limitations and is actively discouraged,
>>>> while the other catalog implementations still have value as either REST
>>>> back-end catalogs or as regular catalogs for many users.
>>>>
>>>> On Tue, Jul 23, 2024 at 9:11 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> For some additional information, we also have some Iceberg HDFS users
>>>>> on EMR. Those are mainly users that have long-running Hadoop and HBase
>>>>> installations. They typically refresh their installation every 1-2 years.
>>>>> From my understanding, they use S3 for data storage, but metadata is kept
>>>>> in the local HDFS cluster, thus HadoopCatalog works well for them.
>>>>>
>>>>> I remember we discussed moving all the catalog implementations
>>>>> currently in the main repo to a separate iceberg-catalogs repo. Could we
>>>>> do this move as a part of that effort?
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Tue, Jul 23, 2024 at 8:46 AM Ryan Blue <b...@databricks.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Thanks for the context, lisoda. I agree that it's good to understand
>>>>>> the issues you're facing with the HadoopCatalog. One follow-up question
>>>>>> that I have is what the underlying storage is. Are you using HDFS for
>>>>>> those 30,000 customers?
>>>>>>
>>>>>> I think you're right that there is a challenge in migrating. Because
>>>>>> there is no catalog requirement, it's hard to make sure you have all of
>>>>>> the writers migrated. I think that means we do need to have a plan or
>>>>>> recommendation for people currently using this catalog in production,
>>>>>> but it also puts more pressure on us to deprecate this catalog and avoid
>>>>>> more people having this problem.
>>>>>>
>>>>>> I think it's a good idea to make the spec change, which we have
>>>>>> agreement for, and to ensure that the FS catalog and table operations
>>>>>> are properly deprecated to show that they should not be used. I'm not
>>>>>> sure whether there is support in the community for moving the
>>>>>> implementation into a new iceberg-hadoop module, but at a minimum we
>>>>>> can't just remove it right away. I think that a separate iceberg-hadoop
>>>>>> module would make the most sense.
>>>>>>
>>>>>> On Thu, Jul 18, 2024 at 11:09 PM lisoda <lis...@yeah.net> wrote:
>>>>>>
>>>>>>> Hi team.
>>>>>>>      I am not a PMC member, just a regular user. Instead of
>>>>>>> discussing whether HadoopCatalog needs to continue to exist, I'd like
>>>>>>> to share a more practical issue.
>>>>>>>
>>>>>>>     We currently serve over 30,000 customers, all of whom use
>>>>>>> Iceberg to store their foundational data, and all business analyses are
>>>>>>> conducted based on Iceberg. However, all the Iceberg tables are
>>>>>>> hadoop_catalog tables. At least, this has been the case since I started
>>>>>>> working with our production environment system.
>>>>>>>
>>>>>>>     In recent days, I've attempted to migrate hadoop_catalog to
>>>>>>> jdbc-catalog, but I failed. We store 2PB of data, and replacing the
>>>>>>> current catalogs has become an almost impossible task. Users not only
>>>>>>> create hadoop_catalog tables through Spark, they also continuously use
>>>>>>> third-party OLAP systems, Flink, and other means to write data into
>>>>>>> Iceberg in the form of hadoop_catalog. Given this situation, we can
>>>>>>> only continue to fix hadoop_catalog and provide services to customers.
>>>>>>>
>>>>>>>     I understand that the community wants to make a big push into
>>>>>>> rest-catalog, and I agree with the direction the community is going.
>>>>>>> But considering that there might be a significant number of users
>>>>>>> facing similar issues, can we at least retain a module similar to
>>>>>>> iceberg-hadoop to extend hadoop_catalog? If it is removed, we won't be
>>>>>>> able to continue providing services to customers. So, if possible,
>>>>>>> please consider this option.
>>>>>>>
>>>>>>> Thank you all.
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> lisoda
>>>>>>>
>>>>>>>
>>>>>>> At 2024-07-19 01:28:18, "Jack Ye" <yezhao...@gmail.com> wrote:
>>>>>>>
>>>>>>> Thank you for bringing this up, Ryan. I have also been in the camp
>>>>>>> of saying HadoopCatalog is not recommended, but after thinking about
>>>>>>> this more deeply last night, I now have mixed feelings about this
>>>>>>> topic. Just to comment on the reasons you listed first:
>>>>>>>
>>>>>>> * For reasons 1 & 2, it looks like the root cause is that people try
>>>>>>> to use HadoopCatalog outside native HDFS because there are HDFS
>>>>>>> connectors to other storage systems, like S3AFileSystem. However, the
>>>>>>> norm for such usage has been that those connectors do not strictly
>>>>>>> follow HDFS semantics, and it is assumed that people acknowledge the
>>>>>>> implications of such usage and accept the risk. For example,
>>>>>>> S3AFileSystem was there even before S3 was strongly consistent, but
>>>>>>> people have been using it to write files.
>>>>>>>
>>>>>>> * For reason 3, there are multiple catalogs that do not support all
>>>>>>> operations (e.g. Glue for atomic table rename) and people still widely
>>>>>>> use them.
>>>>>>>
>>>>>>> * For reason 4, I see that more as a missing feature. More features
>>>>>>> could definitely be developed in that catalog implementation.
>>>>>>>
>>>>>>> So the key question to me is: how can we prevent people from using
>>>>>>> HadoopCatalog outside native HDFS? We know HadoopCatalog is popular
>>>>>>> because it is a storage-only solution. For object storage specifically,
>>>>>>> HadoopCatalog is not suitable for 2 reasons:
>>>>>>>
>>>>>>> (1) file writes do not enforce mutual exclusion, and thus cannot
>>>>>>> enforce Iceberg's optimistic concurrency requirement (a.k.a. cannot do
>>>>>>> an atomic compare-and-swap)
>>>>>>>
>>>>>>> (2) directory-based design is not preferred in object storage and
>>>>>>> will result in bad performance.
>>>>>>>
>>>>>>> However, now that I look at these 2 issues, they are getting outdated.
>>>>>>>
>>>>>>> (1) object storage is starting to enforce file mutual exclusion. GCS
>>>>>>> supports a file generation number [1] that increments monotonically,
>>>>>>> and can use x-goog-if-generation-match [2] to perform an atomic swap. A
>>>>>>> similar feature [3] exists in Azure Blob Storage. I cannot speak for
>>>>>>> the S3 team roadmap, but Amazon S3 is clearly falling behind in this
>>>>>>> domain, and with market competition, it is very clear that similar
>>>>>>> features will come in the reasonably near future.
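>>>>>>>
>>>>>>> (For illustration, a minimal sketch of such an atomic swap against the
>>>>>>> GCS XML API using plain java.net.http; the bucket, object, token, and
>>>>>>> current generation below are placeholders:)
>>>>>>>
>>>>>>>     import java.net.URI;
>>>>>>>     import java.net.http.HttpClient;
>>>>>>>     import java.net.http.HttpRequest;
>>>>>>>     import java.net.http.HttpResponse;
>>>>>>>
>>>>>>>     HttpRequest swap = HttpRequest.newBuilder()
>>>>>>>         .uri(URI.create("https://storage.googleapis.com/my-bucket/meta/pointer"))
>>>>>>>         .header("Authorization", "Bearer " + token)               // assumed token
>>>>>>>         .header("x-goog-if-generation-match", Long.toString(gen)) // swap guard
>>>>>>>         .PUT(HttpRequest.BodyPublishers.ofString(newMetadataLocation))
>>>>>>>         .build();
>>>>>>>     HttpResponse<Void> resp = HttpClient.newHttpClient()
>>>>>>>         .send(swap, HttpResponse.BodyHandlers.discarding());
>>>>>>>     // 200 = we won the swap; 412 Precondition Failed = lost the race, retry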
>>>>>>>
>>>>>>> (2) directory buckets are becoming the norm. Amazon S3 announced
>>>>>>> directory buckets at re:Invent 2023 [4]; they do not have the same
>>>>>>> performance limitation even if you have very nested folders and many
>>>>>>> objects in a folder. GCS also has a similar feature, launched in
>>>>>>> preview [5] right now. Azure has already had this feature since 2021 [6].
>>>>>>>
>>>>>>> With these new developments in the industry, a storage-only Iceberg
>>>>>>> catalog becomes very attractive. It is simple, with only one service
>>>>>>> dependency. It can safely perform an atomic compare-and-swap. It is
>>>>>>> performant without the need to worry about folder and file
>>>>>>> organization. If you want to add additional features for things like
>>>>>>> access control, there are also integrations like S3 Access Grants [7]
>>>>>>> that can do it in a very scalable way.
>>>>>>>
>>>>>>> I know the direction in the community so far is to go with the REST
>>>>>>> catalog, and I am personally a big advocate for that. However, that
>>>>>>> requires either building a full REST catalog, or choosing a catalog 
>>>>>>> vendor
>>>>>>> that supports REST. There are many capabilities that REST would unlock, 
>>>>>>> but
>>>>>>> those are visions which I expect will take many years down the road for 
>>>>>>> the
>>>>>>> community to continue to drive consensus and build those features. If I 
>>>>>>> am
>>>>>>> the CTO of a small company and I just want an Iceberg data lake(house)
>>>>>>> right now, do I choose REST, or do I choose (or even just build) a
>>>>>>> storage-only Iceberg catalog? I feel I would actually choose the latter.
>>>>>>>
>>>>>>> Going back to the discussion points, my current take on this topic
>>>>>>> is:
>>>>>>>
>>>>>>> (1) +1 for clarifying in the spec that HadoopCatalog should only
>>>>>>> work with HDFS.
>>>>>>>
>>>>>>> (2) +1 if we want to block non-HDFS use cases in HadoopCatalog by
>>>>>>> default (e.g. fail if using S3A), but we should allow a feature flag to
>>>>>>> unblock the usage so that people can use it after understanding the
>>>>>>> implications and risks, just like how people use S3A today.
>>>>>>>
>>>>>>> (3) +0 for removing HadoopCatalog from the core library. It could be
>>>>>>> in a different module like iceberg-hdfs if that is more suitable.
>>>>>>>
>>>>>>> (4) -1 for moving HadoopCatalog to tests, because HDFS is still a
>>>>>>> valid use case for Iceberg. After measures 1-3 above, people who
>>>>>>> actually have an HDFS use case should be able to continue to innovate
>>>>>>> and optimize the HadoopCatalog implementation. Although "HDFS is
>>>>>>> becoming much less common", looking at GitHub issues and discussion
>>>>>>> forums, it still has a pretty big user base.
>>>>>>>
>>>>>>> (5) In general, I propose we separate the discussion of
>>>>>>> HadoopCatalog from that of a "storage-only catalog" that also deals
>>>>>>> with other object storages. With these latest industry developments,
>>>>>>> we should evaluate the direction of building a storage-only Iceberg
>>>>>>> catalog and see if the community has an interest in that. I could help
>>>>>>> raise a thread about it after this discussion is closed.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jack Ye
>>>>>>>
>>>>>>> [1]
>>>>>>> https://cloud.google.com/storage/docs/object-versioning#file_restoration_behavior
>>>>>>> [2]
>>>>>>> https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogifgenerationmatch
>>>>>>> [3]
>>>>>>> https://learn.microsoft.com/en-us/rest/api/storageservices/specifying-conditional-headers-for-blob-service-operations
>>>>>>> [4]
>>>>>>> https://docs.aws.amazon.com/AmazonS3/latest/userguide/directory-buckets-overview.html
>>>>>>> [5] https://cloud.google.com/storage/docs/buckets#enable-hns
>>>>>>> [6]
>>>>>>> https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace
>>>>>>> [7]
>>>>>>> https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants.html
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 18, 2024 at 7:16 AM Eduard Tudenhöfner <
>>>>>>> etudenhoef...@apache.org> wrote:
>>>>>>>
>>>>>>>> +1 on deprecating now and removing them from the codebase with
>>>>>>>> Iceberg 2.0
>>>>>>>>
>>>>>>>> On Thu, Jul 18, 2024 at 10:40 AM Ajantha Bhat <
>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> +1 on deprecating the `File System Tables` in the spec and
>>>>>>>>> `HadoopCatalog`/`HadoopTableOperations` in code for now,
>>>>>>>>> and removing them permanently in the 2.0 release.
>>>>>>>>>
>>>>>>>>> For testing we can use `InMemoryCatalog` as others mentioned.
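>>>>>>>>>
>>>>>>>>> (A minimal sketch of that; names are illustrative:)
>>>>>>>>>
>>>>>>>>>     import java.util.Map;
>>>>>>>>>     import org.apache.iceberg.inmemory.InMemoryCatalog;
>>>>>>>>>
>>>>>>>>>     InMemoryCatalog catalog = new InMemoryCatalog();
>>>>>>>>>     catalog.initialize("test", Map.of());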
>>>>>>>>>
>>>>>>>>> I am not sure about moving them to tests or keeping them only for
>>>>>>>>> HDFS, because it leads to confusion for existing users of the Hadoop
>>>>>>>>> catalog.
>>>>>>>>>
>>>>>>>>> I wanted to have it deprecated 2 years ago
>>>>>>>>> <https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1647950504955309>
>>>>>>>>> and I remember that we discussed it in a sync at the time and left it
>>>>>>>>> as it was. Also, when a user recently brought up lockmanager and
>>>>>>>>> refactoring the HadoopTableOperations in Slack
>>>>>>>>> <https://apache-iceberg.slack.com/archives/C03LG1D563F/p1720075009593789?thread_ts=1719993403.208859&cid=C03LG1D563F>,
>>>>>>>>> I asked to open this discussion on the mailing list, so that we can
>>>>>>>>> conclude it once and for all.
>>>>>>>>>
>>>>>>>>> - Ajantha
>>>>>>>>>
>>>>>>>>> On Thu, Jul 18, 2024 at 12:49 PM Fokko Driesprong <
>>>>>>>>> fo...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Ryan and others,
>>>>>>>>>>
>>>>>>>>>> Thanks for bringing this up. I would be in favor of removing the
>>>>>>>>>> HadoopTableOperations, mostly because of the reasons that you
>>>>>>>>>> already mentioned, but also because it is not fully in line with the
>>>>>>>>>> first principles of Iceberg (being object-store native), as it uses
>>>>>>>>>> file listing.
>>>>>>>>>>
>>>>>>>>>> I think we should deprecate the HadoopTables to raise their users'
>>>>>>>>>> attention. I would be reluctant to move it to tests just to use it
>>>>>>>>>> for testing purposes; I'd rather remove it and replace its use in
>>>>>>>>>> tests with the InMemoryCatalog.
>>>>>>>>>>
>>>>>>>>>> Regarding the StaticTable, this is an easy way to have a
>>>>>>>>>> read-only table by directly pointing to the metadata. This also 
>>>>>>>>>> lives in
>>>>>>>>>> Java under StaticTableOperations
>>>>>>>>>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/StaticTableOperations.java>.
>>>>>>>>>> It isn't a full-blown catalog where you can list {tables, schemas},
>>>>>>>>>> update tables, etc. As ZENOTME pointed out already, it is all up to
>>>>>>>>>> the user; for example, there is no listing of directories to
>>>>>>>>>> determine which tables are in the catalog.
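>>>>>>>>>>
>>>>>>>>>> (For reference, a minimal sketch in Java; the metadata path and
>>>>>>>>>> table name are placeholders:)
>>>>>>>>>>
>>>>>>>>>>     import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>     import org.apache.iceberg.BaseTable;
>>>>>>>>>>     import org.apache.iceberg.StaticTableOperations;
>>>>>>>>>>     import org.apache.iceberg.Table;
>>>>>>>>>>     import org.apache.iceberg.hadoop.HadoopFileIO;
>>>>>>>>>>
>>>>>>>>>>     StaticTableOperations ops = new StaticTableOperations(
>>>>>>>>>>         "hdfs://nn:8020/warehouse/db/t/metadata/v3.metadata.json",
>>>>>>>>>>         new HadoopFileIO(new Configuration()));
>>>>>>>>>>     Table table = new BaseTable(ops, "db.t"); // read-only: no commits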
>>>>>>>>>>
>>>>>>>>>> is there a probability that the strategy used by HadoopCatalog is
>>>>>>>>>>> not compatible with the table managed by other catalogs?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, they are different. You can see in the spec the section on
>>>>>>>>>> File System tables
>>>>>>>>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#file-system-tables>,
>>>>>>>>>> which is used by the HadoopTable implementation, whereas the other
>>>>>>>>>> catalogs follow the Metastore Tables
>>>>>>>>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#metastore-tables>
>>>>>>>>>> section.
>>>>>>>>>>
>>>>>>>>>> Kind regards,
>>>>>>>>>> Fokko
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 18, 2024 at 07:19, NOTME ZE <
>>>>>>>>>> st810918...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> According to our requirements, this function is for users who
>>>>>>>>>>> want to read Iceberg tables without relying on any catalogs. I
>>>>>>>>>>> think the StaticTable may be more flexible and clearer in
>>>>>>>>>>> semantics. For StaticTable, it's the user's responsibility to
>>>>>>>>>>> decide which metadata of the table to read. But for a read-only
>>>>>>>>>>> HadoopCatalog, the metadata may be decided by the catalog. Is there
>>>>>>>>>>> a probability that the strategy used by HadoopCatalog is not
>>>>>>>>>>> compatible with tables managed by other catalogs?
>>>>>>>>>>>
>>>>>>>>>>> Renjie Liu <liurenjie2...@gmail.com> wrote on Thu, Jul 18, 2024 at 11:39:
>>>>>>>>>>>
>>>>>>>>>>>> I think there are two ways to do this:
>>>>>>>>>>>> 1. As Xuanwo said, we refactor HadoopCatalog to be read-only
>>>>>>>>>>>> and throw an UnsupportedOperationException for operations that
>>>>>>>>>>>> manipulate tables.
>>>>>>>>>>>> 2. Totally deprecate HadoopCatalog and add StaticTable, as we
>>>>>>>>>>>> did in pyiceberg and iceberg-rust.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 18, 2024 at 11:26 AM Xuanwo <xua...@apache.org>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, Renjie
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are you suggesting that we refactor HadoopCatalog as a
>>>>>>>>>>>>> FileSystemCatalog to enable direct reading from file systems like
>>>>>>>>>>>>> HDFS, S3, and Azure Blob Storage? This catalog would be read-only
>>>>>>>>>>>>> and would not support write operations.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, Ryan:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for raising this. I agree that HadoopCatalog is
>>>>>>>>>>>>> dangerous for manipulating tables/catalogs, given the limitations
>>>>>>>>>>>>> of different file systems. But I see that there are some users
>>>>>>>>>>>>> who want to read Iceberg tables without relying on any catalogs;
>>>>>>>>>>>>> this is also the motivating use case of StaticTable in pyiceberg
>>>>>>>>>>>>> and iceberg-rust. Is there something similar in the Java
>>>>>>>>>>>>> implementation?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <b...@apache.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> There has been some recent discussion about improving
>>>>>>>>>>>>> HadoopTableOperations and the catalog based on those tables, but 
>>>>>>>>>>>>> we've
>>>>>>>>>>>>> discouraged using file system only table (or "hadoop" tables) for 
>>>>>>>>>>>>> years now
>>>>>>>>>>>>> because of major problems:
>>>>>>>>>>>>> * It is only safe to use hadoop tables with HDFS; most local
>>>>>>>>>>>>> file systems, S3, and other common object stores are unsafe
>>>>>>>>>>>>> * Despite not providing atomicity guarantees outside of HDFS,
>>>>>>>>>>>>> people use the tables in unsafe situations
>>>>>>>>>>>>> * HadoopCatalog cannot implement atomic operations for rename
>>>>>>>>>>>>> and drop table, which are commonly used in data engineering
>>>>>>>>>>>>> * Alternative file names (for instance when using metadata
>>>>>>>>>>>>> file compression) also break guarantees
>>>>>>>>>>>>>
>>>>>>>>>>>>> While these tables are useful for testing in non-production
>>>>>>>>>>>>> scenarios, I think it's misleading to have them in the core 
>>>>>>>>>>>>> module because
>>>>>>>>>>>>> there's an appearance that they are a reasonable choice. I 
>>>>>>>>>>>>> propose we
>>>>>>>>>>>>> deprecate the HadoopTableOperations and HadoopCatalog 
>>>>>>>>>>>>> implementations and
>>>>>>>>>>>>> move them to tests the next time we can make breaking API changes 
>>>>>>>>>>>>> (2.0).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think we should also consider similar fixes to the table
>>>>>>>>>>>>> spec. It currently describes how HadoopTableOperations works, 
>>>>>>>>>>>>> which does
>>>>>>>>>>>>> not work in object stores or local file systems. HDFS is becoming 
>>>>>>>>>>>>> much less
>>>>>>>>>>>>> common and I propose that we note that the strategy in the spec 
>>>>>>>>>>>>> should ONLY
>>>>>>>>>>>>> be used with HDFS.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What do other people think?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Xuanwo
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://xuanwo.io/
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Databricks
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Databricks
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Databricks
>>
>
