Hi Vladimir,

As Ryan said, it's not a bug: CREATE OR REPLACE can be seen as "CREATE
AND UPDATE" from a table format perspective. Specifically for the
properties, it makes sense not to delete the current properties, as they
can be used in several use cases (security, table grouping, ...).
I'm not sure a REST Spec update is required; this is probably more on
the engine side. In the REST Spec, you can create a table
(https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L553)
and update a table
(https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L975),
and it's up to the query engine to implement the "CREATE OR REPLACE"
with the correct semantics.
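
To make the difference concrete, here is a rough sketch in plain Python
(dicts only, not the Iceberg API) using the properties from Vladimir's
example. The function names are hypothetical, and the update shapes only
loosely mirror the "set-properties" and "remove-properties" table update
actions as I understand them from the REST Spec. It shows the merge
behavior, and how an engine that wants DROP + CREATE style properties
could ask for the removals explicitly:

```python
def apply_updates(current, updates):
    """Apply property updates to a copy of the current table properties."""
    result = dict(current)
    for update in updates:
        if update["action"] == "remove-properties":
            for key in update["removals"]:
                result.pop(key, None)
        elif update["action"] == "set-properties":
            result.update(update["updates"])
        else:
            raise ValueError(f"unsupported action: {update['action']}")
    return result


def merge_updates(new):
    """Merge semantics: only set the properties given in the statement."""
    return [{"action": "set-properties", "updates": dict(new)}]


def full_replace_updates(current, new):
    """Hypothetical engine-side choice: also remove properties that are
    present on the existing table but absent from the statement."""
    removals = sorted(key for key in current if key not in new)
    updates = []
    if removals:
        updates.append({"action": "remove-properties", "removals": removals})
    updates.append({"action": "set-properties", "updates": dict(new)})
    return updates


old = {"a": "1", "b": "2"}
new = {"b": "3", "c": "4"}

merged = apply_updates(old, merge_updates(new))
replaced = apply_updates(old, full_replace_updates(old, new))

print(merged)    # {'a': '1', 'b': '3', 'c': '4'}
print(replaced)  # {'b': '3', 'c': '4'}
```

In other words, the second behavior can be obtained by the engine sending
the removals itself, without changing what the catalog does by default.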

Regards
JB

On Sun, Oct 20, 2024 at 9:26 PM Vladimir Ozerov <voze...@querifylabs.com> wrote:
>
> Hi Ryan,
>
> Thanks for the clarification. Yes, I think my confusion was caused by the 
> fact that many engines treat CREATE OR REPLACE as a semantic equivalent of 
> DROP + CREATE, which is performed atomically (e.g., Flink [1]). Table formats 
> add history on top of that, which is expected to be retained, no questions 
> here. Permission propagation also makes sense. For properties things become a 
> bit blurry, because on the one hand there are Iceberg specific properties, 
> which may affect table maintenance, and on the other hand there are 
> user-defined properties in the same bag. The question appeared in the first 
> place because I observed a discrepancy in Trino: all catalogs except for REST 
> completely override table properties on REPLACE, and the REST catalog merges 
> them, which might be confusing to end users. Perhaps some clarification at 
> the spec level might be useful, because without agreement between engines there 
> could be some subtle bugs in multi-engine environments, such as sudden data 
> format changes between replaces, etc.
>
> [1] 
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#create-or-replace-table
>
> Regards,
> Vladimir.
>
> On Sun, Oct 20, 2024 at 9:20 PM rdb...@gmail.com <rdb...@gmail.com> wrote:
>>
>> Hi Vladimir,
>>
>> This isn't a bug. The behavior of CREATE OR REPLACE is to replace the data 
>> of a table, but to maintain things like other refs, snapshot history, 
>> permissions (if supported by the catalog), and table properties. Table 
>> properties are replaced if they are set in the operation like `b` in your 
>> example. This is not the same as a drop and create, which may be what you 
>> want instead.
>>
>> The reason for this behavior is that the CREATE OR REPLACE operation is used 
>> to replace a table's data without needing to handle schema changes between 
>> versions. For example, producing a daily report table that replaces the 
>> previous day. However, the table still exists and it is valuable to be able 
>> to time travel to older versions or to be able to use branches and tags. 
>> Clearly, that means that table history and refs stick around so the table is 
>> not completely new every time it is replaced.
>>
>> Adding on to that, properties control things like ref and snapshot 
>> retention, file format, compression, and other settings. These aren't 
>> settings that should need to be restated in every replace operation. And it 
>> would make no sense to set the snapshot retention so that older snapshots 
>> are retained, only to have that setting discarded the next time you replace 
>> the table data. A good way to think about this is that table properties are 
>> set infrequently, while table data changes regularly. And the person 
>> changing the data may not be the person tuning the table settings.
>>
>> Hopefully that helps,
>>
>> Ryan
>>
>> On Sun, Oct 20, 2024 at 9:45 AM Vladimir Ozerov <voze...@querifylabs.com> 
>> wrote:
>>>
>>> Hi,
>>>
>>> Consider a REST catalog where a user issues a "CREATE OR REPLACE <table>" 
>>> command. When processing the command, engines will usually initiate a 
>>> "createOrReplace" transaction and add metadata, such as the properties of 
>>> the new table.
>>>
>>> Users expect a table to be replaced with a new one if it exists, including 
>>> properties. However, I observe the following:
>>>
>>> 1. RESTSessionCatalog loads previous table metadata, adds new properties 
>>>    (MetadataUpdate.SetProperties), and invokes the backend.
>>> 2. The backend (e.g., Polaris) will typically invoke 
>>>    "CatalogHandler.updateTable." There, the previous table state, including 
>>>    its properties, is loaded.
>>> 3. Finally, metadata updates are applied, and old table properties are merged 
>>>    with new ones. That is, if the old table has properties [a=1, b=2], and the 
>>>    new table has properties [b=3, c=4], then the final properties would be 
>>>    [a=1, b=3, c=4], while the user expects [b=3, c=4].
>>>
>>> It looks like a bug because the user expects complete property replacement 
>>> instead of a merge. Shall we explicitly clear all previous properties in 
>>> RESTSessionCatalog.Builder.replaceTransaction?
>>>
>>> Regards,
>>> Vladimir.
>>>
>>>
>>>
>>> --
>>> Vladimir Ozerov
>>> Founder
>>> querifylabs.com
>
>
>
> --
> Vladimir Ozerov
> Founder
> querifylabs.com
