Thanks Vladimir! Would you like to open a PR to make that change? It sounds like another good item to put into the "Implementation notes" section.
On Sun, Oct 20, 2024 at 11:41 PM Vladimir Ozerov <voze...@querifylabs.com> wrote: > Hi Jean-Baptiste, > > Agreed. REST spec looks good. I am talking about the general spec, where > it might be useful to add a hint to engine developers, that CREATE OR > REPLACE semantics in Iceberg is expected to follow slightly different > semantics. This is already broken in Trino: depending on catalog type users > may get either classical "DROP + CREATE" (for non-REST catalogs), or > "CREATE AND UPDATE" for REST catalog. For Flink, their official docs say > that CREATE OR REPLACE == DROP + CREATE, while for Iceberg tables this > should not be the case. These are definitively things that should be fixed > at engine levels. But at the same time it highlights that engine developers > are having hard time defining proper semantics for CREATE OR REPLACE in the > Iceberg integrations, so a paragraph or so in the main Iceberg spec may > help us align expectations. > > Regards, > Vladimir. > > On Mon, Oct 21, 2024 at 8:28 AM Jean-Baptiste Onofré <j...@nanthrax.net> > wrote: > >> Hi Vladimir, >> >> As Ryan said, it's not a bug: CREATE OR REPLACE can be seen as "CREATE >> AND UPDATE" from table format perspective. Specifically for the >> properties, it makes sense to not delete the current properties as it >> can be used in several use cases (security, tables grouping, ...). >> I'm not sure a REST Spec update is required, probably more on the >> engine side. In the REST Spec, you can create a table >> ( >> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L553 >> ) >> and update a table >> ( >> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L975 >> ), >> and it's up to the query engine to implement the "CREATE OR REPLACE" >> with the correct semantic. >> >> Regards >> JB >> >> On Sun, Oct 20, 2024 at 9:26 PM Vladimir Ozerov <voze...@querifylabs.com> >> wrote: >> > >> > Hi Ryan, >> > >> > Thanks for the clarification. Yes, I think my confusion was caused by >> the fact that many engines treat CREATE OR REPLACE as a semantic equivalent >> of DROP + CREATE, which is performed atomically (e.g., Flink [1]). Table >> formats add history on top of that, which is expected to be retained, no >> questions here. Permission propagation also make sense. For properties >> things become a bit blurry, because on the one hand there are Iceberg >> specific properties, which may affect table maintenance, and on the other >> hand there are user-defined properties in the same bag. The question >> appeared in the first place because I observed a discrepancy in Trino: all >> catalogs except for REST completely overrides table properties on REPLACE, >> and REST catalog merges them, which might be confusing to end users. >> Perhaps some clarification at the spec level might be useful, because >> without agreement between engines the could be some subtle bugs in >> multi-engine environments, such as sudden data format changes between >> replaces, etc. >> > >> > [1] >> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#create-or-replace-table >> > >> > Regards, >> > Vladimir. >> > >> > On Sun, Oct 20, 2024 at 9:20 PM rdb...@gmail.com <rdb...@gmail.com> >> wrote: >> >> >> >> Hi Vladimir, >> >> >> >> This isn't a bug. The behavior of CREATE OR REPLACE is to replace the >> data of a table, but to maintain things like other refs, snapshot history, >> permissions (if supported by the catalog), and table properties. Table >> properties are replaced if they are set in the operation like `b` in your >> example. This is not the same as a drop and create, which may be what you >> want instead. >> >> >> >> The reason for this behavior is that the CREATE OR REPLACE operation >> is used to replace a table's data without needing to handle schema changes >> between versions. For example, producing a daily report table that replaces >> the previous day. However, the table still exists and it is valuable to be >> able to time travel to older versions or to be able to use branches and >> tags. Clearly, that means that table history and refs stick around so the >> table is not completely new every time it is replaced. >> >> >> >> Adding on to that, properties control things like ref and snapshot >> retention, file format, compression, and other settings. These aren't >> settings that need to be carried through in every replace operation. And it >> would make no sense if you set the snapshot retention because older >> snapshots are retained, only to have it discarded the next time you replace >> the table data. A good way to think about this is that table properties are >> set infrequently, while table data changes regularly. And the person >> changing the data may not be the person tuning the table settings. >> >> >> >> Hopefully that helps, >> >> >> >> Ryan >> >> >> >> On Sun, Oct 20, 2024 at 9:45 AM Vladimir Ozerov < >> voze...@querifylabs.com> wrote: >> >>> >> >>> Hi, >> >>> >> >>> Consider a REST catalog and a user calls "CREATE OR REPLACE <table>" >> command. When processing the command, engines will usually initiate a >> "createOrReplace" transaction and add metadata, such as the properties of a >> new table. >> >>> >> >>> Users expect a table to be replaced with a new one if it exists, >> including properties. However, I observe the following: >> >>> >> >>> RESTSessionCatalog loads previous table metadata, adds new properties >> (MetadataUpdate.SetProperties), and invokes the backend >> >>> The backend (e.g., Polaris) will typically invoke >> "CatalogHandler.updateTable." There, the previous table state, including >> its properties, is loaded >> >>> Finally, metadata updates are applied, and old table properties are >> merged with new ones. That is, if the old table has properties [a=1, b=2], >> and the new table has properties [b=3, c=4], then the final properties >> would be [a=1, b=3, c=4], while the user expects [b=3, c=4]. >> >>> >> >>> It looks like a bug because the user expects complete property >> replacement instead of a merge. Shall we explicitly clear all previous >> properties in RESTSessionCatalog.Builder.replaceTransaction? >> >>> >> >>> Regards, >> >>> Vladimir. >> >>> >> >>> >> >>> >> >>> -- >> >>> Vladimir Ozerov >> >>> Founder >> >>> querifylabs.com >> > >> > >> > >> > -- >> > Vladimir Ozerov >> > Founder >> > querifylabs.com >> > > > -- > *Vladimir Ozerov* > Founder > querifylabs.com >