I second Ryan here, it would be great to clarify in the "implementation notes" section.
Thanks ! Regards JB On Wed, Oct 23, 2024 at 1:10 AM rdb...@gmail.com <rdb...@gmail.com> wrote: > > Thanks Vladimir! Would you like to open a PR to make that change? It sounds > like another good item to put into the "Implementation notes" section. > > On Sun, Oct 20, 2024 at 11:41 PM Vladimir Ozerov <voze...@querifylabs.com> > wrote: >> >> Hi Jean-Baptiste, >> >> Agreed. REST spec looks good. I am talking about the general spec, where it >> might be useful to add a hint to engine developers, that CREATE OR REPLACE >> semantics in Iceberg is expected to follow slightly different semantics. >> This is already broken in Trino: depending on catalog type users may get >> either classical "DROP + CREATE" (for non-REST catalogs), or "CREATE AND >> UPDATE" for REST catalog. For Flink, their official docs say that CREATE OR >> REPLACE == DROP + CREATE, while for Iceberg tables this should not be the >> case. These are definitively things that should be fixed at engine levels. >> But at the same time it highlights that engine developers are having hard >> time defining proper semantics for CREATE OR REPLACE in the Iceberg >> integrations, so a paragraph or so in the main Iceberg spec may help us >> align expectations. >> >> Regards, >> Vladimir. >> >> On Mon, Oct 21, 2024 at 8:28 AM Jean-Baptiste Onofré <j...@nanthrax.net> >> wrote: >>> >>> Hi Vladimir, >>> >>> As Ryan said, it's not a bug: CREATE OR REPLACE can be seen as "CREATE >>> AND UPDATE" from table format perspective. Specifically for the >>> properties, it makes sense to not delete the current properties as it >>> can be used in several use cases (security, tables grouping, ...). >>> I'm not sure a REST Spec update is required, probably more on the >>> engine side. In the REST Spec, you can create a table >>> (https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L553) >>> and update a table >>> (https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L975), >>> and it's up to the query engine to implement the "CREATE OR REPLACE" >>> with the correct semantic. >>> >>> Regards >>> JB >>> >>> On Sun, Oct 20, 2024 at 9:26 PM Vladimir Ozerov <voze...@querifylabs.com> >>> wrote: >>> > >>> > Hi Ryan, >>> > >>> > Thanks for the clarification. Yes, I think my confusion was caused by the >>> > fact that many engines treat CREATE OR REPLACE as a semantic equivalent >>> > of DROP + CREATE, which is performed atomically (e.g., Flink [1]). Table >>> > formats add history on top of that, which is expected to be retained, no >>> > questions here. Permission propagation also make sense. For properties >>> > things become a bit blurry, because on the one hand there are Iceberg >>> > specific properties, which may affect table maintenance, and on the other >>> > hand there are user-defined properties in the same bag. The question >>> > appeared in the first place because I observed a discrepancy in Trino: >>> > all catalogs except for REST completely overrides table properties on >>> > REPLACE, and REST catalog merges them, which might be confusing to end >>> > users. Perhaps some clarification at the spec level might be useful, >>> > because without agreement between engines the could be some subtle bugs >>> > in multi-engine environments, such as sudden data format changes between >>> > replaces, etc. >>> > >>> > [1] >>> > https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#create-or-replace-table >>> > >>> > Regards, >>> > Vladimir. >>> > >>> > On Sun, Oct 20, 2024 at 9:20 PM rdb...@gmail.com <rdb...@gmail.com> wrote: >>> >> >>> >> Hi Vladimir, >>> >> >>> >> This isn't a bug. The behavior of CREATE OR REPLACE is to replace the >>> >> data of a table, but to maintain things like other refs, snapshot >>> >> history, permissions (if supported by the catalog), and table >>> >> properties. Table properties are replaced if they are set in the >>> >> operation like `b` in your example. This is not the same as a drop and >>> >> create, which may be what you want instead. >>> >> >>> >> The reason for this behavior is that the CREATE OR REPLACE operation is >>> >> used to replace a table's data without needing to handle schema changes >>> >> between versions. For example, producing a daily report table that >>> >> replaces the previous day. However, the table still exists and it is >>> >> valuable to be able to time travel to older versions or to be able to >>> >> use branches and tags. Clearly, that means that table history and refs >>> >> stick around so the table is not completely new every time it is >>> >> replaced. >>> >> >>> >> Adding on to that, properties control things like ref and snapshot >>> >> retention, file format, compression, and other settings. These aren't >>> >> settings that need to be carried through in every replace operation. And >>> >> it would make no sense if you set the snapshot retention because older >>> >> snapshots are retained, only to have it discarded the next time you >>> >> replace the table data. A good way to think about this is that table >>> >> properties are set infrequently, while table data changes regularly. And >>> >> the person changing the data may not be the person tuning the table >>> >> settings. >>> >> >>> >> Hopefully that helps, >>> >> >>> >> Ryan >>> >> >>> >> On Sun, Oct 20, 2024 at 9:45 AM Vladimir Ozerov >>> >> <voze...@querifylabs.com> wrote: >>> >>> >>> >>> Hi, >>> >>> >>> >>> Consider a REST catalog and a user calls "CREATE OR REPLACE <table>" >>> >>> command. When processing the command, engines will usually initiate a >>> >>> "createOrReplace" transaction and add metadata, such as the properties >>> >>> of a new table. >>> >>> >>> >>> Users expect a table to be replaced with a new one if it exists, >>> >>> including properties. However, I observe the following: >>> >>> >>> >>> RESTSessionCatalog loads previous table metadata, adds new properties >>> >>> (MetadataUpdate.SetProperties), and invokes the backend >>> >>> The backend (e.g., Polaris) will typically invoke >>> >>> "CatalogHandler.updateTable." There, the previous table state, >>> >>> including its properties, is loaded >>> >>> Finally, metadata updates are applied, and old table properties are >>> >>> merged with new ones. That is, if the old table has properties [a=1, >>> >>> b=2], and the new table has properties [b=3, c=4], then the final >>> >>> properties would be [a=1, b=3, c=4], while the user expects [b=3, c=4]. >>> >>> >>> >>> It looks like a bug because the user expects complete property >>> >>> replacement instead of a merge. Shall we explicitly clear all previous >>> >>> properties in RESTSessionCatalog.Builder.replaceTransaction? >>> >>> >>> >>> Regards, >>> >>> Vladimir. >>> >>> >>> >>> >>> >>> >>> >>> -- >>> >>> Vladimir Ozerov >>> >>> Founder >>> >>> querifylabs.com >>> > >>> > >>> > >>> > -- >>> > Vladimir Ozerov >>> > Founder >>> > querifylabs.com >> >> >> >> -- >> Vladimir Ozerov >> Founder >> querifylabs.com