Thank you for your reply.

Column-level comments are already part of the schema definition. Would adding 
just one table-level comment really cause noticeable bloat? For example, if a 
table has 20 commented columns, one more comment of similar length would 
increase the comment text stored in the schema by only about 1/20th (5%).

Also, using schema-id as part of the property key feels like a workaround 
rather than a proper solution. It is not part of the specification, so any tool 
or integration (including LLM-based ones) would need extra logic to interpret 
it. A standardized, schema-level field would avoid that complexity and make the 
metadata easier to consume consistently.
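
To illustrate the extra logic mentioned above, here is a minimal Python sketch 
of what each consumer would have to implement under the key-suffix workaround. 
The key format `comment.schema-<id>` and the function name `table_comment` are 
hypothetical; only `properties`, `current-schema-id`, and the `comment` 
convention come from this thread and the table metadata format.

```python
from typing import Optional


def table_comment(metadata: dict, schema_id: Optional[int] = None) -> Optional[str]:
    """Resolve a table comment under the assumed key-suffix convention."""
    props = metadata.get("properties", {})
    if schema_id is None:
        schema_id = metadata["current-schema-id"]
    # Assumption: per-schema comments live under "comment.schema-<id>".
    # Nothing in the spec defines this key, so every tool must agree on it.
    versioned = props.get(f"comment.schema-{schema_id}")
    if versioned is not None:
        return versioned
    # Fall back to the unversioned "comment" convention.
    return props.get("comment")


# Illustrative metadata; the comment texts are made up.
metadata = {
    "current-schema-id": 1,
    "properties": {
        "comment": "Orders table",
        "comment.schema-1": "Orders table, including refunds",
    },
}
print(table_comment(metadata))     # uses the key for the current schema
print(table_comment(metadata, 0))  # no key for schema 0, falls back
```

Every tool that reads the metadata would need an equivalent of this function, 
and all of them would have to agree on the same key format.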

If bloat is a real concern, perhaps column-level comments should also be moved 
out of the schema, with a proper mechanism to version and manage them 
separately.

Thank you,
Taeyun.

-----Original Message-----
From: "Gang Wu" <ust...@gmail.com>
To: <dev@iceberg.apache.org>;
Cc:
Sent: 2025-07-25 (Fri) 11:20:08 (UTC+09:00)
Subject: Re: Thoughts on Adding a `doc` Property for Schema Objects


I'd rather not complicate the schema definitions in the table metadata. You may 
append the `schema-id` to the table property key to manage different schema 
versions.


Storing verbose text in each field may bloat the metadata storage, especially 
when there are many duplicate `doc`s because schema evolution happens often.


Best,
Gang


On Fri, Jul 25, 2025 at 9:25 AM Taeyun Kim <taeyun....@innowireless.com> wrote:

Thank you for your response.
As I understand it, the table description is currently stored as a table 
property within the table metadata’s `properties` map.

In my opinion, this approach has a few issues:

- Table metadata `properties` are not versioned. As a result, when querying an 
older snapshot, the description may be inaccurate because the value reflects 
only the current state.
- According to the specification, the purpose of table metadata properties is: 
“A string to string map of table properties. This is used to control settings 
that affect reading and writing and is not intended to be used for arbitrary 
metadata.” Based on this, a comment seems to fall under “arbitrary metadata,” 
and therefore may not be an appropriate use of properties.
- Table comments seem to have become significant enough that relying on a 
convention alone may no longer be sufficient. It might be worth considering a 
standardized, schema-level field for them.
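
The first point can be sketched in Python as follows. This is only an 
illustration with made-up IDs and comment text; the function name 
`comment_for_snapshot` is hypothetical. It assumes the metadata JSON has been 
loaded into a dict: each snapshot records the `schema-id` it was written with, 
while `properties` is a single map shared by all of them.

```python
# Illustrative table metadata with two schemas and two snapshots.
metadata = {
    "current-schema-id": 1,
    "schemas": [
        {"schema-id": 0, "type": "struct", "fields": []},
        {"schema-id": 1, "type": "struct", "fields": []},
    ],
    "snapshots": [
        {"snapshot-id": 100, "schema-id": 0},
        {"snapshot-id": 200, "schema-id": 1},
    ],
    # One map for the whole table: no schema or snapshot dimension.
    "properties": {"comment": "Orders, including refunds (added in schema 1)"},
}


def comment_for_snapshot(metadata: dict, snapshot_id: int):
    """Return (schema-id of the snapshot, comment a reader would see)."""
    snapshot = next(
        s for s in metadata["snapshots"] if s["snapshot-id"] == snapshot_id
    )
    # The schema is resolved per snapshot, but the comment is not.
    return snapshot["schema-id"], metadata["properties"].get("comment")


old = comment_for_snapshot(metadata, 100)
new = comment_for_snapshot(metadata, 200)
print(old[0], new[0])    # different schema IDs
print(old[1] == new[1])  # same comment for both snapshots
```

Reading the older snapshot resolves schema 0 but still returns the comment 
written for schema 1.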

Thank you.
Taeyun

-----Original Message-----
From: "Ryan Blue" <rdb...@gmail.com>
To: <dev@iceberg.apache.org>;
Cc:
Sent: 2025-07-25 (Fri) 08:48:48 (UTC+09:00)
Subject: Re: Thoughts on Adding a `doc` Property for Schema Objects


Iceberg does allow you to store table descriptions. The convention is to use a 
table property, "comment". While this isn't a schema-level doc/comment, I don't 
know of anything that makes a distinction between schema description and table 
description, so I think it should work for your use.



On Tue, Jul 22, 2025 at 5:48 PM 김태연 (Taeyun Kim) <taeyun....@innowireless.com> 
wrote:

Hi,

With the growing trend of using LLMs to automatically generate SQL, it feels 
increasingly important to manage descriptions of database tables and columns in 
a way that these tools can easily access.

In the Iceberg specification, comments for schema fields (i.e., columns) can be 
specified using the `doc` property within the `fields` array of a `struct` 
type. However, there doesn’t seem to be a way to specify a comment for the root 
struct type itself - that is, for the table as a whole.

From what I can tell, OLAP DBMSs today may handle table-level comments by 
storing them in the `properties` map within the table metadata under various 
non-standard keys. But since a table comment conceptually belongs to the 
schema, and can vary by schema, it feels like the `properties` map within the 
table metadata might not be the best place for it.

Would it make sense to allow a `doc` property on the `schema` object (the root 
struct type), alongside `schema-id` and `identifier-field-ids`, so that a 
description for the schema itself can be included?
It seems like it would be helpful, especially for tooling and LLM-related use 
cases.
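
As a rough sketch of the idea, the schema object could look like the Python 
dict below, and a tool could build a description from it without any 
key-naming convention. The schema-level `doc` is the proposal in this message, 
not part of the current specification; the field names, comment texts, and the 
helper `describe` are made up for illustration.

```python
schema = {
    "type": "struct",
    "schema-id": 1,
    "identifier-field-ids": [1],
    # Proposed here; NOT defined by the current specification.
    "doc": "One row per customer order.",
    "fields": [
        {"id": 1, "name": "order_id", "required": True, "type": "long",
         "doc": "Unique order identifier."},
        {"id": 2, "name": "amount", "required": False, "type": "double"},
    ],
}


def describe(schema: dict) -> str:
    """Render table and column descriptions, e.g. for an LLM prompt.

    Only handles top-level fields with primitive types, for brevity.
    """
    missing = "(no description)"
    lines = [f"table: {schema.get('doc', missing)}"]
    for field in schema["fields"]:
        lines.append(
            f"  {field['name']} ({field['type']}): {field.get('doc', missing)}"
        )
    return "\n".join(lines)


print(describe(schema))
```

Because the description travels with the schema, it would be versioned 
together with `schema-id` and the column-level `doc` values.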

Curious to hear your thoughts.
Apologies if I’m overlooking something or if this has already been discussed.

Thank you,
Taeyun
