adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2646833158
> when I save a table to csv, it will also save rowid into csv. no system will do like this.

My problem is with this statement. I don't think there's a universal definition and use case for "system columns". Spark has one. Postgres has another. Our system has another.

You use `_rowid` as an example. Is that the `_rowid` within a single file? Or is that the `_rowid` of the entire table (similar to Postgres' `ctid`)? I think it's reasonable for both to exist and for both to be considered system columns. The former does somewhat lose its meaning when copied through a query from one file to another, and it only really makes sense to generate it dynamically when reading a file. The latter could be copied from one file to another without issues.

In our case we use system columns to speed up access to JSON: we take a row with json data such as `json_col: text = [{"a": 1, "b": "lorem"}, {"a": 2}]` and split it into `_lf__json_col__a: int = [1, 2]` and `__lf__json_col__b: text = ["lorem", null]`. This is a well-known technique; it's basically [what ClickHouse does](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse). We write these to files (they are not dynamically generated) and want them to be treated as normal columns when reading/writing. We just don't want them to show up when a user does `select *`. Is this not a valid use case for system columns?

My thought is to establish a piece of metadata marking a column as a system column, with the implementation doing nothing beyond excluding such columns from `select *` unless they are explicitly included. That seems to me like a universally agreed-upon thing to do with system columns. Anything else that is not part of a universal definition of a system column is IMO something that should be implemented system by system by rewriting logical plans, customizing reading and writing, etc. Having it as field metadata means this information should be accessible from most hook points in DataFusion.
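To make the proposed behavior concrete, here is a minimal sketch in plain Python rather than DataFusion code. The `Field` class, the `system_column` metadata key, and the `project` helper are all hypothetical names invented for illustration; none of them are an existing DataFusion or Arrow API. The only rule it encodes is the one described above: a wildcard skips fields flagged as system columns, while explicitly named columns always resolve.

```python
from dataclasses import dataclass, field

# Hypothetical metadata key; an assumption for this sketch, not an
# established DataFusion or Arrow convention.
SYSTEM_COLUMN_KEY = "system_column"


@dataclass
class Field:
    """Toy stand-in for a schema field carrying string key/value metadata."""
    name: str
    data_type: str
    metadata: dict = field(default_factory=dict)


def is_system_column(f: Field) -> bool:
    # A field is a system column only if the metadata flag says so.
    return f.metadata.get(SYSTEM_COLUMN_KEY) == "true"


def project(schema: list, select_list: list) -> list:
    """Resolve a select list to column names.

    "*" expands to every field that is NOT flagged as a system column.
    Explicitly named columns resolve whether or not they are flagged.
    """
    known = {f.name for f in schema}
    resolved = []
    for item in select_list:
        if item == "*":
            resolved.extend(f.name for f in schema if not is_system_column(f))
        elif item in known:
            resolved.append(item)
        else:
            raise KeyError(f"unknown column: {item}")
    return resolved


# Column names are taken verbatim from the comment above.
schema = [
    Field("json_col", "text"),
    Field("_lf__json_col__a", "int", {SYSTEM_COLUMN_KEY: "true"}),
    Field("__lf__json_col__b", "text", {SYSTEM_COLUMN_KEY: "true"}),
]
```

With this schema, `project(schema, ["*"])` yields only `json_col`, while `project(schema, ["*", "_lf__json_col__a"])` also includes the explicitly requested system column. Everything else (how the columns are generated, read, or written) stays outside this rule, which is the point of keeping the shared definition this small.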