alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-3016555361
I have been thinking about this one a lot and I am sorry I haven't written this up before now. I was trying to collect my thoughts.

I feel the core issue is a tension between

1. `datafusion-cli` as an easy to use, fully pre-integrated tool for querying files (locally and remotely)
2. Using `datafusion-cli` as a testing vehicle for DataFusion development itself
3. Using `datafusion-cli` as an example of how to integrate various DataFusion features (like aws s3, etc)

`datafusion-cli` already has quite a few features that are outside the core usecase of testing DataFusion (e.g. aws s3 auth support)

The more I think about it, integrating easy to use tpch functions into `datafusion-cli` feels like it is part of the first of these, and thus maybe doesn't actually belong in the datafusion repository's `datafusion-cli` itself after all

Some possible paths forward (not mutually exclusive)

1. Document how to use `tpchgen-rs` to create TPCH data that can be queried by `datafusion-cli` (somewhere [in the cli docs](https://datafusion.apache.org/user-guide/cli/index.html)) but don't actually add more code to `datafusion-cli`
2. Move the https://github.com/clflushopt/datafusion-tpch repo into the `datafusion-contrib` github organization so it is more discoverable
3. Actually implement the datafusion tpch functions from https://github.com/clflushopt/datafusion-tpch in the core datafusion repository (along with a dependency on `tpchgen-rs`)
4. Create a new repository for a `datafusion-cli++` (probably need a better name) with the explicit goal of being a fully pre-integrated CLI experience like `duckdb`

I have been dreaming about the `datafusion-cli++` idea for a while now too. I think it would be really cool technically to build a tool that was able to query remote sources really quickly and easily (aka kind of a `daft` like experience) -- think caching parquet metadata, catalog, iceberg, etc.
But I get ahead of myself and I haven't convinced myself this can be done in a reasonable amount of time

@matthewmturner is working on something similar in https://github.com/datafusion-contrib/datafusion-dft but that also includes other things such as a TUI (in fact he seems to be using tpchgen as well: https://github.com/datafusion-contrib/datafusion-dft/pull/331 :) )