alamb opened a new issue, #16303: URL: https://github.com/apache/datafusion/issues/16303
### Is your feature request related to a problem or challenge?

This is an idea that @robtandy brought up on the DataFusion sync call the other day, and I think it would be pretty useful. The use case is "I want to read more than one, but not an entire directory of, parquet files from a remote object store" -- for example, to look at some particular files.

For example, let's say you want to read just these two files:

* s3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet
* s3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_2.parquet

There is currently no way to do so via SQL. You can either read the entire directory:

```sql
> CREATE EXTERNAL TABLE hits
  STORED AS PARQUET
  LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/'
  OPTIONS (aws.region 'eu-central-1');
0 row(s) fetched.
Elapsed 2.928 seconds.
```

Or you can read each file separately:

```sql
> CREATE EXTERNAL TABLE hits
  STORED AS PARQUET
  LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet'
  OPTIONS (aws.region 'eu-central-1');
0 row(s) fetched.
Elapsed 1.017 seconds.
```

### Describe the solution you'd like

I would like to be able to read an arbitrary set of remote parquet files. It would also be awesome to support glob patterns (e.g. `*`), which has been requested before: https://github.com/apache/datafusion/issues/7393

### Describe alternatives you've considered

I suggest we implement a `TableFunction` similar to the DuckDB `read_parquet` function, ONLY in the `datafusion-cli` source. To query the files listed above, this would look like:

```sql
SELECT *
FROM read_parquet([
  'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet',
  'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_2.parquet'
]);
```

From the DuckDB docs: https://duckdb.org/docs/stable/data/parquet/overview.html

```sql
-- read file1, file2, file3
SELECT * FROM read_parquet(['file1.parquet', 'file2.parquet', 'file3.parquet']);
-- Support GLOB access
SELECT * FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);
```

We already support the `parquet_metadata` function in datafusion-cli ([docs](https://datafusion.apache.org/user-guide/cli/usage.html#parquet-metadata)):

```sql
SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
FROM parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"'
LIMIT 3;

+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.
```

Here is the code implementation: https://github.com/apache/datafusion/blob/85f6621a6b1680b40d483a56b10ff3495861ece3/datafusion-cli/src/functions.rs#L322

We can also look at the similar `s3` table function in ClickHouse: https://clickhouse.com/docs/integrations/s3

```sql
DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_*.gz', 'TabSeparatedWithNames');
```

## Open questions

What should we do if the files are on different object stores (e.g. S3 and HTTP)?

```sql
SELECT *
FROM read_parquet([
  'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet',
  -- note a different object store
  's3://public-datasets/hits_compatible/athena_partitioned/hits_2.parquet'
]);
```

Initially, I suggest we not try to support this.

### Additional context

_No response_
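For the glob support mentioned above, a `read_parquet` implementation would need to split each pattern into the literal directory prefix (used to list the object store) and the wildcard remainder (used to filter the listed keys). Here is a minimal, std-only Rust sketch of that idea; `split_glob` is a hypothetical helper written for illustration, not DataFusion's actual `ListingTableUrl` code.

```rust
/// Split a glob-style path such as "folder1/*.parquet" into the literal
/// prefix to list and the remaining wildcard pattern (if any).
/// Hypothetical helper for illustration only.
fn split_glob(path: &str) -> (String, Option<String>) {
    // Characters that commonly introduce a glob pattern.
    const GLOB_CHARS: [char; 3] = ['*', '?', '['];
    match path.find(|c: char| GLOB_CHARS.contains(&c)) {
        // No wildcard: the whole path is a literal file location.
        None => (path.to_string(), None),
        Some(idx) => {
            // Back up to the last '/' before the first wildcard so the
            // prefix is a whole directory path.
            let prefix_end = path[..idx].rfind('/').map(|i| i + 1).unwrap_or(0);
            (
                path[..prefix_end].to_string(),
                Some(path[prefix_end..].to_string()),
            )
        }
    }
}

fn main() {
    // A literal path needs no listing; a glob lists "folder1/" and filters.
    println!("{:?}", split_glob("hits_1.parquet"));
    println!("{:?}", split_glob("folder1/*.parquet"));
}
```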
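On the open question: one simple first-version policy is to group the requested paths by the object store they would resolve to (scheme plus authority) and return an error when more than one store is involved. A std-only Rust sketch of that check follows; `store_key` and `group_by_store` are hypothetical names invented here, not DataFusion APIs, though DataFusion does register object stores keyed by URL scheme and authority in a similar spirit.

```rust
use std::collections::BTreeMap;

/// Derive an object-store "key" (scheme + authority) from a URL-like path.
/// Hypothetical helper for illustration only.
fn store_key(url: &str) -> Option<String> {
    let (scheme, rest) = url.split_once("://")?;
    let authority = rest.split('/').next().unwrap_or("");
    Some(format!("{scheme}://{authority}"))
}

/// Group paths by the object store they would resolve to. A `read_parquet`
/// implementation could reject input that spans more than one group.
fn group_by_store(paths: &[&str]) -> BTreeMap<String, Vec<String>> {
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for p in paths {
        if let Some(key) = store_key(p) {
            groups.entry(key).or_default().push(p.to_string());
        }
    }
    groups
}

fn main() {
    let paths = [
        "https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet",
        "s3://public-datasets/hits_compatible/athena_partitioned/hits_2.parquet",
    ];
    let groups = group_by_store(&paths);
    // Two distinct stores, so the simple first version would return an error.
    println!("{} distinct object store(s)", groups.len());
}
```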