alamb opened a new issue, #16303: URL: https://github.com/apache/datafusion/issues/16303
### Is your feature request related to a problem or challenge?

This is an idea that @robtandy brought up on the DataFusion sync call the other day, and I think it would be pretty useful. The use case is "I want to read more than one, but not an entire directory of, parquet files from a remote object store" -- for example, to look at some particular files.

For example, let's say you want to read just these two files:

* s3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet
* s3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_2.parquet

There is currently no way to do so via SQL. You can either read the entire directory:

```sql
> CREATE EXTERNAL TABLE hits
  STORED AS PARQUET
  LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/'
  OPTIONS (aws.region 'eu-central-1');
0 row(s) fetched.
Elapsed 2.928 seconds.
```

Or you can read each file separately:

```sql
> CREATE EXTERNAL TABLE hits
  STORED AS PARQUET
  LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet'
  OPTIONS (aws.region 'eu-central-1');
0 row(s) fetched.
Elapsed 1.017 seconds.
```

### Describe the solution you'd like

I would like to be able to read an arbitrary set of remote parquet files. It would also be awesome to support glob patterns (e.g. `*`), which has been requested before: https://github.com/apache/datafusion/issues/7393

### Describe alternatives you've considered

I suggest we implement a `TableFunction` similar to the DuckDB `read_parquet` function, ONLY in the `datafusion-cli` source. To query the files listed above, this would look like:

```sql
SELECT *
FROM read_parquet([
  'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet',
  'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_2.parquet'
]);
```

From the DuckDB docs: https://duckdb.org/docs/stable/data/parquet/overview.html

```sql
-- read file1, file2, file3
SELECT * FROM read_parquet(['file1.parquet', 'file2.parquet', 'file3.parquet']);
-- Support GLOB access
SELECT * FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);
```

We already support the `parquet_metadata` function in datafusion-cli ([docs](https://datafusion.apache.org/user-guide/cli/usage.html#parquet-metadata)):

```sql
SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
FROM parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"'
LIMIT 3;

+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.
```

Here is the code implementation: https://github.com/apache/datafusion/blob/85f6621a6b1680b40d483a56b10ff3495861ece3/datafusion-cli/src/functions.rs#L322

We can also look at the similar `s3` table function in ClickHouse: https://clickhouse.com/docs/integrations/s3

```sql
DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_*.gz', 'TabSeparatedWithNames');
```

## Open questions

What should we do if the files are on different object stores (e.g. S3 and HTTP)?

```sql
SELECT *
FROM read_parquet([
  'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet',
  -- note a different object store
  's3://public-datasets/hits_compatible/athena_partitioned/hits_2.parquet'
]);
```

Initially, I suggest we not try to support this.

### Additional context

_No response_
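For the glob support mentioned above, a `read_parquet` implementation would need to split each pattern into the literal directory prefix (used to list the object store) and the wildcard remainder (used to filter the listed keys). Here is a minimal, std-only Rust sketch of that idea; `split_glob` is a hypothetical helper written for illustration, not DataFusion's actual `ListingTableUrl` code.

```rust
/// Split a glob-style path such as "folder1/*.parquet" into the literal
/// prefix to list and the remaining wildcard pattern (if any).
/// Hypothetical helper for illustration only.
fn split_glob(path: &str) -> (String, Option<String>) {
    // Characters that commonly introduce a glob pattern.
    const GLOB_CHARS: [char; 3] = ['*', '?', '['];
    match path.find(|c: char| GLOB_CHARS.contains(&c)) {
        // No wildcard: the whole path is a literal file location.
        None => (path.to_string(), None),
        Some(idx) => {
            // Back up to the last '/' before the first wildcard so the
            // prefix is a whole directory path.
            let prefix_end = path[..idx].rfind('/').map(|i| i + 1).unwrap_or(0);
            (
                path[..prefix_end].to_string(),
                Some(path[prefix_end..].to_string()),
            )
        }
    }
}

fn main() {
    // A literal path needs no listing; a glob lists "folder1/" and filters.
    println!("{:?}", split_glob("hits_1.parquet"));
    println!("{:?}", split_glob("folder1/*.parquet"));
}
```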
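On the open question: one simple first-version policy is to group the requested paths by the object store they would resolve to (scheme plus authority) and return an error when more than one store is involved. A std-only Rust sketch of that check follows; `store_key` and `group_by_store` are hypothetical names invented here, not DataFusion APIs, though DataFusion does register object stores keyed by URL scheme and authority in a similar spirit.

```rust
use std::collections::BTreeMap;

/// Derive an object-store "key" (scheme + authority) from a URL-like path.
/// Hypothetical helper for illustration only.
fn store_key(url: &str) -> Option<String> {
    let (scheme, rest) = url.split_once("://")?;
    let authority = rest.split('/').next().unwrap_or("");
    Some(format!("{scheme}://{authority}"))
}

/// Group paths by the object store they would resolve to. A `read_parquet`
/// implementation could reject input that spans more than one group.
fn group_by_store(paths: &[&str]) -> BTreeMap<String, Vec<String>> {
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for p in paths {
        if let Some(key) = store_key(p) {
            groups.entry(key).or_default().push(p.to_string());
        }
    }
    groups
}

fn main() {
    let paths = [
        "https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet",
        "s3://public-datasets/hits_compatible/athena_partitioned/hits_2.parquet",
    ];
    let groups = group_by_store(&paths);
    // Two distinct stores, so the simple first version would return an error.
    println!("{} distinct object store(s)", groups.len());
}
```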