animodak7 opened a new issue, #18482:
URL: https://github.com/apache/datafusion/issues/18482
### Is your feature request related to a problem or challenge?
We want to read partitioned files together with extended column values that
are generated per file at runtime. Currently, when a listing table partition
contains multiple files, loadNextBatch has no knowledge of which file a
returned batch came from, so it is not possible to append per-file data once
the scan results arrive as record batches. We cannot store these values in the
files themselves because they are generated at runtime. There should be some
way to extend the file schema and file stream with additional columns (similar
to table_partition_values).
Example: the partition directory `/data1/` contains 3 files:
`/data1/file1`, `/data1/file2`, `/data1/file3`.
The file schema is `{ row_id: Int32, b: Int32 }`.
Suppose we want to perform an operation based on a running
cumulative_total_rows across the files:
- file1 has 5 rows: cumulative_total_rows = 5
- file2 has 2 rows: cumulative_total_rows = 5 + 2 = 7
- file3 has 10 rows: cumulative_total_rows = 5 + 2 + 10 = 17

The derived schema should be `{ row_id: Int32, b: Int32, cumulative_total_rows }`,
and we should be able to evaluate the expression `row_id + cumulative_total_rows`.
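To make the arithmetic in the example concrete, here is a minimal std-only Rust sketch. It does not use any DataFusion API; the helper names `cumulative_totals` and `add_cumulative`, and the sample row_id values `0` and `1` for file2, are assumptions made purely for illustration.

```rust
/// Running row totals per file, in listing order.
/// (Hypothetical helper, not part of DataFusion.)
fn cumulative_totals(row_counts: &[u64]) -> Vec<u64> {
    row_counts
        .iter()
        .scan(0u64, |acc, &n| {
            *acc += n; // add this file's row count to the running total
            Some(*acc)
        })
        .collect()
}

/// Evaluate `row_id + cumulative_total_rows` for every row of one file.
/// The cumulative value is constant for all rows of that file.
fn add_cumulative(row_ids: &[i64], cumulative_total_rows: i64) -> Vec<i64> {
    row_ids.iter().map(|id| id + cumulative_total_rows).collect()
}

fn main() {
    // file1 has 5 rows, file2 has 2 rows, file3 has 10 rows.
    let totals = cumulative_totals(&[5, 2, 10]);
    println!("{:?}", totals); // [5, 7, 17]

    // Assumed row_id values for the two rows of file2.
    let file2_row_ids = [0, 1];
    println!("{:?}", add_cumulative(&file2_row_ids, totals[1] as i64)); // [7, 8]
}
```

The point of the sketch is that the extra value is known only to the caller and is constant within a file, which is why the scan needs to know which file each batch came from.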
### Describe the solution you'd like
DataFusion's listing table supports partitioning through table_partition_cols,
which pulls partition values from the folder structure and adds them to the
file schema. Values for the table partition columns are populated internally
during ListingTable.scan(). Since the partition values are available in both
the file stream and the file schema, it is possible to use them in expressions.
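For readers unfamiliar with how partition values come out of a folder structure, the following std-only Rust sketch shows the general idea, assuming Hive-style `key=value` directory names. It is not DataFusion's implementation; the function `partition_values` and the example path are hypothetical.

```rust
/// Extract `key=value` directory segments from a file path.
/// The last segment (the file name) is skipped.
/// (Illustrative helper only, not DataFusion code.)
fn partition_values(path: &str) -> Vec<(String, String)> {
    let mut segments: Vec<&str> = path.split('/').collect();
    segments.pop(); // drop the file name
    segments
        .into_iter()
        .filter_map(|seg| seg.split_once('=')) // keep only `key=value` segments
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

fn main() {
    // Hypothetical path with two partition directories.
    let values = partition_values("/data1/year=2024/month=01/file1");
    for (col, val) in &values {
        println!("{col} = {val}");
    }
}
```

Every row read from that file would carry the same `year` and `month` values, which is the behaviour the request wants to mirror for caller-supplied values.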
We want to extend this functionality to support passing extended_cols in
TableScanConfig. These would differ from table_partition_cols in that the user
provides the values instead of having them derived from table_path. As with
table_partition_cols, extended_cols would be appended to both file_stream and
file_schema.
Add another parameter to ListingOptions that accepts extended_cols. Each entry
should have a col_name and a col_values map, where col_values maps each
file_name present in the ListingTableURL directory to that file's value.
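One possible shape for this input, as a std-only Rust sketch: the struct `ExtendedCol`, its fields, and the function `extended_column` are hypothetical names for illustration and do not exist in DataFusion. The value type is fixed to `i64` here only to keep the sketch short.

```rust
use std::collections::HashMap;

/// Hypothetical shape of one extended column: a name plus one value per file.
struct ExtendedCol {
    col_name: String,
    /// file name -> value to repeat for every row read from that file
    col_values: HashMap<String, i64>,
}

/// Build the constant column for a batch of `num_rows` rows read from `file_name`.
/// Returns `None` when the caller supplied no value for that file.
fn extended_column(col: &ExtendedCol, file_name: &str, num_rows: usize) -> Option<Vec<i64>> {
    col.col_values.get(file_name).map(|v| vec![*v; num_rows])
}

fn main() {
    let col = ExtendedCol {
        col_name: "cumulative_total_rows".to_string(),
        col_values: HashMap::from([
            ("/data1/file1".to_string(), 5),
            ("/data1/file2".to_string(), 7),
            ("/data1/file3".to_string(), 17),
        ]),
    };
    // A batch of 2 rows read from file2 gets the value 7 on every row.
    let values = extended_column(&col, "/data1/file2", 2);
    println!("{} = {:?}", col.col_name, values);
}
```

How a missing file entry should be handled (error, null, or default) is left open here; the sketch simply returns `None`.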
### Describe alternatives you've considered
Any other solution that exposes some information in the RecordBatch results
to identify which file in the partition the data is streaming from would also
work.
Alternative -
Allow PhysicalExprAdapter to use ObjectMeta. Currently PhysicalExprAdapter
has with_partition_values; with access to ObjectMeta we could append file_name
to file_stream. A separate MemTable holding file_name plus the extended columns
could then be joined with the current ListingTable to get the desired
combination of columns.
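The join this alternative relies on can be sketched in plain Rust as follows. This is only an illustration of the idea, not DataFusion code: `ScannedRow` and `join_extended` are hypothetical, a `HashMap` stands in for the MemTable, and the sample rows are made up.

```rust
use std::collections::HashMap;

/// One scanned row tagged with the file it came from. The `file_name` field is
/// the piece the alternative would need the scan to expose.
#[derive(Debug, Clone, PartialEq)]
struct ScannedRow {
    file_name: String,
    row_id: i32,
    b: i32,
}

/// Inner join of scanned rows against a side table keyed by file name.
/// Rows whose file has no entry in the side table are dropped.
fn join_extended(rows: &[ScannedRow], side: &HashMap<String, i64>) -> Vec<(ScannedRow, i64)> {
    rows.iter()
        .filter_map(|r| side.get(&r.file_name).map(|v| (r.clone(), *v)))
        .collect()
}

fn main() {
    // Side table: file_name -> cumulative_total_rows (stand-in for the MemTable).
    let side = HashMap::from([
        ("/data1/file1".to_string(), 5i64),
        ("/data1/file2".to_string(), 7i64),
    ]);
    // Made-up scan output, one row per file.
    let rows = vec![
        ScannedRow { file_name: "/data1/file1".to_string(), row_id: 0, b: 10 },
        ScannedRow { file_name: "/data1/file2".to_string(), row_id: 0, b: 20 },
    ];
    for (row, total) in join_extended(&rows, &side) {
        println!("{:?} -> {}", row, i64::from(row.row_id) + total);
    }
}
```

Compared with appending the columns during the scan, this approach needs file_name materialized on every row and an extra join step.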
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]