animodak7 opened a new issue, #18482:
URL: https://github.com/apache/datafusion/issues/18482

   ### Is your feature request related to a problem or challenge?
   
   We want to read the files of a partition together with extended column 
values that are generated for each file at runtime. Currently, when a listing 
table partition contains multiple files, loadNextBatch does not know which file 
the batch it returns came from, so the per-file values cannot be appended once 
the scan results arrive as record batches. These values cannot be stored in the 
files themselves because they are generated at runtime. There should be a way 
to extend the file schema and the file stream with additional columns (similar 
to table_partition_values).
   
   Example: the partition directory is /data1/ and it contains 3 files:
   /data1/file1, /data1/file2, /data1/file3.
   The file schema is { row_id: Int32, b: Int32 }.
   Suppose we want to perform an operation based on a cumulative_total_rows 
value computed across the files:
   file1 has 5 rows, so cumulative_total_rows = 5; file2 has 2 rows, so 
cumulative_total_rows = 5 + 2 = 7; file3 has 10 rows, so 
cumulative_total_rows = 5 + 2 + 10 = 17.
   The derived schema should be { row_id: Int32, b: Int32, 
cumulative_total_rows }.
   
   We should then be able to evaluate the expression 
`row_id + cumulative_total_rows`.
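
The arithmetic in the example can be sketched in plain Rust, independent of DataFusion. This is only an illustration of the intended semantics: the function names `cumulative_total_rows` and `eval_expr` are hypothetical helpers, not DataFusion APIs, and the row counts are the ones from the example above.

```rust
/// Running total of rows per file, in listing order.
/// Hypothetical helper for illustration; not part of DataFusion.
fn cumulative_total_rows(rows_per_file: &[(&str, i32)]) -> Vec<(String, i32)> {
    let mut total = 0;
    rows_per_file
        .iter()
        .map(|(name, rows)| {
            // Each file's value is the sum of its own rows and all earlier files' rows.
            total += rows;
            (name.to_string(), total)
        })
        .collect()
}

/// Evaluate `row_id + cumulative_total_rows` for every row of a single file,
/// given that file's (constant) cumulative_total_rows value.
fn eval_expr(row_ids: &[i32], cumulative: i32) -> Vec<i32> {
    row_ids.iter().map(|id| id + cumulative).collect()
}
```

With the three files from the example as input, the per-file values come out as 5, 7, and 17, and every row read from a given file sees the same constant, which is how partition values already behave.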
   
   ### Describe the solution you'd like
   
   The DataFusion listing table supports partitioning through 
table_partition_cols, which pulls partition values from the folder structure 
and adds them to the file schema. Values for the table partition columns are 
populated internally during listingTable.scan(). Since the partition values are 
available in the file stream and file schema, it is possible to use them in 
expressions.
   We want to extend this functionality to support passing extended_cols in 
TableScanConfig. These would differ from table_partition_cols in that the user 
provides the values instead of having them derived from table_path. But, like 
table_partition_cols, extended_cols would be appended to file_stream and 
file_schema.
   
   Add another parameter to ListingOptions that accepts extended_cols. Each 
entry should have a col_name and a col_values map, and col_values must map 
each file_name present in the ListingTableURL directory to that file's value.
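
One possible shape for such an entry is sketched below in plain Rust. Everything here is an assumption for discussion: `ExtendedCol`, `value_for`, and `missing_files` are hypothetical names, the value type is fixed to `i64` only to keep the sketch short, and none of this exists in DataFusion today.

```rust
use std::collections::HashMap;

/// Hypothetical description of one extended column (not an existing DataFusion type).
struct ExtendedCol {
    /// Name of the column appended to the file schema.
    col_name: String,
    /// file_name -> value applied to every row read from that file.
    col_values: HashMap<String, i64>,
}

impl ExtendedCol {
    /// Value to append for rows coming from `file_name`, if one was supplied.
    fn value_for(&self, file_name: &str) -> Option<i64> {
        self.col_values.get(file_name).copied()
    }

    /// Listed files that have no entry in `col_values`.
    /// A scan could use this to reject an incomplete mapping up front.
    fn missing_files<'a>(&self, listed: &[&'a str]) -> Vec<&'a str> {
        listed
            .iter()
            .copied()
            .filter(|f| !self.col_values.contains_key(*f))
            .collect()
    }
}
```

Checking the mapping against the listed files before the scan starts would surface a missing file_name as a planning error rather than as a null or a failure partway through the stream; whether that is the desired behavior is an open design question.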
   
   ### Describe alternatives you've considered
   
   Any other solution that exposes information in the RecordBatch results 
identifying which file of the partition the data is streaming from would work.
   Alternative:
   Allow PhysicalExprAdapter to use ObjectMeta. Currently PhysicalExprAdapter 
has with_partition_values; with access to ObjectMeta we could append file_name 
to file_stream. A separate MemTable holding file_name plus the extended columns 
could then be joined with the current listing table to get the desired 
combination of columns.
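
The join in this alternative can be illustrated without DataFusion. The sketch below assumes each scanned row already carries its file_name; `TaggedRow` and `join_extended` are hypothetical names, and the `HashMap` stands in for the MemTable of file_name plus extended column values.

```rust
use std::collections::HashMap;

/// One scanned row, tagged with the file it was read from (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct TaggedRow {
    file_name: String,
    row_id: i32,
    b: i32,
}

/// Inner join of scanned rows with a side table keyed by file_name.
/// Rows whose file has no entry in `side` are dropped, as in an inner join.
fn join_extended(rows: &[TaggedRow], side: &HashMap<String, i32>) -> Vec<(TaggedRow, i32)> {
    rows.iter()
        .filter_map(|r| side.get(&r.file_name).map(|v| (r.clone(), *v)))
        .collect()
}
```

Compared with the extended_cols proposal, this approach needs only file_name in the stream, at the cost of an extra join in every query that uses the per-file values.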
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]