AdamGS opened a new issue, #14607:
URL: https://github.com/apache/datafusion/issues/14607

   ### Is your feature request related to a problem or challenge?
   
   We’re implementing a file format, 
[Vortex](https://github.com/spiraldb/vortex), which has no “row groups” or 
similar concept, meaning a byte range might fall completely within one column, 
and aligning columns is a non-trivial task. I would like to be able to express 
repartitioning logic that only splits files logically (by rows and not by bytes).
   The existing repartitioning logic in DataFusion (specifically 
`FileGroupPartitioner` and `FileScanConfig::repartitioned`) assumes that files 
can be split logically by byte ranges (`FileRange`), and even the rustdoc on it 
seems very Parquet-specific (even though other formats do support it). This 
assumes some mapping/alignment between the physical layout and the logical one.
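
   To make the distinction concrete, here is a minimal sketch of what splitting "by rows and not by bytes" could look like. It is standalone Rust, not DataFusion code: `RowRange` and `split_by_rows` are hypothetical names introduced only for illustration, and the sketch assumes the total row count of the file is known up front.

```rust
/// Hypothetical half-open row range `[start, end)` within a single file.
/// This type is an illustration and does not exist in DataFusion.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RowRange {
    pub start: u64,
    pub end: u64,
}

/// Split `total_rows` rows into at most `target_partitions` contiguous,
/// near-equal row ranges. No byte offsets are involved, so the physical
/// layout of the file never has to line up with the split points.
pub fn split_by_rows(total_rows: u64, target_partitions: usize) -> Vec<RowRange> {
    if total_rows == 0 || target_partitions == 0 {
        return Vec::new();
    }
    // Never create more partitions than there are rows.
    let parts = (target_partitions as u64).min(total_rows);
    let base = total_rows / parts;
    let extra = total_rows % parts; // the first `extra` ranges get one more row

    let mut ranges = Vec::with_capacity(parts as usize);
    let mut start = 0u64;
    for i in 0..parts {
        let len = base + u64::from(i < extra);
        ranges.push(RowRange { start, end: start + len });
        start += len;
    }
    ranges
}
```

   For example, `split_by_rows(10, 3)` yields the row ranges `[0, 4)`, `[4, 7)` and `[7, 10)`, regardless of how many bytes each column occupies on disk.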
   
   
   ### Describe the solution you'd like
   
   It seems like the best way would be to configure `FileGroupPartitioner` through 
`FileSource`. The other option would be to make `FileRange` an enum, but that 
would still mean we (and any other format with a similar structure) would have 
to maintain our own repartitioning logic.
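
   As a rough illustration of the first option, the sketch below lets each file source decide how a file is split, with byte ranges as the default. It is standalone Rust under stated assumptions: `FileSplit`, `FileInfo`, `SplitFile`, `ByteAddressed` and `RowAddressed` are all hypothetical names, not existing DataFusion APIs, and the real `FileSource` and `FileGroupPartitioner` interfaces may call for a different shape.

```rust
/// Hypothetical description of one piece of a file.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FileSplit {
    /// Byte range `[start, end)`, in the spirit of today's `FileRange`.
    Bytes { start: u64, end: u64 },
    /// Row range `[start, end)`, for formats without row groups.
    Rows { start: u64, end: u64 },
}

/// Hypothetical per-file metadata available at planning time.
pub struct FileInfo {
    pub size_bytes: u64,
    pub row_count: Option<u64>,
}

/// Divide `[0, total)` into at most `parts` contiguous, near-equal ranges.
fn even_ranges(total: u64, parts: u64) -> Vec<(u64, u64)> {
    if total == 0 || parts == 0 {
        return Vec::new();
    }
    let parts = parts.min(total);
    let (base, rem) = (total / parts, total % parts);
    (0..parts)
        .map(|i| {
            let start = i * base + i.min(rem);
            let end = (i + 1) * base + (i + 1).min(rem);
            (start, end)
        })
        .collect()
}

/// Hypothetical hook a file source could implement to own the split logic.
/// An empty result means "do not split this file".
pub trait SplitFile {
    /// Default behaviour: even byte ranges.
    fn split(&self, file: &FileInfo, parts: u64) -> Vec<FileSplit> {
        even_ranges(file.size_bytes, parts)
            .into_iter()
            .map(|(start, end)| FileSplit::Bytes { start, end })
            .collect()
    }
}

/// A format whose readers can start from an arbitrary byte range.
pub struct ByteAddressed;
impl SplitFile for ByteAddressed {}

/// A format that can only be split on row boundaries.
pub struct RowAddressed;
impl SplitFile for RowAddressed {
    fn split(&self, file: &FileInfo, parts: u64) -> Vec<FileSplit> {
        match file.row_count {
            Some(rows) => even_ranges(rows, parts)
                .into_iter()
                .map(|(start, end)| FileSplit::Rows { start, end })
                .collect(),
            // Without a known row count, leave the file unsplit.
            None => Vec::new(),
        }
    }
}
```

   With a hook like this, the shared partitioner would only need to ask the source for splits, so a row-addressed format would not have to carry a separate copy of the repartitioning logic.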
   
   ### Describe alternatives you've considered
   
   We can keep the current state, which means maintaining our own repartitioning 
logic and eventually just reusing `FileRange` to describe row splits.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

