[I] Return the "position" of rows in parquet files after performing a query. [datafusion]

via GitHub Tue, 05 Nov 2024 10:58:26 -0800


adamfaulkner-at opened a new issue, #13261:
URL: https://github.com/apache/datafusion/issues/13261


   ### Is your feature request related to a problem or challenge?
   
   Hello! I'm working on a database that uses the Delta Lake format with 
DataFusion as the query engine. I'd like to implement support for writing 
[deletion vectors](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vector-format) 
in Delta Lake when a row is deleted from my database. There's a very similar 
feature [in Iceberg](https://iceberg.apache.org/spec/#deletion-vectors) that 
seems to work in exactly the same way.
   
   The general idea is that a deletion vector encodes a bitmap for which rows 
in a parquet file are no longer valid and should be filtered out of any query 
results. That is, if the bit in position P is set, then the P'th row in the 
corresponding parquet file should be filtered out of query results.
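
To make the bitmap semantics concrete, here is a minimal sketch in plain Rust. It is not the Delta Lake or Iceberg encoding: the `DeletionVector` type and its methods are hypothetical names, a simple word-packed `Vec<u64>` stands in for the compressed bitmap those formats specify, and positions are assumed to be 0-based.

```rust
/// Simplified deletion vector: bit P set means row P of the file is deleted.
/// Assumption: a plain word-packed bitmap stands in for the compressed
/// bitmap encoding that the real table formats specify.
pub struct DeletionVector {
    words: Vec<u64>,
}

impl DeletionVector {
    pub fn new() -> Self {
        DeletionVector { words: Vec::new() }
    }

    /// Write side: mark the row at `position` (assumed 0-based) as deleted.
    pub fn delete(&mut self, position: u64) {
        let word = (position / 64) as usize;
        if word >= self.words.len() {
            // Grow the bitmap with zeroed words so the target bit exists.
            self.words.resize(word + 1, 0);
        }
        self.words[word] |= 1u64 << (position % 64);
    }

    /// Returns true if the row at `position` has been marked deleted.
    pub fn is_deleted(&self, position: u64) -> bool {
        let word = (position / 64) as usize;
        self.words
            .get(word)
            .map_or(false, |&w| (w >> (position % 64)) & 1 == 1)
    }

    /// Read side: keep only rows whose file position is not marked deleted.
    pub fn filter_rows<T>(&self, rows: Vec<T>) -> Vec<T> {
        rows.into_iter()
            .enumerate()
            .filter(|(i, _)| !self.is_deleted(*i as u64))
            .map(|(_, row)| row)
            .collect()
    }
}
```

The write side only needs `delete(position)`, which is why knowing each matching row's position in the file is the missing piece.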
   
   AFAICT, the APIs already exist to enable this on the read side (see 
[spiceai](https://github.com/spiceai/spiceai/pull/1891/files) for example), but 
it's challenging to implement this on the write side because there's no obvious 
way to get the position of a row in a parquet file. The best idea I've come up 
with is to always sort my parquet files prior to writing them, and use a 
function like `ROW_NUMBER` to figure out the positions of rows. It would be 
great if the parquet reader machinery could expose this information directly 
instead.
   
   ### Describe the solution you'd like
   
   I'm not sure what a good API would look like here, but one idea is that the 
parquet reader could expose a new option that returns row position 
information as a column with a special name. For example:
   
   ```rust
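   // Proposed API sketch: the config key and PARQUET_ROW_POSITION are
   // hypothetical names for the requested feature.
   let config = SessionConfig::default()
       .set_bool("datafusion.execution.parquet.include_row_position", true);
   let ctx = SessionContext::new_with_config(config);
   let record_batches = ctx
       .read_parquet("foo.parquet", ParquetReadOptions::default())
       .await?
       .filter(filters)?
       .select(vec![col(PARQUET_ROW_POSITION)])?
       .collect()
       .await?;
   // record_batches now contains the indexes of rows in "foo.parquet" that
   // match the provided filters.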
   ```
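
The semantics I have in mind can be modeled in plain Rust, independent of DataFusion. This is only a sketch: both function names are hypothetical and positions are assumed to be 0-based. The important property is that the position is attached in file order before the filter runs; numbering the rows after filtering loses it.

```rust
/// Positions (assumed 0-based, in file order) of the rows that satisfy
/// `predicate`. The position is attached before filtering, which is the
/// property a row-position column would have to preserve.
pub fn matching_positions<T>(rows: &[T], predicate: impl Fn(&T) -> bool) -> Vec<u64> {
    rows.iter()
        .enumerate()
        .filter(|(_, row)| predicate(*row))
        .map(|(pos, _)| pos as u64)
        .collect()
}

/// Counter-example: numbering after filtering always yields 0, 1, 2, ...
/// and no longer says where each row lives in the file.
pub fn positions_numbered_after_filter<T>(
    rows: &[T],
    predicate: impl Fn(&T) -> bool,
) -> Vec<u64> {
    rows.iter()
        .filter(|row| predicate(*row))
        .enumerate()
        .map(|(pos, _)| pos as u64)
        .collect()
}
```

Any implementation inside the reader would need to behave like the first function even when predicates are pushed down and row groups or pages are skipped.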
   
   Another potential API could be to provide an alternative table provider 
which augments a parquet file with row numbers, without breaking when 
predicates are pushed down.
   
   
   ### Describe alternatives you've considered
   
   I'm considering doing the equivalent of this SQL:
   
   ```
   SELECT row_number FROM
     (SELECT ROW_NUMBER() OVER (ORDER BY pk ASC) AS row_number,
             c1, c2 -- ... all columns relevant for filtering
      FROM table)
   WHERE filters;
   ```
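   I assume that, because the filters are applied only after the window 
function has numbered every row, they cannot be pushed down into the parquet 
scan, so indexes and pruning will not be used and this will likely not 
perform very well.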
   
   This requires that every file that I write be ordered by some `pk`. This is 
probably OK.
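
A small plain-Rust model shows why the ordering requirement matters. It is a sketch under assumptions: both function names are hypothetical, `pk` values are assumed unique, and file positions are assumed to be 0-based, while `ROW_NUMBER()` is 1-based.

```rust
/// Models `ROW_NUMBER() OVER (ORDER BY pk ASC)` followed by a filter:
/// returns the 1-based row numbers of the rows that match `predicate`.
pub fn row_numbers_by_pk(pks: &[i64], predicate: impl Fn(i64) -> bool) -> Vec<u64> {
    let mut sorted = pks.to_vec();
    sorted.sort();
    sorted
        .iter()
        .enumerate()
        .filter(|(_, pk)| predicate(**pk))
        .map(|(i, _)| i as u64 + 1)
        .collect()
}

/// Physical positions (assumed 0-based) of matching rows, in the order the
/// rows are actually stored in the file.
pub fn file_positions(pks: &[i64], predicate: impl Fn(i64) -> bool) -> Vec<u64> {
    pks.iter()
        .enumerate()
        .filter(|(_, pk)| predicate(**pk))
        .map(|(i, _)| i as u64)
        .collect()
}
```

For a file stored as `[10, 20, 30, 40]` and the filter `pk >= 30`, the row numbers are 3 and 4 and the positions are 2 and 3, so subtracting one recovers the positions. For a file stored as `[30, 10, 40, 20]`, the row numbers are still 3 and 4 but the positions are 0 and 2, so the workaround breaks.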
   
   ### Additional context
   
   I'm sure that the `delta-rs` and `iceberg-rust` projects will eventually 
want a feature like this. Neither project seems to be implementing deletion 
vector writes quite yet, but something like this will be highly useful.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

