alamb commented on issue #13261: URL: https://github.com/apache/datafusion/issues/13261#issuecomment-2460335893
> I agree with the assessment that the information must be coming from the file reader itself.

I also agree with this assessment. In general I am not sure a SQL level solution will work well. Some challenges:

- `ctx.read_parquet("foo.parquet")` may read the file in parallel, interleaving the rows
- `ctx.read_parquet("<directory>")` can read more than one file, and the row offset / position are per file

However, the DataFrame API you sketch out above seems reasonable and relatively small in scope.

The other systems I know of that support Delete Vectors (e.g. Vertica) basically have:

1. A special flag on the scan node (`ParquetExec` in DataFusion) that says to emit positions (in addition to potentially applying filters, etc.)
2. A special operator that knows how to take a stream of positions and encode them in whatever delete vector format there is.

So in DataFusion this might look more like adding a method to `TableProvider` such as `TableProvider::delete_from`, similar to https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html#method.insert_into

Each table provider would then implement whatever API it needs (which would likely involve positions, as you describe). This would allow DataFusion to handle the planning of DELETE.
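
For concreteness, here is a rough sketch of what such a hook might look like. This is purely hypothetical: `delete_from` does not exist on `TableProvider` today, and the types below are simplified stand-ins rather than DataFusion's real `Session` / `ExecutionPlan` APIs (the real method would presumably be async, like `insert_into`).

```rust
#![allow(dead_code)]
use std::sync::Arc;

/// Stand-in for an execution plan whose output stream carries the positions
/// (file path + row offset) of the rows matching the DELETE predicate.
trait ExecutionPlan: Send + Sync {}

/// Simplified stand-in for the planning session state.
struct SessionState;

/// Hypothetical sketch of a `delete_from` hook, mirroring the shape of the
/// existing `TableProvider::insert_into`: DataFusion plans the scan that
/// produces the positions of rows to delete, and the provider returns a plan
/// node that encodes those positions into its own delete vector format.
trait DeleteSupport {
    fn delete_from(
        &self,
        state: &SessionState,
        // Plan producing the (file, row position) pairs to delete.
        positions: Arc<dyn ExecutionPlan>,
    ) -> Result<Arc<dyn ExecutionPlan>, String>;
}

fn main() {}
```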