adamreeve commented on issue #15216:
URL: https://github.com/apache/datafusion/issues/15216#issuecomment-2739092901

   I had a go at seeing if I could use this callback-based configuration 
approach to integrate with encryption without DataFusion needing to know 
anything about Parquet encryption.
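
   As a minimal sketch of the general idea (not the code in the linked 
branch), the engine can store an opaque callback and apply it to the reader 
options it builds. Every name below (`ReaderOptions`, `ParquetConfig`, 
`configure_reader`, `build_reader_options`) is hypothetical and only stands in 
for the real DataFusion and arrow-rs types:

```rust
use std::sync::Arc;

// Hypothetical stand-in for a reader options type (the role that
// `ArrowReaderOptions` plays in arrow-rs). Only the callback knows that
// `decryption_key` has anything to do with encryption.
#[derive(Clone, Debug, Default, PartialEq)]
struct ReaderOptions {
    decryption_key: Option<Vec<u8>>,
}

// Hypothetical callback type: receives the options the engine built and
// returns the options that should actually be used.
type ReaderOptionsCallback = Arc<dyn Fn(ReaderOptions) -> ReaderOptions + Send + Sync>;

// Hypothetical engine-side config. The callback is stored opaquely, so the
// engine never needs to understand what it changes.
#[derive(Clone, Default)]
struct ParquetConfig {
    configure_reader: Option<ReaderOptionsCallback>,
}

// Engine-side code: build default options, then let the callback adjust them.
fn build_reader_options(config: &ParquetConfig) -> ReaderOptions {
    let options = ReaderOptions::default();
    match &config.configure_reader {
        Some(callback) => callback(options),
        None => options,
    }
}

fn main() {
    // User-side code: the only place that mentions a decryption key.
    let key = b"0123456789012345".to_vec();
    let config = ParquetConfig {
        configure_reader: Some(Arc::new(move |mut options: ReaderOptions| {
            options.decryption_key = Some(key.clone());
            options
        })),
    };

    let options = build_reader_options(&config);
    println!("decryption key set: {}", options.decryption_key.is_some());
}
```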
   
   I tested with Rok's [current encryption 
branch](https://github.com/rok/arrow-rs/tree/a522ab105fd7f2b8d4ef227b2b75489a93a137af),
 and tried both reading and writing encrypted files. I wrote an example that 
nearly works here: 
https://github.com/adamreeve/datafusion/blob/encryption_poc/datafusion-examples/examples/parquet_encryption.rs
   
   One obvious downside is that this isn't compatible with converting plans 
to protobuf. For now I've just ignored the new config fields there. Ideally we 
would at least raise an error if they're set when a plan is converted to 
protobuf, but that might require adding a "parquet" feature to the protobuf 
crates.
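
   To illustrate the "raise an error" option, here is a small sketch that 
refuses to serialize a config when a callback is set. It doesn't use the real 
protobuf crates; `ParquetConfig`, `SerializableConfig`, `SerializeError` and 
`to_serializable` are all hypothetical names:

```rust
use std::sync::Arc;

// Hypothetical opaque callback; its signature doesn't matter here, only
// that a closure cannot be written into a protobuf message.
type ReaderOptionsCallback = Arc<dyn Fn() + Send + Sync>;

// Hypothetical config with one plain field and one callback field.
struct ParquetConfig {
    batch_size: usize,
    configure_reader: Option<ReaderOptionsCallback>,
}

// Hypothetical stand-in for the protobuf message: plain data only.
#[derive(Debug, PartialEq)]
struct SerializableConfig {
    batch_size: usize,
}

#[derive(Debug, PartialEq)]
enum SerializeError {
    // Carries the name of the field that cannot be serialized.
    NotSerializable(&'static str),
}

// Fail loudly instead of silently dropping the callback.
fn to_serializable(config: &ParquetConfig) -> Result<SerializableConfig, SerializeError> {
    if config.configure_reader.is_some() {
        return Err(SerializeError::NotSerializable("configure_reader"));
    }
    Ok(SerializableConfig {
        batch_size: config.batch_size,
    })
}

fn main() {
    let config = ParquetConfig {
        batch_size: 8192,
        configure_reader: Some(Arc::new(|| {})),
    };
    println!("{:?}", to_serializable(&config));
}
```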
   
   I think this is fine, though, as long as you don't need to use it in a 
distributed query engine.
   
   I could get writing of encrypted Parquet working, but reading fails during 
schema inference, because that step uses a `ParquetMetaDataReader` 
([here](https://github.com/apache/datafusion/blob/19dd46d3c65e3160c0fb949880bc3e960434e96d/datafusion/datasource-parquet/src/file_format.rs#L765)),
 which doesn't know about the `ArrowReaderOptions` and instead takes 
`FileDecryptionProperties` directly 
([here](https://github.com/apache/arrow-rs/blob/660a3ac22a8ef8601acf4548d65146bc623f653a/parquet/src/file/metadata/reader.rs#L80)).
   
   I'm not sure how best to work around that. Maybe arrow-rs could be 
refactored to support reader options that aren't Arrow-specific and that could 
be passed to the `ParquetMetaDataReader`?
   
   Or maybe DataFusion could change how metadata is read and use 
`ParquetObjectReader::get_metadata_with_options` instead? I made a brief 
attempt at that but didn't get very far. There is a comment on 
`fetch_parquet_metadata` though that says "This component is a subject to 
**change** in near future"...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

