adamreeve commented on issue #15216:
URL: https://github.com/apache/datafusion/issues/15216#issuecomment-2852529965

   > Here is how spark does encryption configuration
   
   My understanding of how this works in Spark from reading this and looking at 
some of the code:
   * Spark requires specifying a class used to generate file encryption and/or 
decryption properties. This is configured with the 
`spark.hadoop.parquet.crypto.factory.class` config setting, and the class needs 
to implement `EncryptionPropertiesFactory` and/or `DecryptionPropertiesFactory` 
to generate file encryption or decryption properties as required. The class 
gets access to extra context like the file schema so it knows what columns to 
provide keys for (see the 
[getFileEncryptionProperties](https://github.com/apache/parquet-java/blob/142bff02b09c468783f11f452b3dec9174c56a2a/parquet-hadoop/src/main/java/org/apache/parquet/crypto/EncryptionPropertiesFactory.java#L94-L110)
 method).
   * Spark supports using the KMS based API by providing a built-in 
`PropertiesDrivenCryptoFactory` class that implements 
`EncryptionPropertiesFactory` and `DecryptionPropertiesFactory`. This requires 
also specifying a `KmsClient` implementation with the 
`spark.hadoop.parquet.encryption.kms.client.class` key, and this class must be 
defined by users (only a mock `InMemoryKMS` class is provided for testing).
   * In theory a user could also define their own class that implements 
`EncryptionPropertiesFactory` and `DecryptionPropertiesFactory` if they don't 
want to use the KMS based API, for example if they want to define AES keys 
directly.
   
   Starting with similarly flexible `EncryptionPropertiesFactory` and 
`DecryptionPropertiesFactory` traits in DataFusion seems like a reasonable 
approach to me.
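
   To make that concrete, here is a minimal sketch of what such traits might look like in Rust. All names here are hypothetical, and the properties and context structs are placeholders standing in for whatever types DataFusion/parquet-rs would actually expose (Spark's factory gets the file path and schema as context, so the sketch mirrors that):

```rust
use std::sync::Arc;

// Placeholder types standing in for the real Parquet file
// encryption/decryption property types.
#[derive(Debug, Clone, PartialEq)]
pub struct FileEncryptionProperties {
    pub footer_key: Vec<u8>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct FileDecryptionProperties {
    pub footer_key: Vec<u8>,
}

/// Context passed to the factory, analogous to Spark giving
/// `getFileEncryptionProperties` access to the file path and schema
/// so it knows which columns to provide keys for.
pub struct EncryptionContext {
    pub file_path: String,
    pub column_names: Vec<String>,
}

pub trait EncryptionPropertiesFactory: Send + Sync {
    fn file_encryption_properties(
        &self,
        ctx: &EncryptionContext,
    ) -> Option<FileEncryptionProperties>;
}

pub trait DecryptionPropertiesFactory: Send + Sync {
    fn file_decryption_properties(
        &self,
        file_path: &str,
    ) -> Option<FileDecryptionProperties>;
}

/// A trivial factory that provides a single static AES key directly:
/// the non-KMS use case mentioned in the last bullet above.
pub struct StaticKeyFactory {
    key: Vec<u8>,
}

impl EncryptionPropertiesFactory for StaticKeyFactory {
    fn file_encryption_properties(
        &self,
        _ctx: &EncryptionContext,
    ) -> Option<FileEncryptionProperties> {
        Some(FileEncryptionProperties {
            footer_key: self.key.clone(),
        })
    }
}

fn main() {
    let factory: Arc<dyn EncryptionPropertiesFactory> =
        Arc::new(StaticKeyFactory { key: vec![0u8; 16] });
    let ctx = EncryptionContext {
        file_path: "part-0.parquet".into(),
        column_names: vec!["a".into(), "b".into()],
    };
    let props = factory.file_encryption_properties(&ctx).unwrap();
    println!("footer key len = {}", props.footer_key.len());
}
```

   A KMS-backed implementation would then be just another implementor of the same traits, like Spark's `PropertiesDrivenCryptoFactory`.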
   
   I'm not that familiar with Java, but from what I understand it's 
straightforward to define your own `KmsClient` in a JAR and include it at 
runtime so it's discoverable by the configuration mechanism. This approach 
doesn't really translate to Rust though: any custom code will need to be 
compiled in, unless we use something like WebAssembly or an FFI, which seems 
overly complicated and unnecessary. We could keep some level of 
string-configurability by letting users statically register named 
implementations of traits in code and then reference these in configuration 
strings. Corwin mentioned the `typetag` crate, which can automate this, or it 
could be done more manually.
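
   The "more manual" registration option could be as simple as a name-to-factory map that users populate at startup and that configuration strings reference. This is only a sketch under assumed names (the registry type, config key, and `name` method are all hypothetical):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical trait; in practice this would be the factory trait
// discussed above, with methods for producing encryption properties.
pub trait EncryptionPropertiesFactory: Send + Sync {
    /// The name users would put in a configuration string.
    fn name(&self) -> &str;
}

/// A registry mapping configuration names to compiled-in factories.
/// Users register their implementations in code, then select one via
/// a config string, e.g. `parquet.encryption_factory = "my_kms"`.
#[derive(Default)]
pub struct FactoryRegistry {
    factories: RwLock<HashMap<String, Arc<dyn EncryptionPropertiesFactory>>>,
}

impl FactoryRegistry {
    pub fn register(&self, factory: Arc<dyn EncryptionPropertiesFactory>) {
        self.factories
            .write()
            .unwrap()
            .insert(factory.name().to_string(), factory);
    }

    /// Look up the factory named in a configuration string.
    pub fn get(&self, name: &str) -> Option<Arc<dyn EncryptionPropertiesFactory>> {
        self.factories.read().unwrap().get(name).cloned()
    }
}

// A user-defined factory, compiled into the application.
struct MyKmsFactory;
impl EncryptionPropertiesFactory for MyKmsFactory {
    fn name(&self) -> &str {
        "my_kms"
    }
}

fn main() {
    let registry = FactoryRegistry::default();
    registry.register(Arc::new(MyKmsFactory));
    // A config string like "my_kms" now resolves to the registered factory.
    assert!(registry.get("my_kms").is_some());
    assert!(registry.get("unknown").is_none());
}
```

   `typetag` would automate the name-to-type dispatch via serde instead of an explicit registry, but the explicit version keeps the mechanism visible.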
   
   > I personally suggest using the `Arc<dyn Any>` approach
   
   I don't really understand the reason for using `Any` rather than a trait 
object like `Arc<dyn EncryptionPropertiesFactory>`. At some point an `Any` 
would need to be downcast to something that DataFusion understands for it to 
be usable, right? But I agree we should come up with an example of how we'd 
like this to work, which should provide more clarity.
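
   To illustrate the downcast point: an `Arc<dyn Any>` is opaque until it is downcast to a concrete type the caller already knows about, whereas an `Arc<dyn Trait>` can be used directly. The trait and type names below are hypothetical:

```rust
use std::any::Any;
use std::sync::Arc;

trait EncryptionPropertiesFactory {
    fn describe(&self) -> String;
}

struct StaticKeyFactory;
impl EncryptionPropertiesFactory for StaticKeyFactory {
    fn describe(&self) -> String {
        "static-key".to_string()
    }
}

fn main() {
    // Trait object: usable without knowing the concrete type.
    let as_trait: Arc<dyn EncryptionPropertiesFactory> = Arc::new(StaticKeyFactory);
    assert_eq!(as_trait.describe(), "static-key");

    // `Any`: must be downcast to the concrete type before anything can be
    // done with it, which requires the downcasting code to name that type
    // anyway, so the type-erasure doesn't buy much here.
    let as_any: Arc<dyn Any> = Arc::new(StaticKeyFactory);
    let concrete = as_any
        .downcast_ref::<StaticKeyFactory>()
        .expect("unexpected concrete type");
    assert_eq!(concrete.describe(), "static-key");
}
```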


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

