Hi,

Since the PR for the high-level C++ Parquet encryption API appears stalled (https://github.com/apache/arrow/pull/8023), I'm looking into exposing the low-level Parquet encryption API to Python.

Arguments for doing this: the low-level API is all that the users I'm talking to need at the moment, so it's plausible that others would also benefit from PyArrow exposing low-level Parquet encryption. Then again, it might only be this one company, and no one else cares.

The arguments against, per Gidon Gershinsky:

> * security: low-level encryption API is easy to misuse (eg giving the same
> keys for a number of different files; this'd break the AES GCM cipher). The
> high-level encryption layer handles that by applying envelope encryption and
> other best practices in data security. Also, this layer is maintained by the
> community, meaning that future improvements and security fixes can be
> upstreamed by anyone, and available to all.
> * compatibility: parquet-mr implements the high-level encryption layer. If
> we want the files produced by Spark/Presto/etc to be readable by
> pandas/PyArrow (and vice versa), we need to provide the Arrow users with the
> high-level API.
> ...
>
> The current situation is not ideal, it'd be good to merge the high-level PR
> (and maybe hide the low level), but here we are; also, C++ is a kind of a
> low-level language; Python would expose it to a less experienced audience.

(Source: https://issues.apache.org/jira/browse/ARROW-8040)

I find the compatibility argument less compelling; it can readily be addressed by documentation. I am not a crypto expert, so I can't evaluate how risky exposing the low-level encryption APIs would be, but I can see how that would be a significant concern.

Some options are:

* Status quo: no Python API for low-level Parquet encryption. This has two possible outcomes:
  * Eventually the high-level API gets merged and gets a Python binding.
  * The high-level encryption API is never merged, and Python users never get access to encryption.
* Add the low-level Parquet encryption API to PyArrow, perhaps using the "hazmat" idiom from the Python cryptography package (an API namespace that signals "use at your own risk, this is dangerous", e.g. `cryptography.hazmat.primitives.ciphers.aead.ChaCha20Poly1305`).
  * Gidon Gershinsky did not find this suggestion compelling enough to override his security concerns.
* Low-level encryption done as a third-party Python package, either private or open source. This is technically annoying, and would plausibly require maintaining a fork.

Any other ideas? Thoughts on these options?

-Itamar