Hi,

Since the PR for the high-level C++ Parquet encryption API appears stalled
(https://github.com/apache/arrow/pull/8023), I'm looking into exposing the
low-level Parquet encryption API to Python.

Arguments for doing this: the low-level API is all that the users I'm talking
to need at the moment, so it's plausible others would also benefit from
PyArrow exposing low-level Parquet encryption. Then again, it might only be
this one company, and no one else cares.

The arguments against, per Gidon Gershinsky:

>  * security: low-level encryption API is easy to misuse (eg giving the same 
> keys for a number of different files; this'd break the AES GCM cipher). The 
> high-level encryption layer handles that by applying envelope encryption and 
> other best practices in data security. Also, this layer is maintained by the 
> community, meaning that future improvements and security fixes can be 
> upstreamed by anyone, and available to all.
>  * compatibility: parquet-mr implements the high-level encryption layer. If 
> we want the files produced by Spark/Presto/etc to be readable by 
> pandas/PyArrow (and vice versa), we need to provide the Arrow users with the 
> high-level API. 
> ...
> 
> The current situation is not ideal, it'd be good to merge the high-level PR 
> (and maybe hide the low level), but here we are; also, C++ is a kind of a 
> low-level language; Python would expose it to a less experienced audience.

(Source: https://issues.apache.org/jira/browse/ARROW-8040)

I find the compatibility argument less compelling, since that can readily be
addressed by documentation. I am not a crypto expert, so I can't evaluate how
risky exposing the low-level encryption APIs would be, but I can see how that
would be a significant concern.
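
To make the quoted key-reuse concern a bit more concrete, here is a toy sketch
of what can go wrong. It assumes that reusing a key across files ends up
repeating a (key, nonce) pair, which is my reading of the concern rather than
something the quote states. The cipher is a hash-based XOR keystream I made up
for illustration (standard library only), not AES-GCM and not anything in the
Parquet API; it only mimics the counter-mode property that the same key and
nonce produce the same keystream.

```python
import hashlib
import os


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy counter-mode keystream from SHA-256 (a stand-in, NOT AES-GCM)."""
    out = b""
    counter = 0
    while len(out) < length:
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        out += block
        counter += 1
    return out[:length]


def xor(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, truncating to the shorter one."""
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    """Encrypt by XORing the plaintext with the keystream."""
    return xor(plaintext, keystream(key, nonce, len(plaintext)))


key = os.urandom(32)
nonce = bytes(12)  # the mistake: same key AND same nonce used for two files

file_a = b"salary=100000;id=1"
file_b = b"salary=250000;id=2"
ct_a = toy_encrypt(key, nonce, file_a)
ct_b = toy_encrypt(key, nonce, file_b)

# The keystream cancels out, so an observer without the key learns the XOR
# of the two plaintexts.
leak = xor(ct_a, ct_b)

# If one plaintext is known or guessable, the other falls out directly.
recovered_b = xor(leak, file_a)
```

With a fresh nonce per file the two keystreams differ and the XOR of the
ciphertexts reveals nothing useful in this toy model; as I understand it, the
high-level layer avoids the problem differently, by not reusing data keys
across files in the first place.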

Some options are:
 * Status quo, no Python API for low-level Parquet encryption. This has two 
possible outcomes:
   * Eventually high-level API gets merged, gets Python binding.
   * High-level encryption API is never merged, Python users never get access 
to encryption.
 * Add low-level Parquet encryption API to PyArrow, perhaps using the "hazmat"
idiom of the Python cryptography package (an API namespace that signals "use
at your own risk, this is dangerous", e.g.
`cryptography.hazmat.primitives.ciphers.aead.ChaCha20Poly1305`).
   * Gidon Gershinsky did not find this suggestion compelling enough to 
override his security concerns.
 * Low-level encryption done as a third-party Python package, either private
or open source. This is technically annoying, and would plausibly require
maintaining a fork.

Any other ideas? Thoughts on these options?

—Itamar
