My thoughts:
1.  I've lost track of where the higher-level encryption implementation in
C++ stands.  I think we were trying to reach consensus on the
threading/thread-safety model?

2.  I'm open to exposing the lower-level encryption libraries in Python
(with appropriate namespacing/communication).  At least for reading, there
seems to be less potential for harm (with the caveat that I'm not a
security expert).  Are both the low-level read and write implementations
necessary?  (It probably makes sense to have a few smaller PRs for exposing
this functionality anyway.)



On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
ita...@pythonspeed.com> wrote:

> Hi,
>
> Since the PR for high-level C++ Parquet encryption API appears stalled (
> https://github.com/apache/arrow/pull/8023), I'm looking into exposing the
> low-level Parquet encryption API to Python.
>
> Arguments for doing this: the low-level API is all that the users I'm
> talking to need at the moment, so it's plausible others would also find some
> benefit in having the PyArrow API expose low-level Parquet encryption. Then
> again, it might only be this one company and no one else cares.
>
> The arguments against, per Gidon Gershinsky:
>
> >  * security: low-level encryption API is easy to misuse (e.g. giving the
> > same keys for a number of different files; this'd break the AES-GCM
> > cipher). The high-level encryption layer handles that by applying envelope
> > encryption and other best practices in data security. Also, this layer is
> > maintained by the community, meaning that future improvements and security
> > fixes can be upstreamed by anyone, and available to all.
> >  * compatibility: parquet-mr implements the high-level encryption layer.
> > If we want the files produced by Spark/Presto/etc to be readable by
> > pandas/PyArrow (and vice versa), we need to provide the Arrow users with
> > the high-level API.
> > ...
> >
> > The current situation is not ideal, it'd be good to merge the high-level
> > PR (and maybe hide the low level), but here we are; also, C++ is a kind of
> > a low-level language; Python would expose it to a less experienced audience.
>
> (Source: https://issues.apache.org/jira/browse/ARROW-8040)
>
> I find the compatibility argument less compelling; that's readily
> addressed by documentation. I am not a crypto expert so I can't evaluate
> how risky exposing the low-level encryption APIs would be, but I can see
> how that would be a significant concern.
>
> Some options are:
>  * Status quo, no Python API for low-level Parquet encryption. This has
> two possible outcomes:
>    * Eventually high-level API gets merged, gets Python binding.
>    * High-level encryption API is never merged, Python users never get
> access to encryption.
>  * Add low-level Parquet encryption API to PyArrow, perhaps using the
> "hazmat" idiom used by the Python cryptography package (an API namespace
> indicating "use at your own risk, this is dangerous", basically, e.g.
> `cryptography.hazmat.primitives.ciphers.aead.ChaCha20Poly1305`).
>    * Gidon Gershinsky did not find this suggestion compelling enough to
> override his security concerns.
>  * Low-level encryption done as a third-party Python package, either
> private or open source. This is technically annoying and would plausibly
> require maintaining a fork.
> Any other ideas? Thoughts on these options?
>
> —Itamar
