Hi Gidon,

On 16/02/2021 at 16:42, Gidon Gershinsky wrote:
> Regarding the high-level layer, I think it is waiting on progress at
> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
> No activity there since last November. This is unfortunate, because Tham
> has put a lot of work into coding the high-level layer (and addressing 200+
> review comments) in the PR https://github.com/apache/arrow/pull/8023. The
> code is functional, compatible with the Java version in parquet-mr, and can
> be updated with the threading changes in the doc above. I hope all this
> good work will not be wasted.

I'm sorry for the possibly frustrating process.  Looking at the PR,
though, it seems a bunch of comments were not addressed.  Is it possible
to go through them and ensure they get an answer and/or a resolution?

Best regards

Antoine.



> 
> Cheers, Gidon
> 
> 
> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> 
>> My thoughts:
>> 1.  I've lost track of the higher level encryption implementation in C++.
>> I think we were trying to come to a consensus on the threading/thread
>> safety model?
>>
>> 2.  I'm open to exposing the lower level encryption libraries in python
>> (with appropriate namespacing/communication).  It seems at least for
>> reading, there is potentially less harm (I'll caveat that with I'm not a
>> security expert).  Are both the low level read and write implementations
>> necessary?  (it probably makes sense to have a few smaller PRs for exposing
>> this functionality anyway).
>>
>>
>>
>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
>> ita...@pythonspeed.com> wrote:
>>
>>> Hi,
>>>
>>> Since the PR for high-level C++ Parquet encryption API appears stalled (
>>> https://github.com/apache/arrow/pull/8023), I'm looking into exposing the
>>> low-level Parquet encryption API to Python.
>>>
>>> Arguments for doing this: the low-level API is all the users I'm talking
>>> to need, at the moment, so it's plausible others would also find some
>>> benefit in having the Pyarrow API expose low-level Parquet encryption. Then
>>> again, it might only be this one company and no one else cares.
>>>
>>> The arguments against, per Gidon Gershinsky:
>>>
>>>>  * security: low-level encryption API is easy to misuse (e.g. giving the
>>> same keys for a number of different files; this'd break the AES GCM
>>> cipher). The high-level encryption layer handles that by applying envelope
>>> encryption and other best practices in data security. Also, this layer is
>>> maintained by the community, meaning that future improvements and security
>>> fixes can be upstreamed by anyone, and available to all.
>>>>  * compatibility: parquet-mr implements the high-level encryption layer.
>>> If we want the files produced by Spark/Presto/etc to be readable by
>>> pandas/PyArrow (and vice versa), we need to provide the Arrow users with
>>> the high-level API.
>>>> ...
>>>>
>>>> The current situation is not ideal, it'd be good to merge the high-level
>>> PR (and maybe hide the low level), but here we are; also, C++ is a kind of
>>> a low-level language; Python would expose it to a less experienced audience.
>>>
>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040)
>>>
>>> I find the compatibility argument less compelling; that's readily
>>> addressed by documentation. I am not a crypto expert so I can't evaluate
>>> how risky exposing the low-level encryption APIs would be, but I can see
>>> how that would be a significant concern.
>>>
>>> Some options are:
>>>  * Status quo, no Python API for low-level Parquet encryption. This has
>>> two possible outcomes:
>>>    * Eventually high-level API gets merged, gets Python binding.
>>>    * High-level encryption API is never merged, Python users never get
>>> access to encryption.
>>>  * Add low-level Parquet encryption API to Pyarrow, perhaps using "hazmat"
>>> idiom used by the Python cryptography package (API namespace indicating
>>> "use at your own risk, this is dangerous", basically, e.g.
>>> `cryptography.hazmat.primitives.ciphers.aead.ChaCha20Poly1305`).
>>>    * Gidon Gershinsky did not find this suggestion compelling enough to
>>> override his security concerns.
>>>  * Low-level encryption done as third party Python package, either private
>>> or open source. This is annoying technically, plausibly would require
>>> maintaining a fork.
>>> Any other ideas? Thoughts on these options?
>>>
>>> —Itamar
>>
> 
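
The security argument quoted in this thread contrasts reusing one key across many files with envelope encryption. Below is a minimal sketch of the envelope idea, assuming the Python `cryptography` package is installed. It is a generic illustration, not the parquet-mr or Arrow implementation; the function names `encrypt_file_bytes` and `decrypt_file_bytes`, the dictionary layout, and the use of a file identifier as associated data are all hypothetical choices made for this example.

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def encrypt_file_bytes(master_key: bytes, plaintext: bytes, file_id: bytes) -> dict:
    """Encrypt one file's bytes under a fresh data key, then wrap that key.

    Hypothetical helper for illustration only.
    """
    # A new data encryption key (DEK) per file, so no single key
    # ever protects more than one file's contents.
    dek = AESGCM.generate_key(bit_length=128)

    # Encrypt the data with the DEK; file_id is bound as associated data.
    data_nonce = os.urandom(12)
    ciphertext = AESGCM(dek).encrypt(data_nonce, plaintext, file_id)

    # Wrap (encrypt) the DEK with the master key. Only this small wrapped
    # key is ever processed under the master key.
    wrap_nonce = os.urandom(12)
    wrapped_dek = AESGCM(master_key).encrypt(wrap_nonce, dek, file_id)

    return {
        "data_nonce": data_nonce,
        "ciphertext": ciphertext,
        "wrap_nonce": wrap_nonce,
        "wrapped_dek": wrapped_dek,
    }


def decrypt_file_bytes(master_key: bytes, envelope: dict, file_id: bytes) -> bytes:
    """Unwrap the per-file DEK with the master key, then decrypt the data."""
    dek = AESGCM(master_key).decrypt(
        envelope["wrap_nonce"], envelope["wrapped_dek"], file_id
    )
    return AESGCM(dek).decrypt(
        envelope["data_nonce"], envelope["ciphertext"], file_id
    )


if __name__ == "__main__":
    master = AESGCM.generate_key(bit_length=256)
    env = encrypt_file_bytes(master, b"column data", b"file-1")
    print(decrypt_file_bytes(master, env, b"file-1"))
```

In a real deployment the master key would typically live in a key management service rather than in process memory; that part is omitted here.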
