Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit :
> From the doc,
> "To maintain consistency with the style of parquet-cpp, the above
> structures should not be explicitly synchronized with individual mutexes.
> In the case of a parquet::arrow::FileReader, the request to read a given
> selection of row groups and columns is issued from a single main thread.
> Note that this does require that all keys required for a read are assembled
> on the main thread so that DecryptionKeyRetriever objects are not directly
> accessing any caches"
> 
> The current PR code doesn't require a single main thread. Any thread can
> read any file, both footer and pages. So the key cache is shared, to save
> N-1 interactions with the KMS server.

I don't think there's any contention on this.  IMHO the only concerns
are about the implementation, not the semantics.

Best regards

Antoine.


> 
> Cheers, Gidon
> 
> 
> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <anto...@python.org> wrote:
> 
>>
>> I'm not sure a threading model is expected for an encryption layer.  Am
>> I missing something?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit :
>>> Precisely, the main change is in the threading model. Afaik, the document
>>> proposes a model that fits pandas, but might be problematic for other
>> users
>>> of this library.
>>> Technically, this is not showstopper though; if the community decides on
>>> this model, it will be compatible with the high-level encryption design;
>>> but the change implementation would need to be done by pandas experts
>> (not
>>> us; but we'll help where we can).
>>> Micah, you know this subject (and the community) better than we do - we'd
>>> much appreciate it if you'd take a lead on removing this roadblock.
>>>
>>> Cheers, Gidon
>>>
>>>
>>> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> I think some of the comments might be conflicting.  One of the concerns
>>>> (that I would need to refresh myself on to offer an opinion which was
>>>> covered in Ben's doc) was the threading model we expect in the library.
>>>>
>>>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <anto...@python.org>
>> wrote:
>>>>
>>>>>
>>>>> Hi Gidon,
>>>>>
>>>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :
>>>>>> Regarding the high-level layer, I think it waits for a progress at
>>>>>>
>>>>>
>>>>
>> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
>>>>>> No activity there since last November. This is unfortunate, because
>>>> Tham
>>>>>> has put a lot of work in coding the high-level layer (and addressing
>>>> 200+
>>>>>> review comments) in the PR https://github.com/apache/arrow/pull/8023.
>>>>> The
>>>>>> code is functional, compatible with the Java version in parquet-mr,
>> and
>>>>> can
>>>>>> be updated with the threading changes in the doc above. I hope all
>> this
>>>>>> good work will not be wasted.
>>>>>
>>>>> I'm sorry for the possibly frustrating process.  Looking at the PR,
>>>>> though, it seems a bunch of comments were not addressed.  Is it
>> possible
>>>>> to go through them and ensure they get an answer and/or a resolution?
>>>>>
>>>>> Best regards
>>>>>
>>>>> Antoine.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Cheers, Gidon
>>>>>>
>>>>>>
>>>>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <
>> emkornfi...@gmail.com
>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>> My thoughts:
>>>>>>> 1.  I've lost track of the higher level encryption implementation in
>>>>> C++.
>>>>>>> I think we were trying to come to a consensus on the threading/thread
>>>>>>> safety model?
>>>>>>>
>>>>>>> 2.  I'm open to exposing the lower level encryption libraries in
>>>> python
>>>>>>> (without appropriate namespacing/communication).  It seems at least
>>>> for
>>>>>>> reading, there is potentially less harm (I'll caveat that with I'm
>>>> not a
>>>>>>> security expert).  Are both the low level read and write
>>>> implementations
>>>>>>> necessary?  (it probably makes sense to have a few smaller PRs for
>>>>> exposing
>>>>>>> this functionality anyways).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
>>>>>>> ita...@pythonspeed.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Since the PR for high-level C++ Parquet encryption API appears
>>>> stalled
>>>>> (
>>>>>>>> https://github.com/apache/arrow/pull/8023), I'm looking into
>>>> exposing
>>>>>>> the
>>>>>>>> low-level Parquet encryption API to Python.
>>>>>>>>
>>>>>>>> Arguments for doing this: the low-level API is all the users I'm
>>>>> talking
>>>>>>>> to need, at the moment, so it's plausible others would also find
>> some
>>>>>>>> benefit in having the Pyarrow API expose low-level Parquet
>>>> encryption.
>>>>>>> Then
>>>>>>>> again, it might only be this one company and no one else cares.
>>>>>>>>
>>>>>>>> The arguments against, per Gidon Gershinsky:
>>>>>>>>
>>>>>>>>>  * security: low-level encryption API is easy to misuse (eg giving
>>>> the
>>>>>>>> same keys for a number of different files; this'd break the AES GCM
>>>>>>>> cipher). The high-level encryption layer handles that by applying
>>>>>>> envelope
>>>>>>>> encryption and other best practices in data security. Also, this
>>>> layer
>>>>> is
>>>>>>>> maintained by the community, meaning that future improvements and
>>>>>>> security
>>>>>>>> fixes can be upstreamed by anyone, and available to all.
>>>>>>>>>  * compatibility: parquet-mr implements the high-level encryption
>>>>>>> layer.
>>>>>>>> If we want the files produced by Spark/Presto/etc to be readable by
>>>>>>>> pandas/PyArrow (and vice versa), we need to provide the Arrow users
>>>>> with
>>>>>>>> the high-level API.
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> The current situation is not ideal, it'd be good to merge the
>>>>>>> high-level
>>>>>>>> PR (and maybe hide the low level), but here we are; also, C++ is a
>>>> kind
>>>>>>> of
>>>>>>>> a low-level language; Python would expose it to a less experienced
>>>>>>> audience.
>>>>>>>>
>>>>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040)
>>>>>>>>
>>>>>>>> I find the compatibility argument less compelling, that's readily
>>>>>>>> addressed by documentation. I am not a crypto expert so I can't
>>>>> evaluate
>>>>>>>> how risky exposing the low-level encryption APIs would be, but I can
>>>>> see
>>>>>>>> how that would be a significant concern.
>>>>>>>>
>>>>>>>> Some options are:
>>>>>>>>  * Status quo, no Python API for low-level Parquet encryption. This
>>>> has
>>>>>>>> two possible outcomes:
>>>>>>>>    * Eventually high-level API gets merged, gets Python binding.
>>>>>>>>    * High-level encryption API is never merged, Python users never
>>>> get
>>>>>>>> access to encryption.
>>>>>>>>  * Add low-level Parquet encryption API to Pyarrow, perhaps using
>>>>>>> "hazmat"
>>>>>>>> idiom used by the Python cryptography package (API namespace
>>>> indicating
>>>>>>>> "use at your own risk, this is dangerous", basically, e.g.
>>>>>>>> `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`).
>>>>>>>>    * Gidon Gershinsky did not find this suggestion compelling enough
>>>> to
>>>>>>>> override his security concerns.
>>>>>>>>  * Low-level encryption done as third party Python package, either
>>>>>>> private
>>>>>>>> or open source. This is annoying technically, plausibly would
>> require
>>>>>>>> maintaining a fork.
>>>>>>>> Any other ideas? Thoughts on these options?
>>>>>>>>
>>>>>>>> —Itamar
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 

Reply via email to