It would be useful for outsiders to have an explanation of what those two
API levels are, and to what usage each corresponds.
Is Parquet encryption meant to be used only with Spark?  While Spark
interoperability is important, Parquet files are more ubiquitous than that.

Regards

Antoine.


On 03/09/2020 at 22:31, Gidon Gershinsky wrote:
> Why would the low-level API be exposed directly? This would break the
> interop between the two analytic ecosystems down the road.
> Again, let me suggest leveraging the high-level interface, based on the
> PropertiesDrivenCryptoFactory.
> It should address your technical requirements; if it doesn't, we can
> discuss the gaps.
> All questions are welcome.
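> 
> For reference, a minimal PySpark sketch of driving this high-level
> interface (a sketch only: the property names follow parquet-mr's key
> tools as I understand them, and the KMS client class is a placeholder
> you would replace with your own implementation):
> ```
> from pyspark.sql import SparkSession
> 
> spark = SparkSession.builder.getOrCreate()
> hconf = spark.sparkContext._jsc.hadoopConfiguration()
> # Select the properties-driven factory from parquet-mr's key tools.
> hconf.set("parquet.crypto.factory.class",
>           "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> # Site-specific KMS client implementation (placeholder class name).
> hconf.set("parquet.encryption.kms.client.class", "com.example.MyKmsClient")
> # Map master key IDs to the columns they protect, plus a footer key.
> hconf.set("parquet.encryption.column.keys", "key1:ssn,salary")
> hconf.set("parquet.encryption.footer.key", "key2")
> 
> df = spark.createDataFrame([("123-45-6789", 100000)], ["ssn", "salary"])
> df.write.parquet("/tmp/table.parquet.encrypted")
> ```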
> 
> Cheers, Gidon
> 
> 
> On Thu, Sep 3, 2020 at 10:11 PM Roee Shlomo <roe...@gmail.com> wrote:
> 
>> Hi Itamar,
>>
>> I implemented some Python wrappers for the low-level API and would be
>> happy to collaborate on that. The reason I haven't pushed this forward yet
>> is what Gidon mentioned: the API to expose to Python users needs to be
>> finalized first, and it must include the key tools API for interop with
>> Spark.
>>
>> Perhaps it would be good to kick off a discussion on what the pyarrow API
>> for PME should look like (in parallel to reviewing the arrow-cpp
>> implementation of key-tools, to ensure that wrapping it would be a
>> reasonable effort).
>>
>> One possible approach is to expose both the low-level API and the key
>> tools separately. A user would create and initialize a
>> PropertiesDrivenCryptoFactory and use it to create the
>> FileEncryptionProperties/FileDecryptionProperties to pass to the
>> lower-level API. In pandas this would translate to something like:
>> ```
>> import pandas as pd
>>
>> # PropertiesDrivenCryptoFactory here is the proposed pyarrow wrapper
>> # around the C++ key tools; the exact names are not finalized.
>> factory = PropertiesDrivenCryptoFactory(...)
>> df.to_parquet(path, engine="pyarrow",
>>               encryption=factory.getFileEncryptionProperties(...))
>> df = pd.read_parquet(path, engine="pyarrow",
>>                      decryption=factory.getFileDecryptionProperties(...))
>> ```
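>>
>> For the part where the low-level API is exposed separately, a plain
>> pyarrow sketch might look like this (keyword names such as
>> `encryption_properties` are hypothetical, not a finalized API):
>> ```
>> import pyarrow as pa
>> import pyarrow.parquet as pq
>>
>> factory = PropertiesDrivenCryptoFactory(...)  # proposed wrapper, as above
>> table = pa.table({"ssn": ["123-45-6789"], "salary": [100000]})
>> # Hypothetical keyword arguments; the final pyarrow names may differ.
>> pq.write_table(table, path,
>>                encryption_properties=factory.getFileEncryptionProperties(...))
>> table = pq.read_table(path,
>>                       decryption_properties=factory.getFileDecryptionProperties(...))
>> ```
>>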
>> This should also work for reading datasets, since decryption uses a
>> KeyRetriever, but I'm not sure what will need to be done once datasets
>> support writing.
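>>
>> For example, dataset reads could plausibly look like this (hypothetical
>> keyword, same caveats as above):
>> ```
>> import pyarrow.parquet as pq
>>
>> # The decryption properties carry a KeyRetriever, so every file in the
>> # dataset can be decrypted through the same properties object.
>> dataset = pq.ParquetDataset(
>>     path, decryption_properties=factory.getFileDecryptionProperties(...))
>> table = dataset.read()
>> ```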
>>
>> On 2020/09/03 14:11:51, "Itamar Turner-Trauring" <ita...@pythonspeed.com>
>> wrote:
>>> Hi,
>>>
>>> I'm looking into implementing this, and it seems like there are two
>>> parts: packaging, but also wrapping the APIs in Python. Is the latter
>>> accurate? If so, are there any examples of similar existing wrapped
>>> APIs, or should I just come up with something on my own?
>>>
>>> Context:
>>> https://github.com/apache/arrow/pull/4826
>>> https://issues.apache.org/jira/browse/ARROW-8040
>>>
>>> Thanks,
>>>
>>> —Itamar
>>
> 
