Hi Gidon,
Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit : > Regarding the high-level layer, I think it waits for a progress at > https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing > No activity there since last November. This is unfortunate, because Tham > has put a lot of work in coding the high-level layer (and addressing 200+ > review comments) in the PR https://github.com/apache/arrow/pull/8023. The > code is functional, compatible with the Java version in parquet-mr, and can > be updated with the threading changes in the doc above. I hope all this > good work will not be wasted. I'm sorry for the possibly frustrating process. Looking at the PR, though, it seems a bunch of comments were not addressed. Is it possible to go through them and ensure they get an answer and/or a resolution? Best regards Antoine. > > Cheers, Gidon > > > On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> My thoughts: >> 1. I've lost track of the higher level encryption implementation in C++. >> I think we were trying to come to a consensus on the threading/thread >> safety model? >> >> 2. I'm open to exposing the lower level encryption libraries in python >> (without appropriate namespacing/communication). It seems at least for >> reading, there is potentially less harm (I'll caveat that with I'm not a >> security expert). Are both the low level read and write implementations >> necessary? (it probably makes sense to have a few smaller PRs for exposing >> this functionality anyways). >> >> >> >> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring < >> ita...@pythonspeed.com> wrote: >> >>> Hi, >>> >>> Since the PR for high-level C++ Parquet encryption API appears stalled ( >>> https://github.com/apache/arrow/pull/8023), I'm looking into exposing >> the >>> low-level Parquet encryption API to Python. >>> >>> Arguments for doing this: the low-level API is all the users I'm talking >>> to need, at the moment, so it's plausible others would also find some >>> benefit in having the Pyarrow API expose low-level Parquet encryption. >> Then >>> again, it might only be this one company and no one else cares. >>> >>> The arguments against, per Gidon Gershinsky: >>> >>>> * security: low-level encryption API is easy to misuse (eg giving the >>> same keys for a number of different files; this'd break the AES GCM >>> cipher). The high-level encryption layer handles that by applying >> envelope >>> encryption and other best practices in data security. Also, this layer is >>> maintained by the community, meaning that future improvements and >> security >>> fixes can be upstreamed by anyone, and available to all. >>>> * compatibility: parquet-mr implements the high-level encryption >> layer. >>> If we want the files produced by Spark/Presto/etc to be readable by >>> pandas/PyArrow (and vice versa), we need to provide the Arrow users with >>> the high-level API. >>>> ... >>>> >>>> The current situation is not ideal, it'd be good to merge the >> high-level >>> PR (and maybe hide the low level), but here we are; also, C++ is a kind >> of >>> a low-level language; Python would expose it to a less experienced >> audience. >>> >>> (Source: https://issues.apache.org/jira/browse/ARROW-8040) >>> >>> I find the compatibility argument less compelling, that's readily >>> addressed by documentation. I am not a crypto expert so I can't evaluate >>> how risky exposing the low-level encryption APIs would be, but I can see >>> how that would be a significant concern. >>> >>> Some options are: >>> * Status quo, no Python API for low-level Parquet encryption. This has >>> two possible outcomes: >>> * Eventually high-level API gets merged, gets Python binding. >>> * High-level encryption API is never merged, Python users never get >>> access to encryption. >>> * Add low-level Parquet encryption API to Pyarrow, perhaps using >> "hazmat" >>> idiom used by the Python cryptography package (API namespace indicating >>> "use at your own risk, this is dangerous", basically, e.g. >>> `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`). >>> * Gidon Gershinsky did not find this suggestion compelling enough to >>> override his security concerns. >>> * Low-level encryption done as third party Python package, either >> private >>> or open source. This is annoying technically, plausibly would require >>> maintaining a fork. >>> Any other ideas? Thoughts on these options? >>> >>> —Itamar >> >