I must say I'm impressed with the level of constructiveness and technical quality in this discussion, we're off to a good start in this project.
*For POC, I think what you conclude is mostly correct, I am currently implementing the encryption spec, general encrypted file stream with KMS API, and I would expect the low level file encryption integration to take place separately and we can meet in the middle. For key rotation and AAD, I think we can discuss more details in the doc first before proceeding forward, they are not blocking tasks anyway.* Sounds good to me on all points. Lets indeed work on our respective parts in the POC, coordinating the design (when needed) via the doc. Once we need a deeper coordination, we might set up a meeting, but I agree this is not pressing, can wait a few weeks. *“there is an intermediate approach, where (the many) DEKs are encrypted with (a few) KEKs, and stored inside manifest files (key_metadata fields) - this can be immutable, as long as the KEKs are encrypted with MEKs and stored in a mutable medium that can be replaced/updated upon MEK rotation.”* *That is doable as we store the KEKs in spec. In that case, a MEK rotation would perform a spec update. But it implies KEK is static just like DEK, and we will only rotate MEK and not rotate KEK. I thought we also need to rotate KEKs that’s why I did not consider this approach. I do not have enough experience in a double-wrap system, but does the security standard still hold in this case without KEK rotation? Or is there a separated process to handle KEK rotation?* It's one of those borderline areas... I have an opinion on this, but to be on the safe side, I'll send a question to the community - maybe there are folks who can shed additional light on the trade-offs in this situation, or can point us to other contacts or sources. Cheers, Gidon On Wed, Mar 24, 2021 at 11:46 PM Ye, Jack <yzhao...@amazon.com.invalid> wrote: > Sounds good, lets continue with some discussions through the doc. For POC, > I think what you conclude is mostly correct, I am currently implementing > the encryption spec, general encrypted file stream with KMS API, and I > would expect the low level file encryption integration to take place > separately and we can meet in the middle. For key rotation and AAD, I think > we can discuss more details in the doc first before proceeding forward, > they are not blocking tasks anyway. > > > > “there is an intermediate approach, where (the many) DEKs are encrypted > with (a few) KEKs, and stored inside manifest files (key_metadata fields) - > this can be immutable, as long as the KEKs are encrypted with MEKs and > stored in a mutable medium that can be replaced/updated upon MEK rotation.” > > > > That is doable as we store the KEKs in spec. In that case, a MEK rotation > would perform a spec update. But it implies KEK is static just like DEK, > and we will only rotate MEK and not rotate KEK. I thought we also need to > rotate KEKs that’s why I did not consider this approach. I do not have > enough experience in a double-wrap system, but does the security standard > still hold in this case without KEK rotation? Or is there a separated > process to handle KEK rotation? > > > > “(1) is a direct DEK passing; we've considered it for Parquet, but decided > against it, because it can lead to unsafe situations” > > > > Nice, I think I also mentioned in the doc that I am against using this > scheme, so we can focus more on supporting the single and double wrapping > use case. > > > > -Jack > > > > > > *From: *Gidon Gershinsky <gg5...@gmail.com> > *Reply-To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org> > *Date: *Wednesday, March 24, 2021 at 05:19 > *To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org> > *Subject: *RE: [EXTERNAL] Extending Apache Iceberg Encryption Module > > > > *CAUTION*: This email originated from outside of the organization. Do not > click links or open attachments unless you can confirm the sender and know > the content is safe. > > > > Sounds good, thanks. > > Responding to the points below: > > > > *"we can choose to store the encrypted DEKs inside the manifest or as a > separated instruction file with a pointer in key_metadata, and there are > tradeoffs for those approaches"* > > > > For the latter, we are running a similar mechanism in Parquet encryption, > where we keep the key material in separate json files, and a pointer to it > inside the parquet file footer key_metadata fields. This works; but for > Iceberg integration, there are advantages in using the manifest files (or > other managed medium) instead. The trade-offs (inc size additions, > consistency, management) TBD. > > Btw, there is a intermediate approach, where (the many) DEKs are encrypted > with (a few) KEKs, and stored inside manifest files (key_metadata fields) - > this can be immutable, as long as the KEKs are encrypted with MEKs and > stored in a mutable medium that can be replaced/updated upon MEK rotation. > > > > *"3 common cases: (1) direct DEK ID, (2) KEK ID + encrypted DEK, (3) MEK > ID + encrypted KEK + encrypted DEK, and that should be enough to cover most > of the use cases with different types of KMS"* > > > > Yep, (2) and (3) are the single and double wrapping, respectively, which > covers our usecases; (1) is a direct DEK passing; we've considered it for > Parquet, but decided against it, because it can lead to unsafe situations > where an inexperienced user will pass the same DEK to many files (which can > break the GCM cipher, even with one table). But we might try to enable it > in Iceberg with strong preventive measures (if possible), TBD. > > > > *"DDL clauses for encryption and key rotation* > > *These definitely make sense to me. I will add a list of the DDL clauses I > was thinking about to the doc.* > > *Cryptographic integrity of Data Tables* > > *Yes, I think in this doc at least the location and structure of AAD > prefix should be discussed, so hopefully we can reach some general > consensus for integrity support for Iceberg tables and make sure the right > information is in place or can be added later."* > > > > SGTM. > > > > *"I am also working on a POC to flush out some details for the aspects > described in the doc, I will update in this thread once I publish that."* > > > > We too work on a POC of this technology. I guess we're working at > different corners at the moment, as we're mostly focused on Parquet > encryption integration, parts of key rotation and on GCM streams with AAD > Prefixes for table integrity; while you probably are working on the Catalog > metadata, general encrypted file streams and key management API. But since > there is a high potential for overlaps, I'd suggest we'd coordinate the POC > work; what would be the best way of doing that? > > > > Cheers, Gidon > > > > > > On Tue, Mar 23, 2021 at 11:50 PM Jack Ye <yezhao...@gmail.com> wrote: > > Thanks for the feedback to the doc, we are also closely following the > Parquet encryption work and would like to have that in Iceberg as soon as > possible with the right architecture. Here are some brief thoughts for the > points you mentioned in the email, I will add more details in the google > doc: > > > > *Key rotation* > > My initial thought was to consider key rotation as a separated process and > DEK rewrapping can be done with a Spark stored procedure, that's why I did > not add any detail for it. But your point about the work needed to rewrite > and clean up manifests is a really good point that I should fully describe > the details. > > For instance, we can choose to store the encrypted DEKs inside the > manifest or as a separated instruction file with a pointer in key_metadata, > and there are tradeoffs for those approaches. I will update the doc for > these details. > > > > *Acceleration of KMS interactions* > > Thanks for bringing up double wrapping, I was hesitant to mention that in > the initial version of the doc because it would add complexity for > understanding the overall architecture. And for the use cases I have seen > with AWS KMS, people are all using single-wrapping and the service was able > to handle generation of millions of DEKs, and it seems like there was no > complaint about it. > > I think the right way to go is to support the 3 common cases: (1) direct > DEK ID, (2) KEK ID + encrypted DEK, (3) MEK ID + encrypted KEK + encrypted > DEK, and that should be enough to cover most of the use cases with > different types of KMS. I will update the encryption spec with more details > on that. > > > > *DDL clauses for encryption and key rotation* > > These definitely make sense to me. I will add a list of the DDL clauses I > was thinking about to the doc. > > > > *Cryptographic integrity of Data Tables* > > Yes, I think in this doc at least the location and structure of AAD prefix > should be discussed, so hopefully we can reach some general consensus for > integrity support for Iceberg tables and make sure the right information is > in place or can be added later. > > > > I am also working on a POC to flush out some details for the aspects > described in the doc, I will update in this thread once I publish that. > > > > Best, > > Jack Ye > > > > On Tue, Mar 23, 2021 at 5:04 AM Gidon Gershinsky <gg5...@gmail.com> wrote: > > Hi Jack, > > > > We're working on Parquet encryption, which is about to be released in the > upcoming parquet-mr-1.12 version. Recently, we've started to look into its > integration in Iceberg. It became immediately clear we need to take a wider > view that covers other types of encryption in Iceberg (file streams and > ORC); otherwise, we'd end up with a number of silos. > > At the time, there was no top-down design for data encryption in Iceberg, > so we've started to tinker with it. But now we can base this on your > document. I really liked it, a solid foundation. > > > > There are a number of high-level concepts I believe we'd need to add there: > > > > - Key rotation in Iceberg > > (Not just in KMS). The envelope encryption practice requires periodic (or > on-demand) re-wrapping of DEKs with new versions of master keys. KMS > generates the new versions, and keeps the master key history, but the > re-wrapped DEKs need to be updated in Iceberg metadata. If key_metadata is > kept in manifest files, this means all manifest files must be deleted > (because they keep DEKs wrapped with the previous master key version, which > is not safe anymore), and created again with the updated key_metadata > field. We've quickly discussed this with Anton, seems to be feasible, but > there are other alternatives. We need to decide if manifests are the right > place to store all key_metadata; and to design a mechanism (potentially > with a DDL clause) to perform the rotation operation. > > > > - Acceleration of KMS interactions > > KMSs can be very slow, especially when backed by HSMs. Per the doc, "The > KEK is stored in a key management service (KMS) to control access and key > rotation." We should not fetch secret keys from KMS, because this exposes > them; instead, many KMSs allow to wrap/encrypt DEKs inside the KMS server, > without ever exposing the master keys. But since we have to generate a DEK > per file/column, we'll end up with many KMS wrap calls when writing the > data (and many unwrap calls when reading the data). That's why Parquet > encryption uses a concept of double wrapping, where DEKs are wrapped with > KEKs, and KEKs are wrapped with master keys (MEKs). Only MEKs are > stored/managed inside KMS. > > > > - DDL clauses for encryption and key rotation, such as > > ALTER TABLE .. KEY_ROTATION (params) > > ALTER TABLE .. ENCRYPT (params): encrypts existing table (with plaintext > files) - Russell's proposal > > CREATE TABLE ... ENCRYPTION (params) ; or simply use the TBLPROPERTIES > > Btw, we can re-use the joint ORC/Parquet column encryption parameter > format, defined in this jira discussion started by Xinli - > > https://issues.apache.org/jira/browse/HIVE-21848 > > > > - Cryptographic integrity of Data Tables > > Besides protecting data confidentiality, we need to protect its integrity > against tampering attacks. This one is a longer term work item, based on > these tickets: > > https://github.com/apache/iceberg/issues/44, > https://github.com/apache/iceberg/issues/2060, > https://github.com/apache/iceberg/issues/2073 > > We'll work on these at a later stage, after the confidentiality basis is > ready; but we need to make sure the current work on confidentiality enables > (or at least doesn't block) the future integrity work. For example, we can > start using https://github.com/apache/iceberg/issues/2060 sooner rather > than later, for encrypting the Iceberg metadata files and Avro data files. > > > > That was a high level description, I'll add detailed comments inside the > design googledoc. > > > > Cheers, Gidon > > > > > > On Mon, Mar 22, 2021 at 7:25 PM Jack Ye <yezhao...@gmail.com> wrote: > > Hi everyone, > > > > To continue the discussion in the last sync meeting about encryption in > Iceberg, here is the document for a proposal: > > > > > https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4/edit?usp=sharing > > > > Would be very appreciated for any feedback. > > > > Best, > > Jack Ye > >