I must say I'm impressed with the level of constructiveness and
technical quality in this discussion; we're off to a good start in this
project.

*For POC, I think what you conclude is mostly correct, I am currently
implementing the encryption spec, general encrypted file stream with KMS
API, and I would expect the low level file encryption integration to take
place separately and we can meet in the middle. For key rotation and AAD, I
think we can discuss more details in the doc first before proceeding
forward, they are not blocking tasks anyway.*

Sounds good to me on all points. Let's indeed work on our respective parts
of the POC, coordinating the design (when needed) via the doc. Once we need
deeper coordination, we might set up a meeting, but I agree this is not
pressing and can wait a few weeks.

*“there is an intermediate approach, where (the many) DEKs are encrypted
with (a few) KEKs, and stored inside manifest files (key_metadata fields) -
this can be immutable, as long as the KEKs are encrypted with MEKs and
stored in a mutable medium that can be replaced/updated upon MEK rotation.”*

*That is doable as we store the KEKs in the spec. In that case, a MEK rotation
would perform a spec update. But it implies KEK is static just like DEK,
and we will only rotate MEK and not rotate KEK. I thought we also need to
rotate KEKs, which is why I did not consider this approach. I do not have
enough experience with double-wrap systems, but does the security standard
still hold in this case without KEK rotation? Or is there a separate
process to handle KEK rotation?*

It's one of those borderline areas... I have an opinion on this, but to be
on the safe side, I'll send a question to the community - maybe there are
folks who can shed additional light on the trade-offs in this situation, or
can point us to other contacts or sources.
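
To make the intermediate approach concrete, here is a rough Python sketch of
the data flow. It is only an illustration under assumptions: `ToyKms`,
`toy_wrap`, `kek_store` and `manifest` are hypothetical names, not Iceberg or
Parquet APIs, and the XOR-based wrap is an insecure stand-in for a real
key-wrap cipher or KMS call. It shows wrapped DEKs sitting in an immutable
manifest while a MEK rotation rewrites only the small, mutable KEK store. The
KEK itself is not rotated here, which is exactly the open question above.

```python
import hashlib
import os


def toy_wrap(wrapping_key: bytes, key: bytes) -> bytes:
    """Toy, INSECURE stand-in for a real key wrap (assumption for illustration)."""
    pad = hashlib.sha256(wrapping_key).digest()[: len(key)]
    return bytes(a ^ b for a, b in zip(key, pad))


toy_unwrap = toy_wrap  # XOR with the same pad is its own inverse


class ToyKms:
    """Hypothetical in-memory KMS: MEK versions never leave this object."""

    def __init__(self):
        self._meks = [os.urandom(16)]  # MEK version history
        self.calls = 0                 # counts wrap/unwrap round trips

    def rotate(self):
        self._meks.append(os.urandom(16))

    def wrap(self, key):
        self.calls += 1
        version = len(self._meks) - 1  # always wrap with the newest MEK
        return version, toy_wrap(self._meks[version], key)

    def unwrap(self, version, wrapped):
        self.calls += 1
        return toy_unwrap(self._meks[version], wrapped)


kms = ToyKms()

# Mutable medium: a few KEKs, each wrapped by the KMS with the current MEK.
kek = os.urandom(16)
kek_store = {"kek-1": kms.wrap(kek)}

# Immutable manifest entries: one DEK per file, wrapped locally with the KEK,
# so writing 1000 files costs no additional KMS calls.
deks = {f"file-{i}.parquet": os.urandom(16) for i in range(1000)}
manifest = {path: ("kek-1", toy_wrap(kek, dek)) for path, dek in deks.items()}
manifest_before = dict(manifest)

# MEK rotation: re-wrap only the KEKs; the manifest is never touched.
kms.rotate()
for kek_id, (version, wrapped_kek) in list(kek_store.items()):
    kek_store[kek_id] = kms.wrap(kms.unwrap(version, wrapped_kek))


def read_dek(path):
    """Reader path: one KMS unwrap for the KEK, then a local unwrap of the DEK."""
    kek_id, wrapped_dek = manifest[path]
    version, wrapped_kek = kek_store[kek_id]
    return toy_unwrap(kms.unwrap(version, wrapped_kek), wrapped_dek)
```

In this sketch a reader can still recover every DEK after the rotation, and
the rotation cost scales with the number of KEKs rather than the number of
files.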

Cheers, Gidon


On Wed, Mar 24, 2021 at 11:46 PM Ye, Jack <yzhao...@amazon.com.invalid>
wrote:

> Sounds good, let's continue with some discussions through the doc. For POC,
> I think what you conclude is mostly correct, I am currently implementing
> the encryption spec, general encrypted file stream with KMS API, and I
> would expect the low level file encryption integration to take place
> separately and we can meet in the middle. For key rotation and AAD, I think
> we can discuss more details in the doc first before proceeding forward,
> they are not blocking tasks anyway.
>
>
>
> “there is an intermediate approach, where (the many) DEKs are encrypted
> with (a few) KEKs, and stored inside manifest files (key_metadata fields) -
> this can be immutable, as long as the KEKs are encrypted with MEKs and
> stored in a mutable medium that can be replaced/updated upon MEK rotation.”
>
>
>
> That is doable as we store the KEKs in the spec. In that case, a MEK rotation
> would perform a spec update. But it implies KEK is static just like DEK,
> and we will only rotate MEK and not rotate KEK. I thought we also need to
> rotate KEKs, which is why I did not consider this approach. I do not have
> enough experience with double-wrap systems, but does the security standard
> still hold in this case without KEK rotation? Or is there a separate
> process to handle KEK rotation?
>
>
>
> “(1) is a direct DEK passing; we've considered it for Parquet, but decided
> against it, because it can lead to unsafe situations”
>
>
>
> Nice, I think I also mentioned in the doc that I am against using this
> scheme, so we can focus more on supporting the single and double wrapping
> use case.
>
>
>
> -Jack
>
>
>
>
>
> *From: *Gidon Gershinsky <gg5...@gmail.com>
> *Reply-To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org>
> *Date: *Wednesday, March 24, 2021 at 05:19
> *To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org>
> *Subject: *RE: [EXTERNAL] Extending Apache Iceberg Encryption Module
>
>
>
> Sounds good, thanks.
>
> Responding to the points below:
>
>
>
> *"we can choose to store the encrypted DEKs inside the manifest or as a
> separate instruction file with a pointer in key_metadata, and there are
> tradeoffs for those approaches"*
>
>
>
> For the latter, we are running a similar mechanism in Parquet encryption,
> where we keep the key material in separate json files, and a pointer to it
> inside the parquet file footer key_metadata fields. This works; but for
> Iceberg integration, there are advantages in using the manifest files (or
> other managed medium) instead. The trade-offs (incl. size additions,
> consistency, management) are TBD.
>
> Btw, there is an intermediate approach, where (the many) DEKs are encrypted
> with (a few) KEKs, and stored inside manifest files (key_metadata fields) -
> this can be immutable, as long as the KEKs are encrypted with MEKs and
> stored in a mutable medium that can be replaced/updated upon MEK rotation.
>
>
>
> *"3 common cases: (1) direct DEK ID, (2) KEK ID + encrypted DEK, (3) MEK
> ID + encrypted KEK + encrypted DEK, and that should be enough to cover most
> of the use cases with different types of KMS"*
>
>
>
> Yep, (2) and (3) are the single and double wrapping, respectively, which
> cover our use cases; (1) is a direct DEK passing; we've considered it for
> Parquet, but decided against it, because it can lead to unsafe situations
> where an inexperienced user will pass the same DEK to many files (which can
> break the GCM cipher, even with one table). But we might try to enable it
> in Iceberg with strong preventive measures (if possible), TBD.
>
>
>
> *"DDL clauses for encryption and key rotation*
>
> *These definitely make sense to me. I will add a list of the DDL clauses I
> was thinking about to the doc.*
>
> *Cryptographic integrity of Data Tables*
>
> *Yes, I think in this doc at least the location and structure of AAD
> prefix should be discussed, so hopefully we can reach some general
> consensus for integrity support for Iceberg tables and make sure the right
> information is in place or can be added later."*
>
>
>
> SGTM.
>
>
>
> *"I am also working on a POC to flesh out some details for the aspects
> described in the doc, I will update in this thread once I publish that."*
>
>
>
> We are also working on a POC of this technology. I guess we're working at
> different corners at the moment, as we're mostly focused on Parquet
> encryption integration, parts of key rotation and on GCM streams with AAD
> Prefixes for table integrity; while you probably are working on the Catalog
> metadata, general encrypted file streams and key management API. But since
> there is a high potential for overlaps, I'd suggest we'd coordinate the POC
> work; what would be the best way of doing that?
>
>
>
> Cheers, Gidon
>
>
>
>
>
> On Tue, Mar 23, 2021 at 11:50 PM Jack Ye <yezhao...@gmail.com> wrote:
>
> Thanks for the feedback on the doc. We are also closely following the
> Parquet encryption work and would like to have that in Iceberg as soon as
> possible with the right architecture. Here are some brief thoughts on the
> points you mentioned in the email; I will add more details in the Google
> doc:
>
>
>
> *Key rotation*
>
> My initial thought was to treat key rotation as a separate process, with
> DEK rewrapping done by a Spark stored procedure, which is why I did not add
> any detail for it. But your point about the work needed to rewrite and
> clean up manifests is a good one, and I should fully describe the
> details.
>
> For instance, we can choose to store the encrypted DEKs inside the
> manifest or as a separate instruction file with a pointer in key_metadata,
> and there are tradeoffs for those approaches. I will update the doc for
> these details.
>
>
>
> *Acceleration of KMS interactions*
>
> Thanks for bringing up double wrapping, I was hesitant to mention that in
> the initial version of the doc because it would add complexity for
> understanding the overall architecture. And for the use cases I have seen
> with AWS KMS, people are all using single-wrapping and the service was able
> to handle generation of millions of DEKs, and it seems like there was no
> complaint about it.
>
> I think the right way to go is to support the 3 common cases: (1) direct
> DEK ID, (2) KEK ID + encrypted DEK, (3) MEK ID + encrypted KEK + encrypted
> DEK, and that should be enough to cover most of the use cases with
> different types of KMS. I will update the encryption spec with more details
> on that.
>
>
>
> *DDL clauses for encryption and key rotation*
>
> These definitely make sense to me. I will add a list of the DDL clauses I
> was thinking about to the doc.
>
>
>
> *Cryptographic integrity of Data Tables*
>
> Yes, I think in this doc at least the location and structure of AAD prefix
> should be discussed, so hopefully we can reach some general consensus for
> integrity support for Iceberg tables and make sure the right information is
> in place or can be added later.
>
>
>
> I am also working on a POC to flesh out some details for the aspects
> described in the doc; I will update this thread once I publish that.
>
>
>
> Best,
>
> Jack Ye
>
>
>
> On Tue, Mar 23, 2021 at 5:04 AM Gidon Gershinsky <gg5...@gmail.com> wrote:
>
> Hi Jack,
>
>
>
> We're working on Parquet encryption, which is about to be released in the
> upcoming parquet-mr-1.12 version. Recently, we've started to look into its
> integration in Iceberg. It became immediately clear we need to take a wider
> view that covers other types of encryption in Iceberg (file streams and
> ORC); otherwise, we'd end up with a number of silos.
>
> At the time, there was no top-down design for data encryption in Iceberg,
> so we've started to tinker with it. But now we can base this on your
> document. I really liked it, a solid foundation.
>
>
>
> There are a number of high-level concepts I believe we'd need to add there:
>
>
>
> - Key rotation in Iceberg
>
> (Not just in KMS). The envelope encryption practice requires periodic (or
> on-demand) re-wrapping of DEKs with new versions of master keys. KMS
> generates the new versions, and keeps the master key history, but the
> re-wrapped DEKs need to be updated in Iceberg metadata. If key_metadata is
> kept in manifest files, this means all manifest files must be deleted
> (because they keep DEKs wrapped with the previous master key version, which
> is not safe anymore), and created again with the updated key_metadata
> field. We've quickly discussed this with Anton; it seems to be feasible, but
> there are other alternatives. We need to decide if manifests are the right
> place to store all key_metadata; and to design a mechanism (potentially
> with a DDL clause) to perform the rotation operation.
>
>
>
> - Acceleration of KMS interactions
>
> KMSs can be very slow, especially when backed by HSMs. Per the doc, "The
> KEK is stored in a key management service (KMS) to control access and key
> rotation." We should not fetch secret keys from KMS, because this exposes
> them; instead, many KMSs allow wrapping/encrypting DEKs inside the KMS server,
> without ever exposing the master keys. But since we have to generate a DEK
> per file/column, we'll end up with many KMS wrap calls when writing the
> data (and many unwrap calls when reading the data). That's why Parquet
> encryption uses a concept of double wrapping, where DEKs are wrapped with
> KEKs, and KEKs are wrapped with master keys (MEKs). Only MEKs are
> stored/managed inside KMS.
>
>
>
> - DDL clauses for encryption and key rotation, such as
>
> ALTER TABLE .. KEY_ROTATION (params)
>
> ALTER TABLE .. ENCRYPT (params): encrypts existing table (with plaintext
> files) - Russell's proposal
>
> CREATE TABLE ... ENCRYPTION (params) ; or simply use the TBLPROPERTIES
>
> Btw, we can re-use the joint ORC/Parquet column encryption parameter
> format, defined in this jira discussion started by Xinli -
>
> https://issues.apache.org/jira/browse/HIVE-21848
>
>
>
> - Cryptographic integrity of Data Tables
>
> Besides protecting data confidentiality, we need to protect its integrity
> against tampering attacks. This one is a longer term work item, based on
> these tickets:
>
> https://github.com/apache/iceberg/issues/44,
> https://github.com/apache/iceberg/issues/2060,
> https://github.com/apache/iceberg/issues/2073
>
> We'll work on these at a later stage, after the confidentiality basis is
> ready; but we need to make sure the current work on confidentiality enables
> (or at least doesn't block) the future integrity work. For example, we can
> start using https://github.com/apache/iceberg/issues/2060 sooner rather
> than later, for encrypting the Iceberg metadata files and Avro data files.
>
>
>
> That was a high-level description; I'll add detailed comments inside the
> design Google doc.
>
>
>
> Cheers, Gidon
>
>
>
>
>
> On Mon, Mar 22, 2021 at 7:25 PM Jack Ye <yezhao...@gmail.com> wrote:
>
> Hi everyone,
>
>
>
> To continue the discussion in the last sync meeting about encryption in
> Iceberg, here is the document for a proposal:
>
>
>
>
> https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4/edit?usp=sharing
>
>
>
> Any feedback would be much appreciated.
>
>
>
> Best,
>
> Jack Ye
>
>
