Hi Jack,

We're working on Parquet encryption, which is about to be released in the
upcoming parquet-mr-1.12 version. Recently, we've started to look into its
integration in Iceberg. It became immediately clear we need to take a wider
view that covers other types of encryption in Iceberg (file streams and
ORC); otherwise, we'd end up with a number of silos.
At the time, there was no top-down design for data encryption in Iceberg,
so we've started to tinker with it. But now we can base this on your
document. I really liked it; it's a solid foundation.

There are a number of high-level concepts I believe we'd need to add there:

- Key rotation in Iceberg
(Not just in KMS). The envelope encryption practice requires periodic (or
on-demand) re-wrapping of DEKs with new versions of master keys. KMS
generates the new versions, and keeps the master key history, but the
re-wrapped DEKs need to be updated in Iceberg metadata. If key_metadata is
kept in manifest files, this means all manifest files must be deleted
(because they keep DEKs wrapped with the previous master key version, which
is not safe anymore), and created again with the updated key_metadata
field. We briefly discussed this with Anton; it seems feasible, but
there are other alternatives. We need to decide if manifests are the right
place to store all key_metadata; and to design a mechanism (potentially
with a DDL clause) to perform the rotation operation.
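
To make the rotation flow above concrete, here is a minimal Python sketch,
not a proposal for the actual implementation. Everything in it is an
assumption for illustration: ToyKms is a hypothetical in-memory stand-in
for a KMS that wraps and unwraps inside the server, toy_wrap/toy_unwrap are
an insecure placeholder for a real key-wrap cipher, and a manifest is
modelled as a plain list of dicts rather than an Iceberg manifest file.

```python
import hashlib
import os


def toy_wrap(wrapping_key: bytes, key: bytes) -> bytes:
    """NOT secure: placeholder for a real key-wrap cipher (e.g. AES-GCM)."""
    nonce = os.urandom(12)
    stream = hashlib.shake_256(wrapping_key + nonce).digest(len(key))
    return nonce + bytes(a ^ b for a, b in zip(key, stream))


def toy_unwrap(wrapping_key: bytes, blob: bytes) -> bytes:
    nonce, body = blob[:12], blob[12:]
    stream = hashlib.shake_256(wrapping_key + nonce).digest(len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


class ToyKms:
    """Hypothetical KMS: keeps the master key history, never exposes keys."""

    def __init__(self):
        self._versions = [os.urandom(16)]

    def rotate(self):
        # The KMS generates a new master key version and keeps the old ones.
        self._versions.append(os.urandom(16))

    def wrap(self, dek: bytes):
        # Always wraps with the latest version; returns (version, wrapped DEK).
        version = len(self._versions) - 1
        return version, toy_wrap(self._versions[version], dek)

    def unwrap(self, version: int, wrapped: bytes) -> bytes:
        return toy_unwrap(self._versions[version], wrapped)


def rewrite_manifest(kms: ToyKms, manifest: list) -> list:
    """Build a replacement manifest with every DEK re-wrapped by the current
    master key version. Data files are untouched; only key_metadata changes."""
    rewritten = []
    for entry in manifest:
        version, wrapped = entry["key_metadata"]
        dek = kms.unwrap(version, wrapped)
        rewritten.append({"file": entry["file"], "key_metadata": kms.wrap(dek)})
    return rewritten


kms = ToyKms()
deks = {f"data-{i}.parquet": os.urandom(16) for i in range(3)}
manifest = [{"file": f, "key_metadata": kms.wrap(k)} for f, k in deks.items()]

kms.rotate()                               # new master key version in the KMS
rotated = rewrite_manifest(kms, manifest)  # old manifest would then be deleted
```

The point of the sketch is that rotation costs one unwrap and one wrap per
DEK plus a rewrite of every manifest that carries key_metadata, which is
why where we store key_metadata matters.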

- Acceleration of KMS interactions
KMSs can be very slow, especially when backed by HSMs. Per the doc, "The
KEK is stored in a key management service (KMS) to control access and key
rotation." We should not fetch secret keys from KMS, because this exposes
them; instead, many KMSs allow wrapping/encrypting DEKs inside the KMS server,
without ever exposing the master keys. But since we have to generate a DEK
per file/column, we'll end up with many KMS wrap calls when writing the
data (and many unwrap calls when reading the data). That's why Parquet
encryption uses a concept of double wrapping, where DEKs are wrapped with
KEKs, and KEKs are wrapped with master keys (MEKs). Only MEKs are
stored/managed inside KMS.
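
A rough Python sketch of why double wrapping cuts down on KMS round trips,
under stated assumptions: CountingKms is a hypothetical in-memory KMS that
only counts calls, toy_wrap/toy_unwrap are an insecure placeholder for a
real cipher, one KEK is shared by all files of a single write, and the
reader caches unwrapped KEKs. None of this mirrors the actual Parquet key
material format.

```python
import hashlib
import os


def toy_wrap(wrapping_key: bytes, key: bytes) -> bytes:
    """NOT secure: placeholder for a real key-wrap cipher (e.g. AES-GCM)."""
    nonce = os.urandom(12)
    stream = hashlib.shake_256(wrapping_key + nonce).digest(len(key))
    return nonce + bytes(a ^ b for a, b in zip(key, stream))


def toy_unwrap(wrapping_key: bytes, blob: bytes) -> bytes:
    nonce, body = blob[:12], blob[12:]
    stream = hashlib.shake_256(wrapping_key + nonce).digest(len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


class CountingKms:
    """Hypothetical KMS: the MEK stays inside; each call is a slow round trip."""

    def __init__(self):
        self._mek = os.urandom(16)
        self.calls = 0

    def wrap(self, key: bytes) -> bytes:
        self.calls += 1
        return toy_wrap(self._mek, key)

    def unwrap(self, blob: bytes) -> bytes:
        self.calls += 1
        return toy_unwrap(self._mek, blob)


def write_single_wrapped(kms, deks):
    # Single wrapping: one KMS call per DEK.
    return [kms.wrap(dek) for dek in deks]


def write_double_wrapped(kms, deks):
    # Double wrapping: DEKs are wrapped locally with a KEK,
    # and only the KEK is wrapped by the KMS (one call).
    kek = os.urandom(16)
    wrapped_kek = kms.wrap(kek)
    return [(wrapped_kek, toy_wrap(kek, dek)) for dek in deks]


def read_double_wrapped(kms, key_metadata):
    kek_cache = {}  # wrapped KEK -> unwrapped KEK, so each KEK hits the KMS once
    deks = []
    for wrapped_kek, wrapped_dek in key_metadata:
        if wrapped_kek not in kek_cache:
            kek_cache[wrapped_kek] = kms.unwrap(wrapped_kek)
        deks.append(toy_unwrap(kek_cache[wrapped_kek], wrapped_dek))
    return deks


file_deks = [os.urandom(16) for _ in range(1000)]

single_kms = CountingKms()
write_single_wrapped(single_kms, file_deks)      # 1000 KMS calls

double_kms = CountingKms()
metadata = write_double_wrapped(double_kms, file_deks)
calls_after_write = double_kms.calls             # 1 KMS call
recovered = read_double_wrapped(double_kms, metadata)
```

With these assumptions, writing 1000 files takes 1000 KMS calls with single
wrapping and one with double wrapping, and reading them back adds one more.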

- DDL clauses for encryption and key rotation, such as
ALTER TABLE .. KEY_ROTATION (params)
ALTER TABLE .. ENCRYPT (params): encrypts existing table (with plaintext
files) - Russell's proposal
CREATE TABLE ... ENCRYPTION (params) ; or simply use the TBLPROPERTIES
Btw, we can re-use the joint ORC/Parquet column encryption parameter
format, defined in this jira discussion started by Xinli -
https://issues.apache.org/jira/browse/HIVE-21848

- Cryptographic integrity of Data Tables
Besides protecting data confidentiality, we need to protect its integrity
against tampering attacks. This one is a longer-term work item, based on
these tickets:
https://github.com/apache/iceberg/issues/44,
https://github.com/apache/iceberg/issues/2060,
https://github.com/apache/iceberg/issues/2073
We'll work on these at a later stage, after the confidentiality basis is
ready; but we need to make sure the current work on confidentiality enables
(or at least doesn't block) the future integrity work. For example, we can
start using https://github.com/apache/iceberg/issues/2060 sooner rather
than later, for encrypting the Iceberg metadata files and Avro data files.

That was a high-level description; I'll add detailed comments inside the
design Google doc.

Cheers, Gidon


On Mon, Mar 22, 2021 at 7:25 PM Jack Ye <yezhao...@gmail.com> wrote:

> Hi everyone,
>
> To continue the discussion in the last sync meeting about encryption in
> Iceberg, here is the document for a proposal:
>
>
> https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4/edit?usp=sharing
>
> Would be very appreciated for any feedback.
>
> Best,
> Jack Ye
>
