Re: [Discuss] Document Snapshot Summary Optional Fields for Standardization

2025-01-02 Thread Honah J.
Hi everyone,

Happy new year! I've updated the proposal and the PR so that the optional
snapshot summary fields are documented in a new appendix in the table spec,
and I've addressed the review comments. You can find the links below:

   - proposal doc
   - PR #11660

Please take a moment to review them when you have the chance, and feel free
to share any thoughts or questions you may have.

Best regards,
Honah
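
For context, a snapshot summary with the kind of optional fields being
documented might look roughly like the following. This is only a sketch: the
keys are illustrative, based on what the Java implementation currently
writes, and the proposed appendix is what would make the exact names
normative.

    # Illustrative snapshot summary map (summary values are strings).
    summary = {
        "operation": "append",      # the only field required by the spec today
        "added-data-files": "3",
        "added-records": "1500",
        "added-files-size": "52340",
        "changed-partition-count": "2",
        "total-records": "120000",
        "total-data-files": "42",
        "total-delete-files": "0",
        "total-files-size": "10485760",
    }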

On Tue, Dec 17, 2024 at 10:02 PM Honah J.  wrote:

> Thank you all for the feedback!
>
> It appears we have reached a consensus on documenting the snapshot summary
> fields. Additionally, there is a preference to document these fields
> outside the main body of the spec and make sure they are not tied to the
> spec version.
>
> Two options have been suggested:
>
>1. Documenting them on a new page at the same level as table
>configuration.
>2. Including them in an appendix within the spec.
>
> Option 1 offers greater flexibility for future additions and
> modifications. However, snapshot summary fields might be too low-level to
> include alongside user-facing topics like Configuration, Schemas, and
> Partitioning. Moreover, referencing versioned documentation within the spec
> might not be feasible.
>
> Option 2 provides a more balanced approach, separating these details from
> the main spec while keeping them within the same document.
>
> I will update the proposal and PR to adopt option 2, moving these fields
> to an appendix in the spec.
>
> Thank you again for your valuable feedback!
>
> Best regards,
> Honah
>
> On Mon, Dec 16, 2024 at 5:26 AM Fokko Driesprong  wrote:
>
>> I'm in favor of this as well. While working on PyIceberg I had to deduce
>> this from the Java code; having a more condensed version in the appendix of
>> the spec would be great.
>>
>> Kind regards,
>> Fokko
>>
>> On Mon, Dec 16, 2024 at 14:21, Jean-Baptiste Onofré wrote:
>>
>>> Hi,
>>>
>>> Yes, I agree. I don't think we have to couple this to the spec version.
>>>
>>> Regards
>>> JB
>>>
>>> On Wed, Dec 11, 2024 at 11:17 PM Russell Spitzer
>>>  wrote:
>>> >
>>> > I want to float this back up; I think this is a really good idea for
>>> cross-engine support. I don't think we have to tie this to any specific
>>> spec version since these are just recommendations, so I think we can do
>>> this at any time.
>>> >
>>> > On Wed, Nov 27, 2024 at 1:31 PM Szehon Ho 
>>> wrote:
>>> >>
>>> >> This makes sense to me generally; I've tried a few times to search the
>>> spec for a list of possible snapshot summary properties, and was a bit
>>> surprised not to find them there. So I think this would be a nice
>>> addition.
>>> >>
>>> >> I'm curious if there's any historical reason it's not been included
>>> in the spec.
>>> >>
>>> >> Thanks
>>> >> Szehon
>>> >>
>>> >> On Wed, Nov 27, 2024 at 10:55 AM Kevin Liu 
>>> wrote:
>>> >>>
>>> >>> Thanks for driving this Honah!
>>> >>>
>>> >>> It's important to have a consistent naming scheme so that we don't
>>> need to worry about edge cases when using multiple engines, or potentially
>>> have to deal with migrations later.
>>> >>>
>>> >>> Also, since users can store arbitrary key/value pairs in the summary
>>> property, it's good to document the currently used properties to avoid
>>> collision.
>>> >>>
>>> >>> I like the proposal to document all properties in a "snapshot
>>> summary" table; this will provide a centralized place to view all possible
>>> key/value pairs, similar to how FileIO configuration is handled in
>>> iceberg-python. Other implementations can use this table as a reference.
>>> >>>
>>> >>>  > This approach offers flexibility, as new fields can be added
>>> through documentation updates without requiring specification changes.
>>> >>> This will save a lot of effort since specification changes require
>>> greater scrutiny.
>>> >>>
>>> >>> > summary details would not be located near the Snapshot section,
>>> which explains the summary field.
>>> >>> We can link the table to the Snapshot section.
>>> >>>
>>> >>>
>>> >>> Would love to hear others' thoughts on this.
>>> >>>
>>> >>> Best,
>>> >>> Kevin Liu
>>> >>>
>>> >>> On Tue, Nov 26, 2024 at 2:50 PM Honah J.  wrote:
>>> 
>>>  Hi everyone,
>>> 
>>>  I’d like to propose an addition to the table specification to
>>> document optional fields in the snapshot summary.
>>> 
>>>  Currently, the snapshot summary includes a required operation field
>>> and various optional fields. While these optional fields—such as metrics
>>> and partition-level summaries—are supported by Java and Python
>>> implementations, they are not officially documented. This creates risks of
>>> inconsistency as other implementations and engines adopt and interact with
>>> these fields.
>>> 
>>>  I propose adding a new section to the table specification to
>>> document these optional fields, ensuring consistent n

Re: [Discussion] Maintain vendor neutrality on the quickstart page

2025-01-02 Thread Yong Zheng
I also think that is a good idea and I'm interested in taking this up. But
before I do so, there are a couple of clarifications I would like. Based on
what I recall from tabulario/spark-iceberg, it has a few additional things
that are not in the official Spark image:
1. A few additional Python libs (e.g. Jupyter Notebook, PyIceberg, matplotlib,
etc.)
- For this one, if we switch to the official Spark image, do we want to build
our own custom image on top of it that also includes those? They are not
technically required, since we don't mention Jupyter Notebook on our page or
in the README.md.
2. A few notebooks published earlier by Tabular for simulating various
behaviors with Iceberg
- If we still want to use those notebooks, will we need to create a fork of
them, or should we write our own?
3. An IPython line/cell magic used in the Jupyter notebooks
- Same as above: only needed if we want to use the same set of notebooks.
4. IJava and Scala kernels included for Jupyter
- Same as above: only needed if we want to use the same set of notebooks.
5. A few Parquet files for a dummy dataset
- Same as above: only needed if we want to use the same set of notebooks.

That being said, if I take this on, should we remove the dependency on Jupyter
Notebook and those extra notebooks/kernels/libs, etc.?
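
For reference, the approach Fokko describes below (the official Apache Spark
image plus spark.jars.packages) might look roughly like the following from
PySpark. This is a sketch only: the runtime coordinate, versions, catalog
name, and REST URI are assumptions for illustration, not a tested quickstart.

    from pyspark.sql import SparkSession

    # Pull the Iceberg Spark runtime at session start instead of baking it
    # into a custom image; pin whatever versions the quickstart targets.
    spark = (
        SparkSession.builder
        .appName("iceberg-quickstart")
        .config("spark.jars.packages",
                "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "rest")
        # e.g. a local apache/iceberg-rest-fixture container
        .config("spark.sql.catalog.demo.uri", "http://localhost:8181")
        .getOrCreate()
    )

    spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.nyc")
    spark.sql(
        "CREATE TABLE IF NOT EXISTS demo.nyc.taxis (id bigint, fare double) "
        "USING iceberg"
    )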

On 2024/12/10 11:29:06 Fokko Driesprong wrote:
> Yes, that's exactly my motivation (sorry for not stating this explicitly
> earlier). Looking at the fact that the quickstart is currently outdated, I
> would be reluctant to introduce additional Docker images and/or
> repositories, since we need to update those as well.
> 
> Kind regards,
> Fokko
> 
> On Tue, Dec 10, 2024 at 11:48, Ajantha Bhat wrote:
> 
> > That's a good suggestion Fokko.
> > It would avoid maintaining one more docker image. We can update the
> > quickstart to use the docker image provided by Spark.
> >
> > - Ajantha
> >
> > On Tue, Dec 10, 2024 at 4:08 PM Fokko Driesprong  wrote:
> >
> >> Hey Ajantha,
> >>
> >> Thanks for bringing this up; we should both remove the vendor reference
> >> and bring this back up to date. My preference would be to rely on the
> >> Spark image provided by the Apache Spark project, similar to what we do
> >> for the Hive quickstart. We should be able to load all the
> >> Iceberg-specific JARs through the spark.jars.packages configuration.
> >>
> >> Kind regards,
> >> Fokko
> >>
> >> On Tue, Dec 10, 2024 at 11:16, Ajantha Bhat wrote:
> >>
> >>> The quickstart page is a critical touchpoint for new users and plays a
> >>> key role in driving project adoption.
> >>> Currently, it references *tabulario/spark-iceberg* and
> >>> *tabulario/iceberg-rest*.
> >>>
> >>> We’ve already replaced *tabulario/iceberg-rest* with the
> >>> community-maintained Docker image, *apache/iceberg-rest-fixture*, based
> >>> on the REST TCK fixture.
> >>>
> >>> However, *tabulario/spark-iceberg* seems outdated, and doesn't use the
> >>> latest Iceberg version.
> >>> To enhance the user experience and keep the quickstart aligned with
> >>> project standards, I suggest hosting it either under the /docker folder
> >>> in the Iceberg repository or as a subproject called
> >>> *apache/iceberg-playground*, where users can contribute to and maintain
> >>> other Docker images.
> >>>
> >>> The quickstart page should ideally reference images maintained by the
> >>> community rather than vendor-specific open-source projects.
> >>>
> >>> Thoughts?
> >>>
> >>> - Ajantha
> >>>
> >>
> 


Re: Changing default delete file granularity for Spark writes from partition to file scoped

2025-01-02 Thread Russell Spitzer
I think that makes sense to do. I'll do a review this morning.
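
For anyone following along, the granularity under discussion is controlled by
a table write property, so a job or table that regresses can opt back out, as
Amogh notes below. A minimal sketch (the property name is the one used by the
Java implementation and the table name is made up; verify against the docs
for the release you run):

    from pyspark.sql import SparkSession

    # Assumes a session already configured with an Iceberg catalog named
    # "demo"; the table property itself is the only point of this sketch.
    spark = SparkSession.builder.getOrCreate()

    spark.sql("""
        ALTER TABLE demo.db.events
        SET TBLPROPERTIES ('write.delete.granularity' = 'partition')
    """)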

On Mon, Dec 23, 2024 at 9:53 AM Amogh Jahagirdar <2am...@gmail.com> wrote:

> Hey all, it's been a while but I wanted to follow up on this thread in
> case there were any remaining thoughts on changing the default write
> granularity in Spark to be file scoped.
>
> Given the degenerate amplification case for partition scoped deletes and
> the additional benefits outlined earlier, I think it would be great if we
> could move towards this.
>
> I'd also like to point out that the latest version of my PR [1] changes
> the default to file-scoped deletes for all Spark 3.5 writes via
> SparkWriteConf rather than just new V2 tables. It's potentially a
> wider-impacting change, but considering the benefits mentioned earlier in
> the thread it seems worth it. In the event that any particular job has
> issues with this default change, the conf can be overridden.
>
> Please let me know what you think!
>
> [1] https://github.com/apache/iceberg/pull/11478
>
> Thanks,
>
> Amogh Jahagirdar
>
> On Tue, Nov 26, 2024 at 3:54 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>
>>
>>
>> Just following up on this thread,
>>
>> Getting numbers for various table layouts is involved and I think it
>> would instead be helpful to look at a degenerate read amplification case
>> arising from partition-scoped deletes.
>> Furthermore, there are some additional benefits to file-scoped deletes
>> worth calling out that benchmarking alone won't capture.
>>
>>
>> *Degenerate Read Amplification Case for Partition Scoped Deletes*
>> Consider a table with id and data columns, partitioned with bucket(10,
>> id) and 10 data files per bucket. With partition scoped deletes, if a
>> delete affects only two files within a bucket, any read touching the 8
>> unaffected files in a partition will unnecessarily fetch the delete file.
>> Generalized: Say there are D data files per partition, W writes performed,
>> and I impacted data files for deletes, where I << D
>>
>>- Partition-scoped: O(D * W) delete file reads in worst case for a
>>given partition because each write will produce a partition scoped delete
>>which would need to be read for each data file.
>>- File-scoped: O(I) delete file reads in worst case for a given
>>partition
>>
>> The key above is how read amplification with partition scoped deletes can
>> increase with every write that's performed, and this is further compounded
>> in aggregate by how many partitions are impacted as well.
>> The file-scoped deletes that need to be read scale independently of the
>> number of writes that are performed, since they're targeted per data file
>> and each delete is maintained on write to make sure there aren't multiple
>> delete files for a given data file.
>>
>> *Additional benefits to file scoped deletes*:
>>
>>    - Now that file-scoped deletes are being maintained as part of
>>    writes, old deletes will be removed from storage. With partition-scoped
>>    deletes, the delete file cannot be removed even if it's replaced for a
>>    given data file, since it can contain deletes for other data files.
>>- Moving towards file scoped deletes will help avoid unnecessary
>>conflicts with concurrent compactions/other writes.
>>    - There are other engines which already write file-scoped deletes;
>>    Anton mentioned Flink earlier in the thread, which also ties into point 1
>>    above, since much of the goal of writing file-scoped deletes was to avoid
>>    unnecessary conflicts with concurrent compactions. Trino is another
>>    engine which writes file-scoped position deletes as of today.
>>    - We know that V3 deletion vectors will be an improvement in the
>>    majority of cases when compared to existing position deletes. Considering
>>    this, and the fact that file-scoped deletes are closer to DVs as Anton
>>    mentioned, moving towards file-scoped deletes makes it easier to migrate
>>    from file-scoped deletes to DVs.
>>
>> It's worth clarifying that if there is some regression for a given
>> workload, it's always possible to set the property back to partition
>> scoped.
>>
>> Thanks,
>> Amogh Jahagirdar
>>
>> On Fri, Nov 15, 2024 at 11:54 AM Amogh Jahagirdar <2am...@gmail.com>
>> wrote:
>>
>>> Following up on this thread,
>>>
>>> > I don't think this is a bad idea from a theoretical perspective. Do we
>>> have any actual numbers to back up the change?
>>> There are no numbers yet; changing the default is largely driven by the
>>> fact that the previous downside of file-scoped deletes (many more delete
>>> files on disk) is now mitigated by sync maintenance in Spark.
>>>
>>> To get some more numbers, I think we'll first need to update some of our
>>> benchmarks to be more representative of typical

Re: There is no easy way to secure Iceberg data. How can we improve?

2025-01-02 Thread Steve Loughran
If the data is stored in S3 and someone has unrestricted access to a single
store containing all the data (the default without S3 Access Grants, Cloudera
Ranger extensions, or some other access-control mechanism that grants access
to clients without sharing credentials), then it's effectively impossible to
stop those clients from reading it.

Encryption of the Parquet data is about all you can do. I know Parquet
encryption has always cited cloud KMS hardware as a keystore (
https://parquet.apache.org/docs/file-format/data-pages/encryption/ ), but I
don't know of any implementations of that. Do that and you can secure column
access by restricting which IAM roles have decrypt permissions: these do
*not* have to be the same roles that can encrypt the data.
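
For concreteness, Spark's Parquet modular encryption hooks look roughly like
the sketch below. This uses Spark's built-in Parquet writer rather than the
Iceberg write path, just to show the moving parts; the KMS client class is a
placeholder you would have to implement against your cloud KMS (exactly the
gap noted above), and the key IDs, column names, and bucket are made up.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.hadoop.parquet.crypto.factory.class",
                "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
        # Hypothetical, user-provided KmsClient implementation backed by a cloud KMS.
        .config("spark.hadoop.parquet.encryption.kms.client.class",
                "com.example.MyCloudKmsClient")
        .getOrCreate()
    )

    # Encrypt the "ssn" column with one master key and the footer with another;
    # which IAM roles can decrypt is then a KMS key-policy decision.
    (spark.range(10).withColumnRenamed("id", "ssn")
        .write
        .option("parquet.encryption.column.keys", "pii_key:ssn")
        .option("parquet.encryption.footer.key", "footer_key")
        .parquet("s3a://example-bucket/secure-table/"))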

On Wed, 1 Jan 2025 at 18:51, Vladimir Ozerov 
wrote:

> Hi,
>
> Apache Iceberg can address multiple analytical scenarios, including ETL,
> streaming, ad-hoc queries, etc. One important obstacle in Iceberg
> integration nowadays is secure access to Iceberg tables across multiple
> tools and engines. There are several typical approaches to lakehouse
> security:
>
>1. Controlled environment. E.g., Google BigQuery has special
>readers/writers for open formats, tightly integrated with managed engines.
>Doesn't work outside of a specific cloud vendor.
>2. Securing storage. E.g., various S3 access policies. Works for
>individual files/buckets but can hardly address important access
>restrictions, such as column access permissions, masking, and filtering.
>Tightly integrated solutions, such as AWS S3 Tables, can potentially solve
>these, but this implies a cloud vendor lock-in.
>3. Catalog-level permissions. For example, a Tabular/Polaris role
>model, possibly with vended credentials or remote request signature. Works
>for coarse-grained access permissions but fails to deliver proper access
>control for individual columns, as well as masking and filtering.
>4. Centralized security service. E.g., Apache Ranger, OPA. It could
>provide whatever security permissions, but each engine must provide its own
>integration with the service. Also, some admins of such services usually
>have to duplicate access permissions between different engines. For
>example, the column masking policy for Trino in Apache Ranger will not work
>for Apache Spark.
>5. Securing data with virtual views. Works for individual engines, but
>not across engines. There is an ongoing discussion about common IR with
>Substrait, but given the complexity of engine dialects, we can hardly
>expect truly reusable views any time soon. Moreover, similarly to Apache
>Ranger, this shifts security decisions towards the engine, which is not
>good.
>
> To the best of my knowledge, the above-mentioned strategies are some of
> the "state-of-the-art"  techniques for secure lakehouse access. I would
> argue that none of these strategies are open, secure, interoperable, and
> convenient for end users simultaneously. Compare it with security
> management in monolithic systems, such as Vertica: execute a couple of SQL
> statements, done.
>
> Having a solid vision of a secure lakehouse could be a major advantage for
> Apache Iceberg. I would like to kindly ask the community about your
> thoughts on what the current major pain points are with the security of your
> Iceberg-based deployments, and what could be done at the Iceberg level to
> further improve it.
>
> My 5 cents. REST catalog is a very good candidate for a centralized
> security mechanism for the whole lakehouse, irrespective of the engine that
> accesses data. However, the security capabilities of the current REST
> protocol are limited. We can secure individual catalogs, namespaces, and
> tables. But we cannot:
>
>1. Define individual column permissions
>2. Apply column masking
>3. Apply row-level filtering
>
> Without solutions to these requirements, Iceberg will not be able to
> provide complete and coherent data access without resorting to third-party
> solutions or closed cloud vendor ecosystems.
>
> Given that data is organized in a columnar fashion in Parquet/ORC, which
> is oblivious to catalog and store, and Iceberg itself cannot evaluate
> additional filters, what can we do? Are there any iterative
> improvements that we can make to the Iceberg protocol to improve these? And
> is it Iceberg concern in the first place, or shall we refrain from going
> into this security rabbit hole?
>
> Several very rough examples of potential improvements:
>
>1. We can think about splitting table data into multiple files for
>column-level security and masking. For example, instead of storing columns
>[a, b, c] in the same Parquet file, we split them into three files: [a, b],
>[c], [c_masked]. Then, individual policies could be applied to these files
>at the catalog or storage layer. This requires a spec change.
>2. For row-level filtering, we can th

Re: There is no easy way to secure Iceberg data. How can we improve?

2025-01-02 Thread Jean-Baptiste Onofré
Hi Vladimir,

Thanks for starting this discussion.

I agree with you that the REST catalog "should" be the centralized
security mechanism (Polaris is a good example). However, we have two
challenges today:
- there's no enforcement to use the REST catalog. Some engines are
still directly accessing the metadata.json without going through a
catalog. Without "enforcing" catalog use (and especially REST
catalog), it's not really possible to have a centralized security
mechanism across engines.
- the "entity" permission model (table, view, namespace) lives on the
REST catalog implementation side (server side).

I think we are mixing two security layers here: the REST and entity
security (RBAC, etc) and the storage (credential vending).

Thinking aloud, I would consider the storage as "internal security"
and the REST catalog as "user-facing security". Why not consider
"enforcing" the REST catalog in the Iceberg ecosystem? It would
"standardize" the "user-facing security" (and the implementation can
provide credential vending for the storage).

Just my $0.01 :)

Regards
JB
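
As a concrete illustration of the "user facing security" path JB describes,
an engine or client that only ever talks to the REST catalog can have both
authentication and storage access handled server-side. A minimal sketch with
PyIceberg (the endpoint, catalog name, credential, and table are made up, and
whether storage credentials are actually vended depends on the server
implementation):

    from pyiceberg.catalog import load_catalog

    # The client authenticates to the catalog; the catalog decides what the
    # caller may see and can vend scoped storage credentials per table.
    catalog = load_catalog(
        "prod",
        **{
            "type": "rest",
            "uri": "https://catalog.example.com",
            "credential": "client-id:client-secret",
        },
    )

    table = catalog.load_table("analytics.events")
    # Scans then use whatever storage access the catalog granted, rather than
    # credentials distributed directly to every engine.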

On Wed, Jan 1, 2025 at 7:51 PM Vladimir Ozerov  wrote:
>
> Hi,
>
> Apache Iceberg can address multiple analytical scenarios, including ETL, 
> streaming, ad-hoc queries, etc. One important obstacle in Iceberg integration 
> nowadays is secure access to Iceberg tables across multiple tools and 
> engines. There are several typical approaches to lakehouse security:
>
> Controlled environment. E.g., Google BigQuery has special readers/writers for 
> open formats, tightly integrated with managed engines. Doesn't work outside 
> of a specific cloud vendor.
> Securing storage. E.g., various S3 access policies. Works for individual 
> files/buckets but can hardly address important access restrictions, such as 
> column access permissions, masking, and filtering. Tightly integrated 
> solutions, such as AWS S3 Tables, can potentially solve these, but this 
> implies a cloud vendor lock-in.
> Catalog-level permissions. For example, a Tabular/Polaris role model, 
> possibly with vended credentials or remote request signature. Works for 
> coarse-grained access permissions but fails to deliver proper access control 
> for individual columns, as well as masking and filtering.
> Centralized security service. E.g., Apache Ranger, OPA. It could provide 
> whatever security permissions, but each engine must provide its own 
> integration with the service. Also, some admins of such services usually have 
> to duplicate access permissions between different engines. For example, the 
> column masking policy for Trino in Apache Ranger will not work for Apache 
> Spark.
> Securing data with virtual views. Works for individual engines, but not 
> across engines. There is an ongoing discussion about common IR with 
> Substrait, but given the complexity of engine dialects, we can hardly expect 
> truly reusable views any time soon. Moreover, similarly to Apache Ranger, 
> this shifts security decisions towards the engine, which is not good.
>
> To the best of my knowledge, the above-mentioned strategies are some of the 
> "state-of-the-art"  techniques for secure lakehouse access. I would argue 
> that none of these strategies are open, secure, interoperable, and convenient 
> for end users simultaneously. Compare it with security management in 
> monolithic systems, such as Vertica: execute a couple of SQL statements, done.
>
> Having a solid vision of a secure lakehouse could be a major advantage for 
> Apache Iceberg. I would like to kindly ask the community about your thoughts 
> on what the current major pain points are with the security of your 
> Iceberg-based deployments, and what could be done at the Iceberg level to 
> further improve it.
>
> My 5 cents. REST catalog is a very good candidate for a centralized security 
> mechanism for the whole lakehouse, irrespective of the engine that accesses 
> data. However, the security capabilities of the current REST protocol are 
> limited. We can secure individual catalogs, namespaces, and tables. But we 
> cannot:
>
> Define individual column permissions
> Apply column masking
> Apply row-level filtering
>
> Without solutions to these requirements, Iceberg will not be able to provide 
> complete and coherent data access without resorting to third-party solutions 
> or closed cloud vendor ecosystems.
>
> Given that data is organized in a columnar fashion in Parquet/ORC, which is 
> oblivious to catalog and store, and Iceberg itself cannot evaluate additional 
> filters, what can we do? Are there any iterative improvements that we can 
> make to the Iceberg protocol to improve these? And is it Iceberg concern in 
> the first place, or shall we refrain from going into this security rabbit 
> hole?
>
> Several very rough examples of potential improvements:
>
> We can think about splitting table data into multiple files for column-level 
> security and masking. For example, instead of storing columns [a, b, c] in 
> the same Parquet file, we split them int