Re: [DISCUSS] spec: remove the file scan task JSON serialization section from table spec

2024-04-22 Thread Steven Wu
Still need some help reviewing the PR that reverted/removed the JSON spec for content file and file scan task: https://github.com/apache/iceberg/pull/9771/files On Wed, Feb 21, 2024 at 4:01 PM Jack Ye wrote: > I see. I was asking for the devlist discussion history, because this is > related

Re: [Proposal] Add support for Materialized Views in Iceberg

2024-04-22 Thread Ryan Blue
+1 for separate table and view objects and not needing to introduce unnecessary combined APIs. On Mon, Apr 22, 2024 at 1:51 PM Szehon Ho wrote: > +1 for the approach given it reduces the work. On this, as it exposes > storage tables to user catalog, I was mainly thinking we should have a > comm

Re: FlinkFileIO implementation

2024-04-22 Thread Ryan Blue
I don't think introducing a Flink-specific FileIO is a good idea. The intent of the Java API is for a table to use the FileIO instance that is supplied by the table object. That puts the responsibility for supplying a correctly configured FileIO on the catalog, which is the right place to i

Re: [Proposal] Add support for Materialized Views in Iceberg

2024-04-22 Thread Szehon Ho
+1 for the approach given it reduces the work. On this, as it exposes storage tables to the user catalog, I was mainly thinking we should have a common suffix/naming pattern for storage tables across catalogs. The Netflix approach sounds good to me. Hope we can continue the proposal, as there's still

Re: [VOTE] Release Apache Iceberg 1.5.1 RC0

2024-04-22 Thread Szehon Ho
+1 (binding) * Verify signature * Verify checksum * Verify licenses * Build and run basic test with Spark 3.5 Thanks Szehon On Sun, Apr 21, 2024 at 11:45 PM Ajantha Bhat wrote: > +1 (non-binding) > > * validated checksum and signature > * checked license docs & ran RAT checks > * ran build and

Re: How to Set S3 Credentials at bucket level in Iceberg Spark Session

2024-04-22 Thread Pani Dhakshnamurthy
Hi Awasthi, S3A supports setting credentials at the S3 bucket level - ref: https://docs.cloudera.com/runtime/7.2.0/cloud-data-access/topics/cr-cda-configuring-per-bucket-settings.html . I am not sure if S3FileIO supports this feature. Thanks Pani On Mon, Apr 22, 2024 at 2:01 PM Yufei Gu wro
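
A minimal sketch of the S3A per-bucket settings mentioned above, assuming the Hadoop S3A connector is the filesystem in use (as the message notes, it is unclear whether S3FileIO honors these keys). The helper name and bucket names are hypothetical; the `fs.s3a.bucket.<bucket>.*` keys follow the S3A per-bucket convention, with the `spark.hadoop.` prefix so Spark forwards them to Hadoop.

```python
def s3a_per_bucket_conf(bucket, access_key, secret_key):
    """Build Spark conf entries carrying S3A credentials for one bucket.

    Hypothetical helper for illustration only. It returns a plain dict,
    so nothing is applied to a Spark session until the caller does so.
    """
    prefix = f"spark.hadoop.fs.s3a.bucket.{bucket}."
    return {
        prefix + "access.key": access_key,
        prefix + "secret.key": secret_key,
    }


# Placeholder bucket names and credentials (assumptions, not real values).
conf = {}
conf.update(s3a_per_bucket_conf("source-bucket", "SRC_KEY", "SRC_SECRET"))
conf.update(s3a_per_bucket_conf("target-bucket", "TGT_KEY", "TGT_SECRET"))
```

Each entry could then be passed to `SparkSession.builder.config(key, value)`. This only affects paths read through S3A, not through S3FileIO.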

Re: How to Set S3 Credentials at bucket level in Iceberg Spark Session

2024-04-22 Thread Yufei Gu
Hi Awasthi, How about configuring two catalogs in Spark? One points to the source data, and another points to the target. You can configure different credentials in that case. Yufei On Mon, Apr 22, 2024 at 8:49 AM Awasthi, Somesh wrote: > Hi Jack/Dev Team, > > > > We want to pass separate cr
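
The two-catalog suggestion above could be sketched roughly as follows. This is an illustration under assumptions, not a verified setup: the helper name, catalog names, warehouse paths, and credentials are hypothetical placeholders, and it assumes the Glue catalog with S3FileIO and static `s3.access-key-id` / `s3.secret-access-key` catalog properties.

```python
def glue_catalog_conf(name, warehouse, access_key, secret_key):
    """Build Spark conf entries for one Iceberg catalog backed by Glue.

    Hypothetical helper for illustration only. The S3 credentials are
    scoped to this catalog, so a second catalog can use different ones.
    """
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.s3.access-key-id": access_key,
        f"{prefix}.s3.secret-access-key": secret_key,
    }


# One catalog for reading the source, another for writing the target.
# All names and values below are placeholders (assumptions).
spark_conf = {}
spark_conf.update(
    glue_catalog_conf("source", "s3://source-bucket/warehouse", "SRC_KEY", "SRC_SECRET")
)
spark_conf.update(
    glue_catalog_conf("target", "s3://target-bucket/warehouse", "TGT_KEY", "TGT_SECRET")
)
```

With such a configuration, tables would be addressed as `source.db.table` and `target.db.table`, and each catalog would use its own S3 credentials for data file access. How the Glue client itself authenticates is left out of this sketch.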

RE: FlinkFileIO implementation

2024-04-22 Thread Ferenc Csaky
Hi Peter, I am coming from the Flink side, but at Cloudera we also use Iceberg. Utilizing the Flink delegation token framework via the Iceberg Java API would be great. I think that simplifying the configuration for Flink-related cases also has value on its own, and could help to eliminate some c

How to Set S3 Credentials at bucket level in Iceberg Spark Session

2024-04-22 Thread Awasthi, Somesh
Hi Jack/Dev Team, We want to use one set of credentials for reading source data from S3 and a separate set for writing target data to S3, using the Glue catalog, but we are currently unable to set credentials at the bucket level and have not been able to get help from any forum. Could you please check and help me a

Re: Query regarding UPSERT Mode in Flink

2024-04-22 Thread Péter Váry
Very high change rate means that it is most probably worth it to rewrite the data files from time to time. The high change rate itself causes write amplification, as every new version of the record is essentially a new row to the table (and a tombstone for the old data). The ConvertEqualityDeleteFi

Re: Iceberg table maintenance

2024-04-22 Thread Péter Váry
Hi Nathan Ma, Thanks for joining the discussion! 1-2. In the other thread ( https://lists.apache.org/thread/10mdf9zo6pn0dfq791nf4w1m7jh9k3sl) we discussed the resource consumption/buffers/checkpointing in detail. We agreed that even if we try to do everything to separate the stream processing fro

Re: Query regarding UPSERT Mode in Flink

2024-04-22 Thread Aditya Narayan Gupta
Hi Péter, Thanks for the detailed answers. We have some CDC streams with a very high change rate; for such tables we were thinking of leveraging ConvertEqualityDeleteFiles t

Re: Iceberg table maintenance

2024-04-22 Thread nathan ma
Hi guys, thanks for raising this wonderful thread, and thanks to @ peter.vary.apa...@gmail.com, the doc really makes things clear. I have several questions to catch up on, and I didn't find answers in the document. I would appreciate it if anyone could give some feedback, because the issues are really importa

Re: Iceberg table maintenance

2024-04-22 Thread Péter Váry
Hi Team, The discussion continued on the previous thread by Gen Luo and Zhu Zhu. See: https://lists.apache.org/thread/10mdf9zo6pn0dfq791nf4w1m7jh9k3sl. Adding them to this thread too, so we can continue the discussion in one place. Based on their thoughts there, the current suggestion is: Q: Do y

Re: Equality deletes with Flink - design question

2024-04-22 Thread Manu Zhang
Hi, Without knowledge of this history, my reasoning is we will have deduplication across partitions if partition col is not included in the identifier fields (pks). That doesn't look right to me. For example, if a user table has uid as primary key and dt as partition, and user data are inserted e

Re: Query regarding UPSERT Mode in Flink

2024-04-22 Thread Péter Váry
Hi Aditya, See my answers below: Aditya Narayan Gupta wrote (on Sat, Apr 20, 2024, 11:05): > Hi Péter, Gabor, > > Thanks a lot for clarifying and providing additional information, I had > a few follow-up queries: > 1. We want to ingest a CDC stream using Flink to Iceberg sink, if we >