Proposal: Support for views in Iceberg

2021-07-19 Thread Anjali Norwood
Hello, John Zhuge and I would like to propose the following spec for storing view metadata in Iceberg. The proposal has been implemented [1] and has been in production at Netflix for over 15 months. https://docs.google.com/document/d/1wQt57EWylluNFdnxVxaCkCSvnWlI8vVtwfnPjQ6Y7aw/edit?usp=sharing [1] ht

Re: Reading metadata tables

2021-07-19 Thread Peter Vary
Thanks Ryan for checking this out! IcebergWritable wraps a Record to a Container, and a Writable, so that is why I try to create a Record here. The problem is that the metadata table scan returns a StructLike and I have to match that with the metadata schema and then with the read schema. I have

Re: Reading metadata tables

2021-07-19 Thread Ryan Blue
Peter, The "data" tasks produce records using Iceberg's Record class and the internal representations. I believe that's what the existing Iceberg object inspectors use. Couldn't you just wrap this with an IcebergWritable and use the regular object inspectors? On Thu, Jul 15, 2021 at 8:53 AM Peter

Re: [DISCUSS] Adopting the v2 spec changes

2021-07-19 Thread Ryan Blue
I'll reply inline: On Thu, Jul 15, 2021 at 1:46 PM Daniel Weeks wrote: > Overall, I'm in favor of what Ryan has proposed for v2 above. There are a > few points that I'm not entirely clear on, which may warrant discussion: > > 1) The discussion on resurfacing distinct_count may be something we

Re: string bucketing compatibility issue

2021-07-19 Thread Ryan Blue
Thanks, Piotr! I've added this to the agenda for our next sync. I think the main question is whether we think users have hit this problem or not. If we don't think that it is something users have probably hit, then we can just fix the problem. If we think someone has data stored with the incorrect

Re: Java Deserialization Vulnerability

2021-07-19 Thread Ryan Blue
Yes, I think so. Sounds like an unsecured bucket could lead to code execution running with data infrastructure privileges. While it isn't exactly Flink's problem, we should probably treat this like a potential privilege escalation issue. How does Flink handle this for other cases? I think it would

Re: Java Deserialization Vulnerability

2021-07-19 Thread Steven Wu
Let's assume the Flink checkpoint state is uploaded to S3. An attacker needs to be able to read from and write to S3 to manipulate the S3 files. Is this the scenario we are concerned about? On Mon, Jul 19, 2021 at 3:51 PM Ryan Blue wrote: > Thanks, Steven. Do you think that there is a potential pro

Re: Java Deserialization Vulnerability

2021-07-19 Thread Ryan Blue
Thanks, Steven. Do you think that there is a potential problem with an attacker having access to where the state is stored and using that to inject code? Is this something we should just update to avoid it entirely? On Mon, Jul 19, 2021 at 3:43 PM Steven Wu wrote: > I believe Flink source is the

Re: Java Deserialization Vulnerability

2021-07-19 Thread Steven Wu
I believe Flink source is the only place that uses Java serialization for checkpoint state: https://github.com/apache/iceberg/issues/1698. @OpenInx already updated Flink sink to avoid Java serialization (a long time ago) On Mon, Jul 19, 2021 at 1:53 PM Jack Ye wrote: > Yes I totally agree t

Re: Iceberg 0.12.0 Release Plan

2021-07-19 Thread Szehon Ho
Hi Carl, For the Issue: https://github.com/apache/iceberg/issues/2783 The status is: I gave it a bit of a try but couldn’t find an easy fix, so I am hoping someone more knowledgeable about this code has cycles to take a look at it. It would be great to fix it for 0.12 as it seems to block more metadata qu

Re: Iceberg 0.12.0 Release Plan

2021-07-19 Thread Carl Steinbach
Hi Everyone, Currently, there are three issues blocking the release of 0.12.0: 1. #2308 Handle the case that RewriteFiles and RowDelta commit the transaction at the same time 2. #2783 Metadata Table Empty Projection - Unknown type for i

Re: Java Deserialization Vulnerability

2021-07-19 Thread Jack Ye
Yes, I totally agree that the distributed system itself should ensure the integrity of objects passing across nodes. I am more concerned about the Flink case where some information is persisted and can be modified to execute arbitrary code. Maybe people working on Flink can comment on this a bit

Re: Iceberg 0.12.0 Release Plan

2021-07-19 Thread Jack Ye
I haven't heard any news about the 0.12.0 release since then. Are we still planning for the release? Please let me know if there is anything we can do to help speed up the process. (I just saw the release board, will try to at least review those PRs) Best, Jack Ye On Mon, Jul 12, 2021 at 5:41 PM S

Re: Java Deserialization Vulnerability

2021-07-19 Thread Ryan Blue
Jack, I might be incorrect here, but I'll at least throw out some thoughts. If I understand correctly, the attacker requires access to modify some serialized object so that deserialization leads to arbitrary code execution. I think that the best way to protect against that is to avoid making it po

Re: string bucketing compatibility issue

2021-07-19 Thread Piotr Findeisen
Hi, I've filed https://github.com/apache/iceberg/issues/2837 for this as well. Best PF On Sat, Jul 17, 2021 at 12:48 AM Piotr Findeisen wrote: > Hi, > > It was discovered by @Mateusz Gajewski > that Iceberg bucketing > transformation for string isn't regular Murmur3 32-bit hash. > > Upon cl