Re: [Discuss] Replace Hadoop Catalog Examples with JDBC Catalog in Documentation

2024-10-16 Thread Renjie Liu
Hi: Here's what I propose as a middle-ground. > >1. We replace the Hadoop catalog example with a JDBC catalog backed by >an in-memory datastore. This allows users to get started without needing >additional infrastructure, which was one of the main benefits of the Hadoop >catalog. >

Re: [VOTE] Table V3 Spec: Row Lineage

2024-10-16 Thread Anton Okolnychyi
Late +1 from me too. On Mon, Oct 14, 2024 at 12:45, Russell Spitzer wrote: > Vote has completed and the Proposal is approved > > +1 Votes From > > Steven Wu > Yufei Gu > Amogh Jahagirdar > Jack Ye > Daniel Weeks > Eduard Tudenhöfner > Jean-Baptiste Onofré > > > On Thu, Oct 10, 2024 at 8:11 PM Steve

Re: [PROPOSAL] Add manifest-level statistics for CBO estimation

2024-10-16 Thread Anton Okolnychyi
Does the doc suggest it is too expensive to aggregate min/max stats after planning files (i.e., after loading matching files in memory)? Do we have any benchmarks to refer to? We will have to read manifests for planning anyway, right? Also, the doc proposes to add column-level stats to the manifest

Re: [DISCUSS] Discrepancy Between Iceberg Spec and Java Implementation for Snapshot summary's 'operation' key

2024-10-16 Thread Anton Okolnychyi
Based on [1], we never persisted the operation in the summary map. Instead, we persisted it as a top-level field in Java, which is actually NOT what the spec says. Does anyone remember cases when the operation was unknown? I personally don't. [1] - https://github.com/apache/iceberg/blob/17f1c4d220

Re: [DISCUSS] [PyIceberg] Use of asserts to "programming the negative space"

2024-10-16 Thread André Luis Anastácio
> These are not my recommendations to use. These are only my recommendations > for deeper research if we are about to roll something on our own. It sounds > unlikely that such a fundamental need is not addressed in Python ecosystem. Agreed, I was already aware of all the options you provided. Ho

[DISCUSS] Discrepancy Between Iceberg Spec and Java Implementation for Snapshot summary's 'operation' key

2024-10-16 Thread Kevin Liu
Hey folks, I’ve noticed a discrepancy between the Iceberg specification and the Java implementation regarding the `operation` key in the `Snapshot` `summary` field. The `Snapshot` object's `summary` dictionary includes a *required* key named `operation`, as outlined in the spec describing Table M

Re: Spec changes for deletion vectors

2024-10-16 Thread Daniel Weeks
Hey Everyone, I feel like at this point we've articulated all of the various options and paths forward, but this really just comes down to a matter of whether we want to make a concession here for the purpose of compatibility. If we were building this with no prior art, I would expect to omit the

Re: [Discuss] Replace Hadoop Catalog Examples with JDBC Catalog in Documentation

2024-10-16 Thread Kevin Liu
Hey folks, Thanks for the discussions. It seems everyone is in favor of replacing the Hadoop catalog example, and the question now is whether to replace it with the JDBC catalog or the REST catalog. I originally proposed the JDBC catalog as a replacement primarily due to its ease of use. User

Re: [DISCUSS] [PyIceberg] Use of asserts to "programming the negative space"

2024-10-16 Thread Kevin Liu
Thanks for starting this discussion! I think the defensive programming approach is useful to maintain assumptions, especially in some public-facing APIs. Here is an example I recently encountered [1]; we currently disallow using the `add_files` API for parquet files with field IDs. However, I'm not

Re: Spec changes for deletion vectors

2024-10-16 Thread rdb...@gmail.com
Thanks, Russell, for the clear summary of the pros and cons! I agree there's some risk to Iceberg implementations, but I think that is mitigated somewhat by code reuse. For example, an engine like Trino could simply reuse code for reading Delta bitmaps, so we would get some validation and support mo

Re: Spec changes for deletion vectors

2024-10-16 Thread Micah Kornfield
One small point > Theoretically we could end up with iceberg implementers who have bugs in > this part of the code and we wouldn’t even know it was an issue till > someone converted the table to delta. I guess we could mandate readers validate all fields here to make sure they are all consistent

Re: [DISCUSS] [PyIceberg] Use of asserts to "programming the negative space"

2024-10-16 Thread Piotr Findeisen
Hi Andre, My Python skills aren't up to date, so I will abstain from recommending a particular solution. Writing a precondition module sounds like a fun task, but perhaps we could research alternatives first. For example, a quick Google search brought me to https://pypi.org/project/preconditions/ htt