Hi,
Here's what I propose as a middle ground.
>
>1. We replace the Hadoop catalog example with a JDBC catalog backed by
>an in-memory datastore. This allows users to get started without needing
>additional infrastructure, which was one of the main benefits of the Hadoop
>catalog.
>
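For concreteness, getting started could look something like this in
PyIceberg, whose SqlCatalog is the Python analogue of the JDBC catalog
(the catalog name and warehouse path below are illustrative, not a
proposal for the docs):

from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField

# A catalog backed by an in-memory SQLite database: nothing to install
# or run beforehand, though it only lives as long as the process.
catalog = SqlCatalog(
    "default",
    uri="sqlite:///:memory:",
    warehouse="file:///tmp/warehouse",
)

catalog.create_namespace("docs_example")
table = catalog.create_table(
    "docs_example.numbers",
    schema=Schema(NestedField(1, "n", LongType(), required=False)),
)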
Late +1 from me too.
On Mon, Oct 14, 2024 at 12:45 Russell Spitzer wrote:
> The vote has completed and the proposal is approved.
>
> +1 Votes From
>
> Steven Wu
> Yufei Gu
> Amogh Jahagirdar
> Jack Ye
> Daniel Weeks
> Eduard Tudenhöfner
> Jean-Baptiste Onofré
>
>
> On Thu, Oct 10, 2024 at 8:11 PM Steve
Does the doc suggest it is too expensive to aggregate min/max stats after
planning files (i.e. after loading matching files in memory)? Do we have
any benchmarks to refer to? We will have to read manifests for planning
anyway, right?
Also, the doc proposes to add column-level stats to the manifest
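On the cost question: aggregating min/max after planning is roughly this
shape in PyIceberg (a sketch; the table handle, field id, and the
assumption of a long column are mine):

from pyiceberg.conversions import from_bytes
from pyiceberg.types import LongType

# Fold a column's lower/upper bounds across the files that survive scan
# planning; the bounds are the byte-encoded values from manifest entries.
def aggregate_bounds(tbl, field_id: int):
    lo, hi = None, None
    for task in tbl.scan().plan_files():
        lower = task.file.lower_bounds.get(field_id)
        upper = task.file.upper_bounds.get(field_id)
        if lower is not None:
            v = from_bytes(LongType(), lower)
            lo = v if lo is None else min(lo, v)
        if upper is not None:
            v = from_bytes(LongType(), upper)
            hi = v if hi is None else max(hi, v)
    return lo, hi

The point being that this walks exactly the manifests we already read for
planning, so any extra cost would come from decoding bounds, not extra I/O.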
Based on [1], we never persisted the operation in the summary map. Instead,
we persisted it as a top-level field in Java, which is actually NOT what
the spec says. Does anyone remember cases where the operation was unknown? I
personally don't.
[1] -
https://github.com/apache/iceberg/blob/17f1c4d220
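To make the two shapes concrete (illustrative values; the second shape
reflects the claim above about what the Java implementation persists):

# What the spec describes: `operation` is a required key inside `summary`.
spec_snapshot = {
    "snapshot-id": 3051729675574597004,
    "timestamp-ms": 1515100955770,
    "summary": {"operation": "append"},
    "manifest-list": "s3://bucket/metadata/snap-1.avro",
}

# What [1] suggests Java actually wrote: `operation` as a top-level
# field rather than an entry in the summary map.
java_snapshot = {
    "snapshot-id": 3051729675574597004,
    "timestamp-ms": 1515100955770,
    "operation": "append",
    "summary": {},
    "manifest-list": "s3://bucket/metadata/snap-1.avro",
}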
> These are not my recommendations to use; they are only recommendations
> for deeper research if we are about to roll something on our own. It
> sounds unlikely that such a fundamental need is not addressed in the
> Python ecosystem.
Agreed, I was already aware of all the options you provided. Ho
Hey folks,
I’ve noticed a discrepancy between the Iceberg specification and the Java
implementation regarding the `operation` key in the `Snapshot` `summary`
field.
The `Snapshot` object's `summary` dictionary includes a *required* key
named `operation`, as outlined in the spec describing Table M
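A reader that wants to stay compatible with both layouts could resolve the
key defensively; a minimal sketch (not any implementation's actual parser):

def snapshot_operation(snapshot_json: dict) -> str:
    # Resolve `operation` whether it sits inside `summary` (per the
    # spec) or at the top level (as Java-written metadata reportedly may).
    summary = snapshot_json.get("summary") or {}
    op = summary.get("operation") or snapshot_json.get("operation")
    if op is None:
        raise ValueError("snapshot is missing the required 'operation' key")
    return op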
Hey Everyone,
I feel like at this point we've articulated all of the various options and
paths forward, but this really just comes down to a matter of whether we
want to make a concession here for the purpose of compatibility.
If we were building this with no prior art, I would expect to omit the
Hey folks,
Thanks for the discussions.
It seems everyone is in favor of replacing the Hadoop catalog example, and
the question now is whether to replace it with the JDBC catalog or the REST
catalog.
I originally proposed the JDBC catalog as a replacement primarily due to
its ease of use. User
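For comparison, the REST option is also a one-liner on the client side,
but it presumes a catalog service is already running somewhere (the URI
below is a placeholder):

from pyiceberg.catalog.rest import RestCatalog

# Client-side this is trivial, but unlike the JDBC/SQL option it needs
# a REST catalog server to be up before the example works.
catalog = RestCatalog("default", uri="http://localhost:8181")
print(catalog.list_namespaces())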
Thanks for starting this discussion! I think the defensive programming
approach is useful to maintain assumptions, especially in some
public-facing APIs. Here is an example I recently encountered [1]; we
currently disallow using the `add_files` API for parquet files with
field IDs. However, I'm not
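That add_files check is a good example of a precondition at an API
boundary; in spirit it is roughly the following (a pyarrow-based sketch,
not the actual PyIceberg code):

import pyarrow.parquet as pq

def assert_no_field_ids(path: str) -> None:
    # Reject Parquet files whose schema already carries field IDs,
    # which pyarrow surfaces as 'PARQUET:field_id' field metadata.
    schema = pq.read_schema(path)
    for field in schema:
        if field.metadata and b"PARQUET:field_id" in field.metadata:
            raise ValueError(
                f"cannot add_files {path}: field {field.name!r} has a field ID"
            )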
Thanks, Russell, for the clear summary of the pros and cons! I agree there's
some risk to Iceberg implementations, but I think that is mitigated
somewhat by code reuse. For example, an engine like Trino could simply
reuse code for reading Delta bitmaps, so we would get some validation and
support mo
One small point:
> Theoretically we could end up with iceberg implementers who have bugs in
> this part of the code and we wouldn’t even know it was an issue till
> someone converted the table to delta.
I guess we could mandate that readers validate all fields here to make sure they
are all consistent
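For the cardinality field, for instance, a reader-side check could look
like this (a rough sketch using the third-party pyroaring package; it
glosses over the magic-bytes framing and 64-bit layout that real deletion
vectors use):

from pyroaring import BitMap

def check_dv_consistency(blob: bytes, stored_cardinality: int) -> None:
    # Cross-check the cardinality recorded in metadata against the
    # cardinality of the decoded bitmap itself.
    bitmap = BitMap.deserialize(blob)
    if len(bitmap) != stored_cardinality:
        raise ValueError(
            f"deletion vector cardinality mismatch: "
            f"stored={stored_cardinality}, actual={len(bitmap)}"
        )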
Hi Andre,
My Python skills aren't up to date, so I will abstain from recommending a
particular solution.
Writing a precondition module sounds like a fun task, but perhaps we could
research alternatives first.
For example, a quick Google search brought me to
https://pypi.org/project/preconditions/
htt
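If we did end up rolling our own, the core really is small; a minimal
hand-rolled sketch (the names here are mine, not a proposed API):

import functools

def require(predicate, message):
    # Precondition decorator: evaluate `predicate` against the call's
    # arguments and fail fast with a clear error when it doesn't hold.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not predicate(*args, **kwargs):
                raise ValueError(f"{fn.__name__}: precondition failed: {message}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@require(lambda n: n >= 0, "n must be non-negative")
def isqrt_floor(n: int) -> int:
    return int(n ** 0.5)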