> > >> writing in the long run.
> > >>
> > >> Thanks,
> > >> Ajantha
> > >>
> > >> On Wed, May 17, 2023 at 1:38 AM Mayur Srivastava <
> > >> mayur.srivast...@twosigma.com> wrote:
> > >>
> > >>> I agree, it tot
dified time” per
> >>> partition is implemented.
> >>>
> > >>> I’m concerned about performance of computing partition stats (and
> >>> storage + the size of table metadata files) if the implementation requires
> >>> users to keep around all snapsh
agree, it totally depends on the way “last modified time” per
>>> partition is implemented.
>>>
>>> I’m concerned about performance of computing partition stats (and
>>> storage + the size of table metadata files) if the implementation requires
>>> users to k
erned about performance of computing partition stats (and storage
>> + the size of table metadata files) if the implementation requires users to
>> keep around all snapshots. (I described one of my use case in this thread
>> earlier.)
>>
>>
>>
>> *Fr
he implementation requires users to
> keep around all snapshots. (I described one of my use case in this thread
> earlier.)
>
>
>
> *From:* Pucheng Yang
> *Sent:* Monday, May 15, 2023 11:46 AM
> *To:* dev@iceberg.apache.org
> *Subject:* Re: [Proposal] Partition stats in
case in this thread earlier.)
From: Pucheng Yang
Sent: Monday, May 15, 2023 11:46 AM
To: dev@iceberg.apache.org
Subject: Re: [Proposal] Partition stats in Iceberg
Hi Mayur, can you elaborate on your concern? I don't know how this is going to be
implemented so not sure where the performance iss
*From:* Ryan Blue
> *Sent:* Wednesday, May 3, 2023 2:00 PM
> *To:* dev@iceberg.apache.org
> *Subject:* Re: [Proposal] Partition stats in Iceberg
>
>
>
> Mayur, your use case may require a lot of snapshots, but we generally
> recommend expiring them after a few days. You ca
Blue
Sent: Wednesday, May 3, 2023 2:00 PM
To: dev@iceberg.apache.org
Subject: Re: [Proposal] Partition stats in Iceberg
Mayur, your use case may require a lot of snapshots, but we generally recommend
expiring them after a few days. You can tag snapshots to keep them around
longer than that.
On
>>>>>> use the change log (CDC) but I think that is too heavy (I guess, since it
>>>>>> requires to run SparkSQL procedure) and it is over do the work (I don't
>>>>>> need what rows are changed, I just need true or false for whether a
>>>>>
>>>>> requires to run SparkSQL procedure) and it is over do the work (I don't
>>>>> need what rows are changed, I just need true or false for whether a
>>>>> partition is changed).
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>>>
>>>>> Thanks Ajantha.
>>>>>
>>>>>
>>>>>
>>>>> > It should be very easy to add a few more fields to it like the
>>>>> latest sequence number or last modified time per partition.
>>>>>
>>>>
>>>>
>>>> Thanks Ajantha.
>>>>
>>>>
>>>>
>>>> > It should be very easy to add a few more fields to it like the latest
>>>> sequence number or last modified time per partition.
>>>>
>>>>
>>>>
>>>> add a few more fields to it like the latest sequence number or last
>>>> modified time per partition.
>>>> I will be opening up the discussion about phase 2 schema again once
>>>> phase 1 implementation is done.
>>>>
>>>> Thanks,
>>>>
>>> likely to be available in Iceberg partition stats? Note that we would like
>>> to avoid compaction change the sequence number or modified time stats.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Mayur
>>>
>>>
>>>
change the sequence number or modified time stats.
>>
>>
>>
>> Thanks,
>>
>> Mayur
>>
>>
>>
>> *From:* Ajantha Bhat
>> *Sent:* Tuesday, February 7, 2023 10:02 AM
>> *To:* dev@iceberg.apache.org
>> *Subject:* Re: [Proposal
in Iceberg partition stats? Note that we would like
> to avoid compaction change the sequence number or modified time stats.
>
>
>
> Thanks,
>
> Mayur
>
>
>
> *From:* Ajantha Bhat
> *Sent:* Tuesday, February 7, 2023 10:02 AM
> *To:* dev@iceberg.apache.org
>
ike to avoid
compaction change the sequence number or modified time stats.
Thanks,
Mayur
From: Ajantha Bhat
Sent: Tuesday, February 7, 2023 10:02 AM
To: dev@iceberg.apache.org
Subject: Re: [Proposal] Partition stats in Iceberg
Hi Hrishi and Mayur, thanks for the inputs.
To get things moving
cy requirements.
>
>
>
> Is partition stats a good place for storing last-modified-time per
> partition?
>
>
>
> Thanks,
>
> Mayur
>
>
>
> *From:* Ajantha Bhat
> *Sent:* Monday, January 23, 2023 11:56 AM
> *To:* dev@iceberg.apache.org
>
-time per partition?
Thanks,
Mayur
From: Ajantha Bhat
Sent: Monday, January 23, 2023 11:56 AM
To: dev@iceberg.apache.org
Subject: Re: [Proposal] Partition stats in Iceberg
Hi All,
In the same design document
(https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit
Hi All,
In the same design document (
https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
),
I have added a section called
*"Design for approval".* It also contains a potential PR breakdown for the
phase 1 implementation and future development scope.
Pl
A big thanks to everyone who was involved in the review and the discussions
so far.
Please find the meeting minutes from the last iceberg sync about the
partition stats.
a. Writers should not write the partition stats or any stats as of now.
Because it requires bumping the spec to V3.
Hi Ryan,
are you saying that you think the partition-level stats should not be
> required? I think that would be best.
I think there is some confusion here. Partition-level stats are
required (hence the proposal).
But does the writer always write it? (with the append/delete/replace
operation)
or
Ajantha, are you saying that you think the partition-level stats should not
be required? I think that would be best.
I’m all for improving the interface for retrieving stats. It’s a separate
issue, but I think that Iceberg should provide both access to the Puffin
files and metadata as well as a hi
Hi Ryan,
Thanks a lot for the review and suggestions.
but I think there is also a decision that we need to make before that:
> Should Iceberg require writers to maintain the partition stats?
I think I would prefer to take a lazy approach and not assume that writers
> will keep the partition stats
Thanks for writing this up, Ajantha! I think that we have all the upstream
pieces in place to work on this so it's great to have a proposal.
The proposal does a good job of summarizing the choices for how to store
the data, but I think there is also a decision that we need to make before
that: Sho
Thanks Piotr for taking a look at it.
I have replied to all the comments in the document.
I might need your support in standardising the existing `StatisticsFile`
interface to adopt partition stats as mentioned in the design.
*We do need more eyes on the design. Once I get approval for the desig
Hi Ajantha,
this is a very interesting document, thank you for your work on this!
I've added a few comments there.
I have one high-level design comment so I thought it would be nicer to
everyone if I re-post it here
is "partition" the right level of keeping the stats?
> We do this in Hive, but was
Hi Community,
I did a proposal write-up for the partition stats in Iceberg.
Please have a look and let me know what you think. I would like to work on
it.
https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing
Requirement background snippet from the above