Re: [Proposal] Partition stats in Iceberg

2023-10-11 Thread Ajantha Bhat
> >> writing in the long run. > > >> > > >> Thanks, > > >> Ajantha > > >> > > >> On Wed, May 17, 2023 at 1:38 AM Mayur Srivastava < > > >> mayur.srivast...@twosigma.com> wrote: > > >> > > >>> I agree, it tot

Re: [Proposal] Partition stats in Iceberg

2023-10-11 Thread Anton Okolnychyi
dified time” per > >>> partition is implemented. > >>> > >>> I’m concerned about performance of computing partition stats (and > >>> storage + the size of table metadata files) if the implementation requires > >>> users to keep around all snapsh

Re: [Proposal] Partition stats in Iceberg

2023-10-11 Thread Ajantha Bhat
agree, it totally depends on the way “last modified time” per >>> partition is implemented. >>> >>> I’m concerned about performance of computing partition stats (and >>> storage + the size of table metadata files) if the implementation requires >>> users to k

Re: [Proposal] Partition stats in Iceberg

2023-05-22 Thread Ryan Blue
erned about performance of computing partition stats (and storage >> + the size of table metadata files) if the implementation requires users to >> keep around all snapshots. (I described one of my use cases in this thread >> earlier.) >> >> >> >> *Fr

Re: [Proposal] Partition stats in Iceberg

2023-05-22 Thread Ajantha Bhat
he implementation requires users to > keep around all snapshots. (I described one of my use cases in this thread > earlier.) > > > > *From:* Pucheng Yang > *Sent:* Monday, May 15, 2023 11:46 AM > *To:* dev@iceberg.apache.org > *Subject:* Re: [Proposal] Partition stats in

RE: [Proposal] Partition stats in Iceberg

2023-05-16 Thread Mayur Srivastava
case in this thread earlier.) From: Pucheng Yang Sent: Monday, May 15, 2023 11:46 AM To: dev@iceberg.apache.org Subject: Re: [Proposal] Partition stats in Iceberg Hi Mayur, can you elaborate on your concern? I don't know how this is going to be implemented, so I'm not sure where the performance iss

Re: [Proposal] Partition stats in Iceberg

2023-05-15 Thread Pucheng Yang
*From:* Ryan Blue > *Sent:* Wednesday, May 3, 2023 2:00 PM > *To:* dev@iceberg.apache.org > *Subject:* Re: [Proposal] Partition stats in Iceberg > > > > Mayur, your use case may require a lot of snapshots, but we generally > recommend expiring them after a few days. You ca

RE: [Proposal] Partition stats in Iceberg

2023-05-15 Thread Mayur Srivastava
Blue Sent: Wednesday, May 3, 2023 2:00 PM To: dev@iceberg.apache.org Subject: Re: [Proposal] Partition stats in Iceberg Mayur, your use case may require a lot of snapshots, but we generally recommend expiring them after a few days. You can tag snapshots to keep them around longer than that. On

Re: [Proposal] Partition stats in Iceberg

2023-05-03 Thread Ryan Blue
>>> use the change log (CDC) but I think that is too heavy (I guess, since it >>>>>> requires running a SparkSQL procedure) and it does more work than needed (I don't >>>>>> need to know which rows changed, I just need true or false for whether a >>>>>

Re: [Proposal] Partition stats in Iceberg

2023-05-02 Thread Mayur Srivastava
>>>> requires running a SparkSQL procedure) and it does more work than needed (I don't >>>>> need to know which rows changed, I just need true or false for whether a >>>>> partition is changed). >>>>> >>>>> Thanks >>>>> >>>>

Re: [Proposal] Partition stats in Iceberg

2023-05-02 Thread Szehon Ho
>> >>>>> Thanks Ajantha. >>>>> >>>>> >>>>> >>>>> > It should be very easy to add a few more fields to it like the >>>>> latest sequence number or last modified time per partition. >>>>> >>>>

Re: [Proposal] Partition stats in Iceberg

2023-05-02 Thread Pucheng Yang
>>> >>>> Thanks Ajantha. >>>> >>>> >>>> >>>> > It should be very easy to add a few more fields to it like the latest >>>> sequence number or last modified time per partition. >>>> >>>> >>>

Re: [Proposal] Partition stats in Iceberg

2023-05-02 Thread Mayur Srivastava
> add a few more fields to it like the latest sequence number or last >>>> modified time per partition. >>>> I will be opening up the discussion about phase 2 schema again once >>>> phase 1 implementation is done. >>>> >>>> Thanks, >>

Re: [Proposal] Partition stats in Iceberg

2023-05-02 Thread Szehon Ho
>>> likely to be available in Iceberg partition stats? Note that we would like >>> to avoid having compaction change the sequence number or modified time stats. >>> >>> >>> >>> Thanks, >>> >>> Mayur >>> >>> >>>

Re: [Proposal] Partition stats in Iceberg

2023-05-02 Thread Ryan Blue
e the sequence number or modified time stats. >> >> >> >> Thanks, >> >> Mayur >> >> >> >> *From:* Ajantha Bhat >> *Sent:* Tuesday, February 7, 2023 10:02 AM >> *To:* dev@iceberg.apache.org >> *Subject:* Re: [Proposal] Parti

Re: [Proposal] Partition stats in Iceberg

2023-04-30 Thread Ajantha Bhat
change the sequence number or modified time stats. >> >> >> >> Thanks, >> >> Mayur >> >> >> >> *From:* Ajantha Bhat >> *Sent:* Tuesday, February 7, 2023 10:02 AM >> *To:* dev@iceberg.apache.org >> *Subject:* Re: [Proposal

Re: [Proposal] Partition stats in Iceberg

2023-04-28 Thread Pucheng Yang
in Iceberg partition stats? Note that we would like > to avoid having compaction change the sequence number or modified time stats. > > > > Thanks, > > Mayur > > > > *From:* Ajantha Bhat > *Sent:* Tuesday, February 7, 2023 10:02 AM > *To:* dev@iceberg.apache.org >

RE: [Proposal] Partition stats in Iceberg

2023-02-07 Thread Mayur Srivastava
ike to avoid having compaction change the sequence number or modified time stats. Thanks, Mayur From: Ajantha Bhat Sent: Tuesday, February 7, 2023 10:02 AM To: dev@iceberg.apache.org Subject: Re: [Proposal] Partition stats in Iceberg Hi Hrishi and Mayur, thanks for the inputs. To get things moving

Re: [Proposal] Partition stats in Iceberg

2023-02-07 Thread Ajantha Bhat
cy requirements. > > > > Is partition stats a good place for storing last-modified-time per > partition? > > > > Thanks, > > Mayur > > > > *From:* Ajantha Bhat > *Sent:* Monday, January 23, 2023 11:56 AM > *To:* dev@iceberg.apache.org >

RE: [Proposal] Partition stats in Iceberg

2023-02-07 Thread Mayur Srivastava
-time per partition? Thanks, Mayur From: Ajantha Bhat Sent: Monday, January 23, 2023 11:56 AM To: dev@iceberg.apache.org Subject: Re: [Proposal] Partition stats in Iceberg Hi All, In the same design document (https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit

Re: [Proposal] Partition stats in Iceberg

2023-01-23 Thread Ajantha Bhat
Hi All, In the same design document ( https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing ), I have added a section called *"Design for approval". *It also contains a potential PR breakdown for the phase 1 implementation and future development scope. Pl

Re: [Proposal] Partition stats in Iceberg

2022-12-05 Thread Ajantha Bhat
A big thanks to everyone who was involved in the review and the discussions so far. Please find the meeting minutes from the last Iceberg sync about the partition stats. a. Writers should not write the partition stats or any stats as of now, because that requires bumping the spec to V3.

Re: [Proposal] Partition stats in Iceberg

2022-11-25 Thread Ajantha Bhat
Hi Ryan, > are you saying that you think the partition-level stats should not be > required? I think that would be best. I think there is some confusion here. Partition-level stats are required (hence the proposal). But does the writer always write it? (with the append/delete/replace operation) or

Re: [Proposal] Partition stats in Iceberg

2022-11-25 Thread Ryan Blue
Ajantha, are you saying that you think the partition-level stats should not be required? I think that would be best. I’m all for improving the interface for retrieving stats. It’s a separate issue, but I think that Iceberg should provide both access to the Puffin files and metadata as well as a hi

Re: [Proposal] Partition stats in Iceberg

2022-11-24 Thread Ajantha Bhat
Hi Ryan, Thanks a lot for the review and suggestions. > but I think there is also a decision that we need to make before that: > Should Iceberg require writers to maintain the partition stats? > I think I would prefer to take a lazy approach and not assume that writers > will keep the partition stats

Re: [Proposal] Partition stats in Iceberg

2022-11-23 Thread Ryan Blue
Thanks for writing this up, Ajantha! I think that we have all the upstream pieces in place to work on this so it's great to have a proposal. The proposal does a good job of summarizing the choices for how to store the data, but I think there is also a decision that we need to make before that: Sho

Re: [Proposal] Partition stats in Iceberg

2022-11-23 Thread Ajantha Bhat
Thanks Piotr for taking a look at it. I have replied to all the comments in the document. I might need your support in standardising the existing `StatisticsFile` interface to adopt partition stats as mentioned in the design. *We do need more eyes on the design. Once I get approval for the desig

Re: [Proposal] Partition stats in Iceberg

2022-11-23 Thread Piotr Findeisen
Hi Ajantha, this is a very interesting document, thank you for your work on this! I've added a few comments there. I have one high-level design comment, so I thought it would be nicer to everyone if I re-post it here: is "partition" the right level of keeping the stats? > We do this in Hive, but was

[Proposal] Partition stats in Iceberg

2022-11-14 Thread Ajantha Bhat
Hi Community, I did a proposal write-up for the partition stats in Iceberg. Please have a look and let me know what you think. I would like to work on it. https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk/edit?usp=sharing Requirement background snippet from the above