Re: Proposal: Introduce deletion vector file to reduce write amplification

2023-10-11 Thread Renjie Liu
Hi, Russell:


> The main things I’m still interested in are alternative approaches. I think
> that some of the work that Anton is working on has shown some different
> bottlenecks in applying delete files that I’m not sure are addressed by
> this proposal.


I'm also interested. Could you share some resources on the work that Anton
is doing? I wasn't aware of it.

> For example, this proposal suggests doing a 1-to-1 (or 1-rowgroup-to-1)
> delete file application in order to speed up planning. But this could also be
> done with a puffin file indexing delete files to data files. This would
> eliminate any planning cost while also allowing us to do more complicated
> things like mapping multiple data files to a single delete file, as well as
> operate on a one-to-many data file to delete file approach. Doing this
> approach would mean we wouldn’t need to change any existing metadata or
> introduce a new separate file type.


Yes, we can improve planning performance by embedding the mapping in a
puffin file. But I guess this may introduce other problems, such as conflicts
when doing commits? IIUC, puffin files are used as a table-level index or
statistics.

> I would also expect some POC experiments showing that the Spec is getting
> the benefits that are hypothesized.


I will conduct some POC experiments with actual data, but it may take some
time to implement.
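
As a rough illustration of the writer side of such a POC, here is a minimal
sketch using the RoaringBitmap Java library (the method name and byte layout
are placeholders of mine, not the final spec, and it assumes row positions fit
in 32 bits):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.roaringbitmap.RoaringBitmap;

// Sketch only: serialize the deleted row positions of one data file using the
// portable roaring bitmap format (implementations exist for Java, Go, Rust, C).
static byte[] serializeDeletionVector(int[] deletedPositions) throws IOException {
  RoaringBitmap bitmap = new RoaringBitmap();
  for (int pos : deletedPositions) {
    bitmap.add(pos); // 64-bit positions would need Roaring64NavigableMap instead
  }
  bitmap.runOptimize(); // switch to run-length containers where that is smaller
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  bitmap.serialize(new DataOutputStream(out));
  return out.toByteArray();
}

Since the bytes are in the portable format, readers in the other language
implementations can deserialize them directly.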

> I think the proposal also needs to address any possible limitations with
> this approach. They don’t all need to be solved, but we should at least
> begin exploring them. As a quick example, how does using single delete
> files interact with our commit logic? I would guess that a single delete
> file approach would make it more difficult to perform multiple deletes
> concurrently?


Good suggestion. I'm working on updating the doc to complement it with
sketches for DML operations. IIUC, regarding potential conflicts when
performing multiple deletes concurrently, you mean concurrent writes from
different DML jobs? If so, I think the current solution has the same problem,
since these are in fact conflicts from concurrent updates. But I do admit that
the deletion vector approach makes conflicts easier to hit, since it operates
at the file level.



On Tue, Oct 10, 2023 at 8:54 AM Russell Spitzer 
wrote:

> The main things I’m still interested in are alternative approaches. I think
> that some of the work that Anton is working on has shown some different
> bottlenecks in applying delete files that I’m not sure are addressed by
> this proposal.
>
> For example, this proposal suggests doing a 1-to-1 (or 1-rowgroup-to-1)
> delete file application in order to speed up planning. But this could also be
> done with a puffin file indexing delete files to data files. This would
> eliminate any planning cost while also allowing us to do more complicated
> things like mapping multiple data files to a single delete file, as well as
> operate on a one-to-many data file to delete file approach. Doing this
> approach would mean we wouldn’t need to change any existing metadata or
> introduce a new separate file type.
>
> I think basically for every “benefit” outlined we should think about whether
> there is an alternative approach that would achieve the same benefit. Then
> we should analyze whether or not the proposal is the best solution for
> that particular benefit and do some work to calculate what that benefit
> would be and what drawbacks there might be.
>
> I would also expect some POC experiments showing that the Spec is getting
> the benefits that are hypothesized.
>
> I think the proposal also needs to address any possible limitations with
> this approach. They don’t all need to be solved, but we should at least
> begin exploring them. As a quick example, how does using single delete
> files interact with our commit logic? I would guess that a single delete
> file approach would make it more difficult to perform multiple deletes
> concurrently?
>
> Sent from my iPad
>
> On Oct 8, 2023, at 9:22 PM, Renjie Liu  wrote:
>
> 
> Hi, Ryan:
> Thanks for your reply.
>
> 1. What is the exact file format for these on disk that you're proposing?
>> Even if you're saying that it is what is produced by roaring bitmap, we
>> need more information. Is that a portable format? Do you wrap it at all in
>> the file to carry extra metadata? For example, the proposal says that a
>> starting position for a bitmap would be used. Where is that stored?
>
>
> > Sorry for the confusion: by file format I mean roaring bitmap's file
> > format. I checked that it has been implemented in several languages, such
> > as Java, Go, Rust, and C. Metadata will be stored in the manifest file like
> > other entries, such as data files and delete files. The starting position
> > doesn't need to be stored since it's used by the file reader. I think your
> > suggestion to provide an interface in the design will make things clearer,
> > and I will add it to the design doc.
>
> 2. How would DML operations work? Just a

Re: Scan column metrics

2023-10-11 Thread Péter Váry
Based on our discussion here, I have created a PR for the feature:
https://github.com/apache/iceberg/pull/8803

I think this is not a big change, and the flexibility/reduced memory
consumption would be worth the additional complexity.
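
To make the direction concrete, the per-column filtering discussed further
down in this thread looks roughly like the sketch below (illustrative only;
the actual code is in the PR):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: copy a per-column stats map, keeping only the requested column ids.
static <T> Map<Integer, T> filterStats(Map<Integer, T> stats, Set<Integer> requestedColumnIds) {
  if (stats == null || requestedColumnIds == null) {
    return stats; // nothing to filter
  }
  Map<Integer, T> filtered = new HashMap<>();
  for (Map.Entry<Integer, T> entry : stats.entrySet()) {
    if (requestedColumnIds.contains(entry.getKey())) {
      filtered.put(entry.getKey(), entry.getValue());
    }
  }
  return filtered;
}

Doing this once per data file trades a small amount of CPU for keeping only
the stats entries the reader actually asked for.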

Please review the PR to see for yourselves :)

Thanks,
Peter

Manish Malhotra wrote (on Tue, Oct 10, 2023, 17:02):

> Thanks Ryan,
>
> Good point, that makes sense.
>
> Though +1 for the feature.
>
> We can avoid it during ingestion as well, though we might need the stats
> some time later, so having options during reading will help.
>
> Thanks,
> Manish
>
> On Mon, Oct 9, 2023 at 4:38 PM Ryan Blue  wrote:
>
>> For that use case, it sounds like you'd be much better off not storing
>> all the stats rather than skipping them at read time. I understand the user
>> wants to keep them, but it may still not be a great choice. I'm just
>> worried that this is going to be a lot of effort for you that doesn't
>> really generalize.
>>
>> That said, if you're convinced that this is the right path I think it
>> would be _nice_ to have it in. We can always reduce memory consumption!
>>
>> On Mon, Oct 9, 2023 at 5:21 AM Péter Váry 
>> wrote:
>>
>>> The owner of the table wanted to keep the column stats for all of the
>>> columns, claiming that other users might be (or are) using the statistics of
>>> the columns. Even if I am not sure that their case was defensible, I think
>>> the reader of the table is often not in a position to optimize the table for
>>> their own usage. Even in this case we should provide tools for them to
>>> achieve as much as possible.
>>>
>>> Since the statistics values are stored in the BaseFile in a map, every
>>> column (~400 in our case) added an entry (~100 bytes) to each of the stat
>>> fields
>>> (columnSizes/valueCounts/nullValueCounts/nanValueCounts/lowerBounds/upperBounds
>>> - 6x100 = 600 bytes per column). As a result we ended up with each
>>> GenericDataFile costing us ~120 KB (some columns did not have all of the
>>> stats present). Almost all of this was statistics related, and most of it is
>>> totally unnecessary for our reader. Additionally, we had 67k splits, and the
>>> result was ~8 GB of unnecessary memory consumption.
>>>
>>> I think if we implement the filtering in the BaseFile constructor, where
>>> we do the column stats removal, then we can create a reasonably fast copy
>>> of the map which contains only the required column stats. In this case we
>>> would trade a bit of CPU for lower memory consumption.
>>>
>>> WDYT?
>>>
>>> Thanks,
>>> Peter
>>>
>>> Steven Wu wrote (on Sat, Oct 7, 2023, 17:04):
>>>
 I am sure dropping column stats can be helpful. Just that it has some
 challenges in practice. It requires table owners to know the query pattern
 and decide what column stats to keep and what to drop. While automation can
 help ease the decisions based on the query history, it can't predict future
 usages. When new column stats become useful later, we would need to
 backfill/rebuild the new column stats for existing data files.

 Engines (like Trino) know which columns are used in filter and join
 expressions. Query engines know precisely what column stats are needed.

 On Fri, Oct 6, 2023 at 8:59 AM Ryan Blue  wrote:

> I understand wanting to keep more in general, that's why we have the
> 100 column threshold set fairly high. But in the case you're describing
> those column stats are causing a problem. I'd expect you to be able to drop
> some of them on such a large table to solve the problem, rather than filter
> them out (which is actually not very quick anyway). Have you tried that?
> Why doesn't it work?
>
> On Thu, Oct 5, 2023 at 7:57 PM Steven Wu  wrote:
>
>> It is definitely good to only track column stats that are used.
>> Otherwise, we are just creating wasteful metadata that can increase
>> manifest file size and slow down scan planning. If a table has 100 
>> columns,
>> it is very unlikely we need stats for all columns.
>>
>> But in practice, it is a bit difficult to predict which column stats
>> will be useful for future queries. We can analyze query patterns and
>> autotune the decision of selectively enabling column stats. However,
>>  if we need to enable new column stats, backfilling column stats for
>> existing data files would require expensive rewrites. Note that Iceberg's
>> default behavior is to track column stats up to the first 100 columns.
>>
>>
>> It would be interesting to learn from the community about column stats tuning.
>>
>> On Thu, Oct 5, 2023 at 5:38 PM Ryan Blue  wrote:
>>
>>> I can think of situations in which you may want something like this,
>>> but I'm curious what other options you've used to solve the problem. 
>>> This
>>> seems like exactly what `write.metadata.metrics.*` was intended to solve
>>> and I'm a bit surprise

Re: Proposal: Introduce deletion vector file to reduce write amplification

2023-10-11 Thread Anton Okolnychyi
I tried to summarize notes from our previous discussions here:
https://docs.google.com/document/d/1M4L6o-qnGRwGhbhkW8BnravoTwvCrJV8VvzVQDRJO5I/

I am going to iterate on the doc later today.

On 2023/10/11 07:06:07 Renjie Liu wrote:
> Hi, Russell:
> 
> 
> > The main things I’m still interested are alternative approaches. I think
> > that some of the work that Anton is working on have shown some different
> > bottlenecks in applying delete files that I’m not sure are addressed by
> > this proposal.
> 
> 
> I'm also interested. Could you share some resources on the work that Anton
> is working? I didn't notice that.
> 
> For example, this proposal suggests doing a 1 to 1 (or 1 rowgroup to 1)
> > delete file application in order to speed up planning. But this could as be
> > done with a puffin file indexing delete files to data files. This would
> > eliminate any planning cost while also allowing us to do more complicated
> > things like mapping multiple data files to a single delete file as well as
> > operate on a one to many data file to delete file approach. Doing this
> > approach would mean we would need to change any existing metadata or
> > introduce a new separate file type.
> 
> 
> Yes, we can improve planning performance by embedding the mapping in a
> puffin file. But I guess this may introduce other problems like conflicting
> when doing commits? IIUC, puffin file is used as table level index or
> statistics.
> 
> I would also expect some POC experiments showing that the Spec is getting
> > the benefit’s that are hypothesized.
> 
> 
>  I will conduct some poc experiments with actual data, but it may take some
> time to implement it.
> 
> The proposal I think also needs to address any possible limitations with
> > this approach. They don’t all need to be solved but we should at least
> > being exploring them. As a quick example, how does using single delete
> > files interact with our commit logic? I would guess that a single delete
> > file approach would make it more difficult to perform multiple deletes
> > concurrently?
> 
> 
> Good suggestion, I'm working on updating the doc to completement sketches
> for dml operations. IIUC, for potential conflicts in performing multiple
> deletes concurrently, you mean concurrent writes from different dml jobs?
> If so, I think the current solution still has the same problem since this
> is in fact conflicts from concurrent updates. But I do admit that the
> deletion vector approach makes confliction easier since it's file level.
> 
> 
> 
> On Tue, Oct 10, 2023 at 8:54 AM Russell Spitzer 
> wrote:
> 
> > The main things I’m still interested are alternative approaches. I think
> > that some of the work that Anton is working on have shown some different
> > bottlenecks in applying delete files that I’m not sure are addressed by
> > this proposal.
> >
> > For example, this proposal suggests doing a 1 to 1 (or 1 rowgroup to 1)
> > delete file application in order to speed up planning. But this could as be
> > done with a puffin file indexing delete files to data files. This would
> > eliminate any planning cost while also allowing us to do more complicated
> > things like mapping multiple data files to a single delete file as well as
> > operate on a one to many data file to delete file approach. Doing this
> > approach would mean we would need to change any existing metadata or
> > introduce a new separate file type.
> >
> > I think basically for every “benefit” outlined we should think about if
> > there is an alternative approach that would achieve the same benefit. Then
> > we should analyze or whether or not the proposal is the best solution for
> > that particular benefit and do some work to calculate what that benefit
> > would be and what drawbacks there might be.
> >
> > I would also expect some POC experiments showing that the Spec is getting
> > the benefit’s that are hypothesized.
> >
> > The proposal I think also needs to address any possible limitations with
> > this approach. They don’t all need to be solved but we should at least
> > being exploring them. As a quick example, how does using single delete
> > files interact with our commit logic? I would guess that a single delete
> > file approach would make it more difficult to perform multiple deletes
> > concurrently?
> >
> > Sent from my iPad
> >
> > On Oct 8, 2023, at 9:22 PM, Renjie Liu  wrote:
> >
> > 
> > Hi, Ryan:
> > Thanks for your reply.
> >
> > 1. What is the exact file format for these on disk that you're proposing?
> >> Even if you're saying that it is what is produced by roaring bitmap, we
> >> need more information. Is that a portable format? Do you wrap it at all in
> >> the file to carry extra metadata? For example, the proposal says that a
> >> starting position for a bitmap would be used. Where is that stored?
> >
> >
> > Sorry for the confusion, by file format I mean roaring bitmap's file
> > format .
> > I

Re: [Proposal] Partition stats in Iceberg

2023-10-11 Thread Ajantha Bhat
Hi All,

As per the above proposal, I have worked on a POC (
https://github.com/apache/iceberg/pull/8488).

*But to move things forward, first we need to merge the spec PR
(https://github.com/apache/iceberg/pull/7105).* I don't see any blocker
for the spec. Please review and approve it if it looks ok.

One topic that came up during the review is whether to write stats
synchronously or asynchronously.
My suggestion is that we need to support both. I think we can first have an
async writing implementation.
But we also need to support writing stats synchronously with each write
(controlled by a table property).
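
From the user's side, opting in could look roughly like this (the property
name below is only a placeholder I'm using for illustration, not an existing
or agreed-upon property):

import org.apache.iceberg.Table;

// Sketch: opt a table in to synchronous partition stats writing.
// "write.partition-stats.enabled" is a placeholder name, not an existing property.
static void enableSyncPartitionStats(Table table) {
  table.updateProperties()
      .set("write.partition-stats.enabled", "true")
      .commit();
}

Writers would check the property and skip the stats work entirely when it is
disabled.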

Some engines, like Trino and Dremio, can make use of synchronous writing of
the stats.
Currently, Puffin stats are also written synchronously from Trino.

Thanks,
Ajantha


On Mon, May 22, 2023 at 10:15 PM Ryan Blue  wrote:

> Thanks, Ajantha. I think it's safe to say that we should continue assuming
> that we will have one partition stats file. I agree that it should be small
> and we don't want to block the progress here.
>
> On Mon, May 22, 2023 at 5:07 AM Ajantha Bhat 
> wrote:
>
>> Hi Anton and Ryan,
>>
>> The Partition stats spec PR  
>> didn't
>> move forward as Anton wanted to conduct some experiments to conclude
>> whether single-file writing or multiple files is better.
>> I conducted the experiments myself and attached some numbers in the PR.
>>
>> I would like to take this forward.
>> Please let me know what you think (can comment on the PR).
>>
>> As the output file is very small and initially the stats are computed
>> asynchronously,
>> I think writing them as a single file should be good enough.
>> In the future, if we need faster stats writing (along with each write
>> operation) we can also implement multiple stats files.
>>
>> Just like how copy-on-write and merge-on-read are serving their use cases
>> in Iceberg,
>> we might have to support both single-file writing and multiple-file
>> writing in the long run.
>>
>> Thanks,
>> Ajantha
>>
>> On Wed, May 17, 2023 at 1:38 AM Mayur Srivastava <
>> mayur.srivast...@twosigma.com> wrote:
>>
>>> I agree, it totally depends on the way “last modified time” per
>>> partition is implemented.
>>>
>>> I’m concerned about performance of computing partition stats (and
>>> storage + the size of table metadata files) if the implementation requires
>>> users to keep around all snapshots. (I described one of my use cases in this
>>> thread earlier.)
>>>
>>>
>>>
>>> *From:* Pucheng Yang 
>>> *Sent:* Monday, May 15, 2023 11:46 AM
>>> *To:* dev@iceberg.apache.org
>>> *Subject:* Re: [Proposal] Partition stats in Iceberg
>>>
>>>
>>>
>>> Hi Mayur, can you elaborate on your concern? I don't know how this is going
>>> to be implemented, so I'm not sure where the performance issue is.
>>>
>>>
>>>
>>> On Mon, May 15, 2023 at 7:55 AM Mayur Srivastava <
>>> mayur.srivast...@twosigma.com> wrote:
>>>
>>> Thanks Ryan.
>>>
>>> For most partition stats, I’m ok with compaction and keeping fewer
>>> snapshots. My concern was for supporting last modified time. I guess, if we
>>> need to keep all snapshots to support last modified time, it could have an
>>> impact on metadata access performance.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Mayur
>>>
>>>
>>>
>>> *From:* Ryan Blue 
>>> *Sent:* Wednesday, May 3, 2023 2:00 PM
>>> *To:* dev@iceberg.apache.org
>>> *Subject:* Re: [Proposal] Partition stats in Iceberg
>>>
>>>
>>>
>>> Mayur, your use case may require a lot of snapshots, but we generally
>>> recommend expiring them after a few days. You can tag snapshots to keep
>>> them around longer than that.
>>>
>>>
>>>
>>> On Tue, May 2, 2023 at 4:52 PM Mayur Srivastava <
>>> mayur.p.srivast...@gmail.com> wrote:
>>>
>>> Thanks for the response.
>>>
>>> One of the use cases that we have is where one business day of data is
>>> added at a time to a DAY partitioned table. With 25 years of this data,
>>> there will be ~6250 partitions and as many snapshots. Is it recommended to
>>> keep this many snapshots around?
>>>
>>>
>>>
>>> On Tue, May 2, 2023 at 7:45 PM Szehon Ho 
>>> wrote:
>>>
>>>
>>>
>>> Does snapshot expiration need to be disabled for this to work? Thanks,
>>> Mayur
>>>
>>>
>>> Yes, the snapshot that last updated the partition needs to be around for
>>> this to work.
>>>
>>>
>>>
>>> Szehon, the query you shared requires a SparkSQL job to be run, which
>>> means latency will be high. However, I am glad you are also thinking of
>>> adding these directly to the partition table and it seems we share the same
>>> interests.
>>>
>>>
>>> Yea the partitions table currently still goes through SparkSQL, so it
>>> will be the same.  Maybe you mean add this to partition stats?  We do need
>>> to reconcile partition table and partition stats at some point though.  Not
>>> sure if it was designed/discussed yet; I think there were some thoughts on
>>> short-circuiting the Partitions table to read from Partition stats, if stats
>>> exist for the current snaps

Re: [Proposal] Partition stats in Iceberg

2023-10-11 Thread Anton Okolnychyi
I think the question is what we mean by doing this synchronously. For instance, 
I have doubts it would be a good idea to do this in each commit attempt unless 
we can prove the overhead is negligible with a benchmark. We risk failing a job 
for the sake of updating partition stats. I can see the need to update the 
stats immediately after an operation. However, why not commit first and then do 
a post-commit action to update the stats? If it succeeds, good. If not, we will
try again next time.
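
In code, the shape of that idea is roughly the following (just a sketch; the
statsUpdate callback stands in for whatever computes and registers the
partition stats file, and is not an existing API):

import org.apache.iceberg.AppendFiles;

// Sketch: the data commit is never blocked by stats; the stats update is a
// best-effort, metadata-only follow-up that can be retried later.
static void commitThenUpdateStats(AppendFiles append, Runnable statsUpdate) {
  append.commit();
  try {
    statsUpdate.run();
  } catch (RuntimeException e) {
    // swallow the failure; the next operation or an async job can retry
    System.err.println("Partition stats update failed, will retry later: " + e);
  }
}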

On 2023/10/11 17:14:12 Ajantha Bhat wrote:
> Hi All,
> 
> As per the above proposal, I have worked on a POC (
> https://github.com/apache/iceberg/pull/8488).
> 
> *But to move things forward, first we need to merge the spec PR
> (https://github.com/apache/iceberg/pull/7105
> ). *I don't see any blocker
> for the spec. Please review and approve if it is ok.
> 
> One topic that came up during the review is whether to write stats
> synchronously or asynchronously.
> My suggestion is that we need to support both. I think we can first have an
> async writing implementation.
> But we also need to support sync stats writing with writes (controlled by a
> table property).
> 
> Some engines like Trino, Dremio can make use of the sync writing of the
> stats.
> Currently Puffin stats also supports sync writing from Trino.
> 
> Thanks,
> Ajantha
> 
> 
> On Mon, May 22, 2023 at 10:15 PM Ryan Blue  wrote:
> 
> > Thanks, Ajantha. I think it's safe to say that we should continue assuming
> > that we will have one partition stats file. I agree that it should be small
> > and we don't want to block the progress here.
> >
> > On Mon, May 22, 2023 at 5:07 AM Ajantha Bhat 
> > wrote:
> >
> >> Hi Anton and Ryan,
> >>
> >> The Partition stats spec PR  
> >> didn't
> >> move forward as Anton wanted to conduct some experiments to conclude
> >> whether single-file writing or multiple files is better.
> >> I conducted the experiments myself and attached some numbers in the PR.
> >>
> >> I would like to take this forward.
> >> Please let me know what you think (can comment on the PR).
> >>
> >> As the output file is very small and initially the stats are computed
> >> asynchronously,
> >> I think writing them as a single file should be good enough.
> >> In future, If we need faster stats writing (along with each write
> >> operation) we can also implement multiple stats files.
> >>
> >> Just like how copy-on-write and merge-on-read are serving their use cases
> >> in Iceberg,
> >> we might have to support both single-file writing and multiple-file
> >> writing in the long run.
> >>
> >> Thanks,
> >> Ajantha
> >>
> >> On Wed, May 17, 2023 at 1:38 AM Mayur Srivastava <
> >> mayur.srivast...@twosigma.com> wrote:
> >>
> >>> I agree, it totally depends on the way “last modified time” per
> >>> partition is implemented.
> >>>
> >>> I’m concerned about performance of computing partition stats (and
> >>> storage + the size of table metadata files) if the implementation requires
> >>> users to keep around all snapshots. (I described one of my use case in 
> >>> this
> >>> thread earlier.)
> >>>
> >>>
> >>>
> >>> *From:* Pucheng Yang 
> >>> *Sent:* Monday, May 15, 2023 11:46 AM
> >>> *To:* dev@iceberg.apache.org
> >>> *Subject:* Re: [Proposal] Partition stats in Iceberg
> >>>
> >>>
> >>>
> >>> Hi Mayur, can you elaborate your concern? I don't know how this is going
> >>> to be implemented so not sure where the performance issue is.
> >>>
> >>>
> >>>
> >>> On Mon, May 15, 2023 at 7:55 AM Mayur Srivastava <
> >>> mayur.srivast...@twosigma.com> wrote:
> >>>
> >>> Thanks Ryan.
> >>>
> >>> For most partition stats, I’m ok with compaction and keeping fewer
> >>> snapshots. My concern was for supporting last modified time. I guess, if 
> >>> we
> >>> need to keep all snapshots to support last modified time, it could have
> >>> impact on metadata access performance.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> Mayur
> >>>
> >>>
> >>>
> >>> *From:* Ryan Blue 
> >>> *Sent:* Wednesday, May 3, 2023 2:00 PM
> >>> *To:* dev@iceberg.apache.org
> >>> *Subject:* Re: [Proposal] Partition stats in Iceberg
> >>>
> >>>
> >>>
> >>> Mayur, your use case may require a lot of snapshots, but we generally
> >>> recommend expiring them after a few days. You can tag snapshots to keep
> >>> them around longer than that.
> >>>
> >>>
> >>>
> >>> On Tue, May 2, 2023 at 4:52 PM Mayur Srivastava <
> >>> mayur.p.srivast...@gmail.com> wrote:
> >>>
> >>> Thanks for the response.
> >>>
> >>> One of the use cases that we have is where one business day of data is
> >>> added at a time to a DAY partitioned table. With 25 years of this data,
> >>> there will be ~6250 partitions and as many snapshots. Are these many
> >>> snapshots recommended to be kept around?
> >>>
> >>>
> >>>
> >>> On Tue, May 2, 2023 at 7:45 PM Szehon Ho 
> >>> wrote:
> >>>
> >>>
> >>>
> >>> Does snapshot expiration needs to be disabled f

Re: Proposal: Introduce deletion vector file to reduce write amplification

2023-10-11 Thread Renjie Liu
Hi, Anton:
I've gone through the doc, and we are trying to solve the same problems with
position deletes, but with different approaches. It's quite interesting.

On Thu, Oct 12, 2023 at 12:11 AM Anton Okolnychyi 
wrote:

> I tried to summarize notes from our previous discussions here:
>
> https://docs.google.com/document/d/1M4L6o-qnGRwGhbhkW8BnravoTwvCrJV8VvzVQDRJO5I/
>
> I am going to iterate on the doc later today.
>
> On 2023/10/11 07:06:07 Renjie Liu wrote:
> > Hi, Russell:
> >
> >
> > > The main things I’m still interested are alternative approaches. I
> think
> > > that some of the work that Anton is working on have shown some
> different
> > > bottlenecks in applying delete files that I’m not sure are addressed by
> > > this proposal.
> >
> >
> > I'm also interested. Could you share some resources on the work that
> Anton
> > is working? I didn't notice that.
> >
> > For example, this proposal suggests doing a 1 to 1 (or 1 rowgroup to 1)
> > > delete file application in order to speed up planning. But this could
> as be
> > > done with a puffin file indexing delete files to data files. This would
> > > eliminate any planning cost while also allowing us to do more
> complicated
> > > things like mapping multiple data files to a single delete file as
> well as
> > > operate on a one to many data file to delete file approach. Doing this
> > > approach would mean we would need to change any existing metadata or
> > > introduce a new separate file type.
> >
> >
> > Yes, we can improve planning performance by embedding the mapping in a
> > puffin file. But I guess this may introduce other problems like
> conflicting
> > when doing commits? IIUC, puffin file is used as table level index or
> > statistics.
> >
> > I would also expect some POC experiments showing that the Spec is getting
> > > the benefit’s that are hypothesized.
> >
> >
> >  I will conduct some poc experiments with actual data, but it may take
> some
> > time to implement it.
> >
> > The proposal I think also needs to address any possible limitations with
> > > this approach. They don’t all need to be solved but we should at least
> > > being exploring them. As a quick example, how does using single delete
> > > files interact with our commit logic? I would guess that a single
> delete
> > > file approach would make it more difficult to perform multiple deletes
> > > concurrently?
> >
> >
> > Good suggestion, I'm working on updating the doc to completement sketches
> > for dml operations. IIUC, for potential conflicts in performing multiple
> > deletes concurrently, you mean concurrent writes from different dml jobs?
> > If so, I think the current solution still has the same problem since this
> > is in fact conflicts from concurrent updates. But I do admit that the
> > deletion vector approach makes confliction easier since it's file level.
> >
> >
> >
> > On Tue, Oct 10, 2023 at 8:54 AM Russell Spitzer <
> russell.spit...@gmail.com>
> > wrote:
> >
> > > The main things I’m still interested are alternative approaches. I
> think
> > > that some of the work that Anton is working on have shown some
> different
> > > bottlenecks in applying delete files that I’m not sure are addressed by
> > > this proposal.
> > >
> > > For example, this proposal suggests doing a 1 to 1 (or 1 rowgroup to 1)
> > > delete file application in order to speed up planning. But this could
> as be
> > > done with a puffin file indexing delete files to data files. This would
> > > eliminate any planning cost while also allowing us to do more
> complicated
> > > things like mapping multiple data files to a single delete file as
> well as
> > > operate on a one to many data file to delete file approach. Doing this
> > > approach would mean we would need to change any existing metadata or
> > > introduce a new separate file type.
> > >
> > > I think basically for every “benefit” outlined we should think about if
> > > there is an alternative approach that would achieve the same benefit.
> Then
> > > we should analyze or whether or not the proposal is the best solution
> for
> > > that particular benefit and do some work to calculate what that benefit
> > > would be and what drawbacks there might be.
> > >
> > > I would also expect some POC experiments showing that the Spec is
> getting
> > > the benefit’s that are hypothesized.
> > >
> > > The proposal I think also needs to address any possible limitations
> with
> > > this approach. They don’t all need to be solved but we should at least
> > > being exploring them. As a quick example, how does using single delete
> > > files interact with our commit logic? I would guess that a single
> delete
> > > file approach would make it more difficult to perform multiple deletes
> > > concurrently?
> > >
> > > Sent from my iPad
> > >
> > > On Oct 8, 2023, at 9:22 PM, Renjie Liu 
> wrote:
> > >
> > > 
> > > Hi, Ryan:
> > > Thanks for your reply.
> > >
> > > 1. What is the exact file format for these on disk that you're
> proposing?
>

Re: [Proposal] Partition stats in Iceberg

2023-10-11 Thread Ajantha Bhat
>
> I think the question is what we mean by doing this synchronously. For
> instance, I have doubts it would be a good idea to do this in each commit
> attempt unless we can prove the overhead is negligible with a benchmark.


Yeah, I can share the benchmarks. As I also mentioned, synchronous writing
will be controlled by a table property.
If the users think it will impact their write performance, they can disable
it and compute stats only asynchronously.


> We risk failing a job for the sake of updating partition stats.


I think for synchronous writing there is no need to fail the job if the stats
write fails. Users can regenerate the stats asynchronously
if the stats file is not registered in the table metadata.

> I can see the need to update the stats immediately after an operation.
> However, why not commit first and then do a post-commit action to update
> the stats? If it succeeds, good. If not, we will try again next time.


This can be done. But we would have to do an extra round of IO to read the
manifests again (there can be too many partition values per operation, and we
cannot hold the partition stats info in the snapshot summary as it could bloat
the table metadata). Also, having a new commit from a post-commit action may be
confusing, because some commits will create stats and some commits will not.
And keeping concurrent commits in mind, the state of the table might be
different by the time we are committing just the stats.
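
For reference, that extra manifest pass looks roughly like the sketch below
(simplified, not the POC code; it only aggregates record counts from data
manifests and keys on the partition's string form, which a real implementation
would not do):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.ManifestContent;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.ManifestFiles;
import org.apache.iceberg.ManifestReader;
import org.apache.iceberg.Table;

// Sketch: aggregate per-partition record counts for the current snapshot by
// re-reading its data manifests (assumes the table has a current snapshot).
static Map<String, Long> partitionRecordCounts(Table table) throws IOException {
  Map<String, Long> counts = new HashMap<>();
  for (ManifestFile manifest : table.currentSnapshot().allManifests(table.io())) {
    if (manifest.content() != ManifestContent.DATA) {
      continue; // skip delete manifests in this simplified sketch
    }
    try (ManifestReader<DataFile> reader = ManifestFiles.read(manifest, table.io())) {
      for (DataFile file : reader) {
        counts.merge(file.partition().toString(), file.recordCount(), Long::sum);
      }
    }
  }
  return counts;
}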

All these points are about the synchronous write, which we can finalize by
discussing further. But the spec and async writing don't change based on any
of this. So, I suggest we first merge those and go ahead with the
implementation.

Thanks,
Ajantha

On Thu, Oct 12, 2023 at 6:44 AM Anton Okolnychyi 
wrote:

> I think the question is what we mean by doing this synchronously. For
> instance, I have doubts it would be a good idea to do this in each commit
> attempt unless we can prove the overhead is negligible with a benchmark. We
> risk failing a job for the sake of updating partition stats. I can see the
> need to update the stats immediately after an operation. However, why not
> commit first and then do a post-commit action to update the stats? If
> succeeds, good. If not, we will try again next time.
>
> On 2023/10/11 17:14:12 Ajantha Bhat wrote:
> > Hi All,
> >
> > As per the above proposal, I have worked on a POC (
> > https://github.com/apache/iceberg/pull/8488).
> >
> > *But to move things forward, first we need to merge the spec PR
> > (https://github.com/apache/iceberg/pull/7105
> > ). *I don't see any blocker
> > for the spec. Please review and approve if it is ok.
> >
> > One topic that came up during the review is whether to write stats
> > synchronously or asynchronously.
> > My suggestion is that we need to support both. I think we can first have
> an
> > async writing implementation.
> > But we also need to support sync stats writing with writes (controlled
> by a
> > table property).
> >
> > Some engines like Trino, Dremio can make use of the sync writing of the
> > stats.
> > Currently Puffin stats also supports sync writing from Trino.
> >
> > Thanks,
> > Ajantha
> >
> >
> > On Mon, May 22, 2023 at 10:15 PM Ryan Blue  wrote:
> >
> > > Thanks, Ajantha. I think it's safe to say that we should continue
> assuming
> > > that we will have one partition stats file. I agree that it should be
> small
> > > and we don't want to block the progress here.
> > >
> > > On Mon, May 22, 2023 at 5:07 AM Ajantha Bhat 
> > > wrote:
> > >
> > >> Hi Anton and Ryan,
> > >>
> > >> The Partition stats spec PR <
> https://github.com/apache/iceberg/pull/7105> didn't
> > >> move forward as Anton wanted to conduct some experiments to conclude
> > >> whether single-file writing or multiple files is better.
> > >> I conducted the experiments myself and attached some numbers in the
> PR.
> > >>
> > >> I would like to take this forward.
> > >> Please let me know what you think (can comment on the PR).
> > >>
> > >> As the output file is very small and initially the stats are computed
> > >> asynchronously,
> > >> I think writing them as a single file should be good enough.
> > >> In future, If we need faster stats writing (along with each write
> > >> operation) we can also implement multiple stats files.
> > >>
> > >> Just like how copy-on-write and merge-on-read are serving their use
> cases
> > >> in Iceberg,
> > >> we might have to support both single-file writing and multiple-file
> > >> writing in the long run.
> > >>
> > >> Thanks,
> > >> Ajantha
> > >>
> > >> On Wed, May 17, 2023 at 1:38 AM Mayur Srivastava <
> > >> mayur.srivast...@twosigma.com> wrote:
> > >>
> > >>> I agree, it totally depends on the way “last modified time” per
> > >>> partition is implemented.
> > >>>
> > >>> I’m concerned about performance of computing partition stats (and
> > >>> storage + the size of table metadata files) if the implementation
> requires
> > >>> users to keep around