Re: Welcome new committers and PMC!

2023-05-03 Thread Wing Yew Poon
Congratulations, Amogh, Eduard and Szehon! Well deserved!


On Wed, May 3, 2023 at 12:07 PM Ryan Blue  wrote:

> Hi everyone,
>
> I want to congratulate Amogh and Eduard, who were just added as Iceberg
> committers and Szehon, who was just added to the PMC. Thanks for all your
> contributions!
>
> Ryan
>
> --
> Ryan Blue
>


Re: rewrite action for collate how can we pass date range?

2023-05-24 Thread Wing Yew Poon
Gaurav,

Is your data partitioned by date? If so, you can compact subsets of
partitions at a time. To do this using the Spark procedure, you pass a
where clause:

spark.sql("CALL catalog_name.system.rewrite_data_files(table => '...',
where => '...')")
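
For example, if the table is partitioned by a date column, a one-week
compaction pass might look roughly like this (a sketch only; the table and
column names, db.sample and event_date, are made up):

spark.sql(
    "CALL catalog_name.system.rewrite_data_files(" +
    "table => 'db.sample', " +
    "where => 'event_date >= \"2023-05-01\" and event_date < \"2023-05-08\"')");

You can then advance the date range in subsequent calls until the whole table
has been compacted.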

If you use the RewriteDataFilesSparkAction, you call filter(Expression),
but then you have to pass in your where clause as an Iceberg Expression.
You can use
https://github.com/apache/iceberg/blob/apache-iceberg-1.2.1/spark/v3.3/spark/src/main/scala/org/apache/spark/sql/execution/datasources/SparkExpressionConverter.scala
as shown in
https://github.com/apache/iceberg/blob/apache-iceberg-1.2.1/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/RewriteDataFilesProcedure.java#L133-L135
.
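
For illustration, building the filter directly with Iceberg's Expressions API
might look roughly like this (a sketch only; the column name event_date is
made up, and you can instead convert a SQL where string the way the procedure
does with SparkExpressionConverter):

import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.spark.actions.SparkActions;

Table table = ...;  // load the table from your catalog
SparkActions.get(spark)
    .rewriteDataFiles(table)
    // compact only the files matching the date range
    .filter(Expressions.and(
        Expressions.greaterThanOrEqual("event_date", "2023-05-01"),
        Expressions.lessThan("event_date", "2023-05-08")))
    .execute();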

- Wing Yew


On Tue, May 23, 2023 at 10:13 PM Gaurav Agarwal 
wrote:

>
> On Wed, May 24, 2023, 10:41 AM Gaurav Agarwal 
> wrote:
>
>> I have one more query. We are trying to compact files, and it is currently
>> taking a long time, as we have never compacted until now; this is the first
>> time we are performing compaction, after 5 months of continuously loading
>> data. We also changed the format version of the table from 1 to 2 in between.
>> The issue is that we are using the RewriteDataFiles Spark action Java API to
>> perform the compaction, but it is taking 24 hours for us to complete the job.
>> Is there a way in that API to pass a date range? There are options, but what
>> parameters should I pass to restrict the rewrite to a date range?
>>
>> Thanks
>>
>


allowing configs to be specified in SQLConf for Spark reads/writes

2023-06-16 Thread Wing Yew Poon
Hi,
I recently put up a PR, https://github.com/apache/iceberg/pull/7790, to
allow the write mode (copy-on-write/merge-on-read) to be specified in
SQLConf. The use case is explained in the PR.
Cheng Pan has an open PR, https://github.com/apache/iceberg/pull/7733, to
allow locality to be specified in SQLConf.
In the recent past, https://github.com/apache/iceberg/pull/6838/ was a PR
to allow the write distribution mode to be specified in SQLConf. This was
merged.
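
For illustration, with that merged PR the session-level override looks roughly
like this (a sketch; check SparkSQLProperties in your Iceberg version for the
exact property key, and the table names are made up):

spark.conf().set("spark.sql.iceberg.distribution-mode", "hash");
spark.sql("INSERT INTO db.sample SELECT * FROM db.staging_sample");

The write mode and locality PRs above would follow the same pattern, just with
different properties.
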
Cheng Pan asks if there is any guidance on when we should allow configs to
be specified in SQLConf.
Thanks,
Wing Yew

ps. The above open PRs could use reviews by committers.


Re: allowing configs to be specified in SQLConf for Spark reads/writes

2023-07-14 Thread Wing Yew Poon
I think that different use cases benefit from or even require different
solutions. I think enabling options in Spark SQL is helpful, but allowing
some configurations to be done in SQLConf is also helpful.
For Cheng Pan's use case (to disable locality), I think providing a conf
(which can be added to spark-defaults.conf by a cluster admin) is useful.
For my customer's use case (https://github.com/apache/iceberg/pull/7790),
being able to set the write mode per Spark job (where right now it can only
be set as a table property) is useful. Allowing this to be done in the SQL
with an option/hint could also work, but as I understand it, Szehon's PR (
https://github.com/apache/spark/pull/416830) is only applicable to reads,
not writes.

- Wing Yew


On Thu, Jul 13, 2023 at 1:04 AM Cheng Pan  wrote:

> Ryan, I understand that options should be job-specific, and introducing an
> OPTIONS HINT can let Spark SQL achieve capabilities similar to those of the
> DataFrame API.
>
> My point is, some of the Iceberg options should not be job-specific.
>
> For example, Iceberg has an option “locality” which can only be set at the
> job level, while Spark has a configuration
> “spark.shuffle.reduceLocality.enabled” which can be set at the cluster
> level. This gap blocks Spark administrators from migrating to Iceberg,
> because they cannot disable locality at the cluster level.
>
> So, what’s the principle in the Iceberg of classifying a configuration
> into SQLConf or OPTION?
>
> Thanks,
> Cheng Pan
>
>
>
>
> > On Jul 5, 2023, at 16:26, Cheng Pan  wrote:
> >
> > I would argue that the SQLConf way is more in line with Spark
> user/administrator habits.
> >
> > It’s a common practice that Spark administrators set configurations in
> spark-defaults.conf at the cluster level , and when the user has issues
> with their Spark SQL/Jobs, the first question they asked mostly is: can it
> be fixed by adding a spark configuration?
> >
> > The OPTIONS way brings additional learning efforts to Spark users and
> how can Spark administrators set them at cluster level?
> >
> > Thanks,
> > Cheng Pan
> >
> >
> >
> >
> >> On Jun 17, 2023, at 04:01, Wing Yew Poon 
> wrote:
> >>
> >> Hi,
> >> I recently put up a PR, https://github.com/apache/iceberg/pull/7790,
> to allow the write mode (copy-on-write/merge-on-read) to be specified in
> SQLConf. The use case is explained in the PR.
> >> Cheng Pan has an open PR, https://github.com/apache/iceberg/pull/7733,
> to allow locality to be specified in SQLConf.
> >> In the recent past, https://github.com/apache/iceberg/pull/6838/ was a
> PR to allow the write distribution mode to be specified in SQLConf. This
> was merged.
> >> Cheng Pan asks if there is any guidance on when we should allow configs
> to be specified in SQLConf.
> >> Thanks,
> >> Wing Yew
> >>
> >> ps. The above open PRs could use reviews by committers.
> >>
> >
>
>


Re: allowing configs to be specified in SQLConf for Spark reads/writes

2023-07-14 Thread Wing Yew Poon
Also, in the case of write mode (I mean write.delete.mode,
write.update.mode, write.merge.mode), these cannot be set as options
currently; they are only settable as table properties.

On Fri, Jul 14, 2023 at 5:58 PM Wing Yew Poon  wrote:

> I think that different use cases benefit from or even require different
> solutions. I think enabling options in Spark SQL is helpful, but allowing
> some configurations to be done in SQLConf is also helpful.
> For Cheng Pan's use case (to disable locality), I think providing a conf
> (which can be added to spark-defaults.conf by a cluster admin) is useful.
> For my customer's use case (https://github.com/apache/iceberg/pull/7790),
> being able to set the write mode per Spark job (where right now it can only
> be set as a table property) is useful. Allowing this to be done in the SQL
> with an option/hint could also work, but as I understand it, Szehon's PR (
> https://github.com/apache/spark/pull/416830) is only applicable to reads,
> not writes.
>
> - Wing Yew
>
>
> On Thu, Jul 13, 2023 at 1:04 AM Cheng Pan  wrote:
>
>> Ryan, I understand that options should be job-specific, and introducing an
>> OPTIONS HINT can let Spark SQL achieve capabilities similar to those of the
>> DataFrame API.
>>
>> My point is, some of the Iceberg options should not be job-specific.
>>
>> For example, Iceberg has an option “locality” which can only be set at the
>> job level, while Spark has a configuration
>> “spark.shuffle.reduceLocality.enabled” which can be set at the cluster
>> level. This gap blocks Spark administrators from migrating to Iceberg,
>> because they cannot disable locality at the cluster level.
>>
>> So, what’s the principle in the Iceberg of classifying a configuration
>> into SQLConf or OPTION?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>>
>>
>> > On Jul 5, 2023, at 16:26, Cheng Pan  wrote:
>> >
>> > I would argue that the SQLConf way is more in line with Spark
>> user/administrator habits.
>> >
>> > It’s a common practice that Spark administrators set configurations in
>> spark-defaults.conf at the cluster level , and when the user has issues
>> with their Spark SQL/Jobs, the first question they asked mostly is: can it
>> be fixed by adding a spark configuration?
>> >
>> > The OPTIONS way brings additional learning efforts to Spark users and
>> how can Spark administrators set them at cluster level?
>> >
>> > Thanks,
>> > Cheng Pan
>> >
>> >
>> >
>> >
>> >> On Jun 17, 2023, at 04:01, Wing Yew Poon 
>> wrote:
>> >>
>> >> Hi,
>> >> I recently put up a PR, https://github.com/apache/iceberg/pull/7790,
>> to allow the write mode (copy-on-write/merge-on-read) to be specified in
>> SQLConf. The use case is explained in the PR.
>> >> Cheng Pan has an open PR, https://github.com/apache/iceberg/pull/7733,
>> to allow locality to be specified in SQLConf.
>> >> In the recent past, https://github.com/apache/iceberg/pull/6838/ was
>> a PR to allow the write distribution mode to be specified in SQLConf. This
>> was merged.
>> >> Cheng Pan asks if there is any guidance on when we should allow
>> configs to be specified in SQLConf.
>> >> Thanks,
>> >> Wing Yew
>> >>
>> >> ps. The above open PRs could use reviews by committers.
>> >>
>> >
>>
>>


Re: allowing configs to be specified in SQLConf for Spark reads/writes

2023-07-26 Thread Wing Yew Poon
I was on vacation.
Currently, write modes (copy-on-write/merge-on-read) can only be set as
table properties, and default to copy-on-write. We have a customer who
wants to use copy-on-write for certain Spark jobs that write to some
Iceberg table and merge-on-read for other Spark jobs writing to the same
table, because of the write characteristics of those jobs. This seems like
a use case that should be supported. The only way they can do this
currently is to toggle the table property as needed before doing the
writes. This is not a sustainable workaround.
Hence, I think it would be useful to be able to configure the write mode as
a SQLConf. I also disagree that the table property should always win. If
this is the case, there is no way to override it. The existing behavior in
SparkConfParser is to use the option if set, else use the session conf if
set, else use the table property. This applies across the board.
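
For reference, the precedence I am describing is roughly the following
(paraphrased from SparkWriteConf, not the exact code; confParser is a field of
that class):

// write option > Spark session conf > table property
String modeName = confParser.stringConf()
    .option(SparkWriteOptions.DISTRIBUTION_MODE)
    .sessionConf(SparkSQLProperties.DISTRIBUTION_MODE)
    .tableProperty(TableProperties.WRITE_DISTRIBUTION_MODE)
    .parseOptional();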
- Wing Yew






On Sun, Jul 16, 2023 at 4:48 PM Ryan Blue  wrote:

> Yes, I agree that there is value for administrators from having some
> things exposed as Spark SQL configuration. That gets much harder when you
> want to use the SQLConf for table-level settings, though. For example, the
> target split size is something that was an engine setting in the Hadoop
> world, even though it makes no sense to use the same setting across vastly
> different tables --- think about joining a fact table with a dimension
> table.
>
> Settings like write mode are table-level settings. It matters what is
> downstream of the table. You may want to set a *default* write mode, but
> the table-level setting should always win. Currently, there are limits to
> overriding the write mode in SQL. That's why we should add hints. For
> anything beyond that, I think we need to discuss what you're trying to do.
> If it's to override a table-level setting with a SQL global, then we should
> understand the use case better.
>
> On Fri, Jul 14, 2023 at 6:09 PM Wing Yew Poon 
> wrote:
>
>> Also, in the case of write mode (I mean write.delete.mode,
>> write.update.mode, write.merge.mode), these cannot be set as options
>> currently; they are only settable as table properties.
>>
>> On Fri, Jul 14, 2023 at 5:58 PM Wing Yew Poon 
>> wrote:
>>
>>> I think that different use cases benefit from or even require different
>>> solutions. I think enabling options in Spark SQL is helpful, but allowing
>>> some configurations to be done in SQLConf is also helpful.
>>> For Cheng Pan's use case (to disable locality), I think providing a conf
>>> (which can be added to spark-defaults.conf by a cluster admin) is useful.
>>> For my customer's use case (https://github.com/apache/iceberg/pull/7790),
>>> being able to set the write mode per Spark job (where right now it can only
>>> be set as a table property) is useful. Allowing this to be done in the SQL
>>> with an option/hint could also work, but as I understand it, Szehon's PR (
>>> https://github.com/apache/spark/pull/416830) is only applicable to
>>> reads, not writes.
>>>
>>> - Wing Yew
>>>
>>>
>>> On Thu, Jul 13, 2023 at 1:04 AM Cheng Pan  wrote:
>>>
>>>> Ryan, I understand that options should be job-specific, and introducing
>>>> an OPTIONS HINT can let Spark SQL achieve capabilities similar to those
>>>> of the DataFrame API.
>>>>
>>>> My point is, some of the Iceberg options should not be job-specific.
>>>>
>>>> For example, Iceberg has an option “locality” which can only be set at
>>>> the job level, while Spark has a configuration
>>>> “spark.shuffle.reduceLocality.enabled” which can be set at the cluster
>>>> level. This gap blocks Spark administrators from migrating to Iceberg,
>>>> because they cannot disable locality at the cluster level.
>>>>
>>>> So, what’s the principle in the Iceberg of classifying a configuration
>>>> into SQLConf or OPTION?
>>>>
>>>> Thanks,
>>>> Cheng Pan
>>>>
>>>>
>>>>
>>>>
>>>> > On Jul 5, 2023, at 16:26, Cheng Pan  wrote:
>>>> >
>>>> > I would argue that the SQLConf way is more in line with Spark
>>>> user/administrator habits.
>>>> >
>>>> > It’s a common practice that Spark administrators set configurations
>>>> in spark-defaults.conf at the cluster level , and when the user has issues
>>>> with their Spark SQL/Jobs, the first question they asked mostly is: can it
>>>> be fixed by adding a spark configuration?
>>>

Re: allowing configs to be specified in SQLConf for Spark reads/writes

2023-07-26 Thread Wing Yew Poon
We are talking about DELETE/UPDATE/MERGE operations. There is only SQL
support for these operations. There is no DataFrame API support for them.*
Therefore write options are not applicable. Thus SQLConf is the only
available mechanism I can use to override the table property.
For reference, we currently support setting distribution mode using write
option, SQLConf and table property. It seems to me that
https://github.com/apache/iceberg/pull/6838/ is a precedent for what I'd
like to do.

* It would be of interest to support performing DELETE/UPDATE/MERGE from
DataFrames, but that is a whole other topic.


On Wed, Jul 26, 2023 at 12:04 PM Ryan Blue  wrote:

> I think we should aim to have the same behavior across properties that are
> set in SQL conf, table config, and write options. Having SQL conf override
> table config for this doesn't make sense to me. If the need is to override
> table configuration, then write options are the right way to do it.
>
> On Wed, Jul 26, 2023 at 10:10 AM Wing Yew Poon 
> wrote:
>
>> I was on vacation.
>> Currently, write modes (copy-on-write/merge-on-read) can only be set as
>> table properties, and default to copy-on-write. We have a customer who
>> wants to use copy-on-write for certain Spark jobs that write to some
>> Iceberg table and merge-on-read for other Spark jobs writing to the same
>> table, because of the write characteristics of those jobs. This seems like
>> a use case that should be supported. The only way they can do this
>> currently is to toggle the table property as needed before doing the
>> writes. This is not a sustainable workaround.
>> Hence, I think it would be useful to be able to configure the write mode
>> as a SQLConf. I also disagree that the table property should always win. If
>> this is the case, there is no way to override it. The existing behavior in
>> SparkConfParser is to use the option if set, else use the session conf if
>> set, else use the table property. This applies across the board.
>> - Wing Yew
>>
>>
>>
>>
>>
>>
>> On Sun, Jul 16, 2023 at 4:48 PM Ryan Blue  wrote:
>>
>>> Yes, I agree that there is value for administrators from having some
>>> things exposed as Spark SQL configuration. That gets much harder when you
>>> want to use the SQLConf for table-level settings, though. For example, the
>>> target split size is something that was an engine setting in the Hadoop
>>> world, even though it makes no sense to use the same setting across vastly
>>> different tables --- think about joining a fact table with a dimension
>>> table.
>>>
>>> Settings like write mode are table-level settings. It matters what is
>>> downstream of the table. You may want to set a *default* write mode, but
>>> the table-level setting should always win. Currently, there are limits to
>>> overriding the write mode in SQL. That's why we should add hints. For
>>> anything beyond that, I think we need to discuss what you're trying to do.
>>> If it's to override a table-level setting with a SQL global, then we should
>>> understand the use case better.
>>>
>>> On Fri, Jul 14, 2023 at 6:09 PM Wing Yew Poon
>>>  wrote:
>>>
>>>> Also, in the case of write mode (I mean write.delete.mode,
>>>> write.update.mode, write.merge.mode), these cannot be set as options
>>>> currently; they are only settable as table properties.
>>>>
>>>> On Fri, Jul 14, 2023 at 5:58 PM Wing Yew Poon 
>>>> wrote:
>>>>
>>>>> I think that different use cases benefit from or even require
>>>>> different solutions. I think enabling options in Spark SQL is helpful, but
>>>>> allowing some configurations to be done in SQLConf is also helpful.
>>>>> For Cheng Pan's use case (to disable locality), I think providing a
>>>>> conf (which can be added to spark-defaults.conf by a cluster admin) is
>>>>> useful.
>>>>> For my customer's use case (
>>>>> https://github.com/apache/iceberg/pull/7790), being able to set the
>>>>> write mode per Spark job (where right now it can only be set as a table
>>>>> property) is useful. Allowing this to be done in the SQL with an
>>>>> option/hint could also work, but as I understand it, Szehon's PR (
>>>>> https://github.com/apache/spark/pull/416830) is only applicable to
>>>>> reads, not writes.
>>>>>
>>>>> - Wing Yew
>>>>>
>>>>>
>>>>> On Thu, Jul

Re: Is there a way to distcp iceberg table from hadoop?

2023-12-02 Thread Wing Yew Poon
Aren't we forgetting about position delete files? If the table has position
delete files, then those contain absolute file paths as well.
We cannot add them to the table as-is. We need to rewrite them. This, I
think, is the most painful part of replicating an Iceberg table.
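
To make this concrete, a position delete row looks roughly like this (the path
is made up for illustration):

 file_path                                                    | pos
--------------------------------------------------------------|-----
 hdfs://old-cluster/warehouse/db/tbl/data/00000-0-abc.parquet | 42

After a distcp, a copied position delete file still carries the old absolute
path in its file_path column, so it no longer matches the relocated data file;
unlike data files, it cannot simply be re-registered at the new location.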
- Wing Yew


On Sat, Dec 2, 2023 at 5:23 PM Fokko Driesprong  wrote:

> Hi Dongjun,
>
> Thanks for reaching out on the mailinglist. Another option might be to
> copy the data, and then use a Spark procedure, called add_files
>  to
> add the files to the table. Let me know if this works for you.
>
> Kind regards,
> Fokko
>
> Op za 2 dec 2023 om 02:43 schreef Ajantha Bhat :
>
>> Hi,
>>
>> You are right. Moving Iceberg tables from storage and expecting them to
>> function at the new location is not currently feasible.
>> The issue lies in the metadata files, which store the absolute path.
>>
>> To address this, we need support for relative paths, but it appears that
>> progress on this front has been slow.
>> You can monitor the status of this feature at
>> https://github.com/apache/iceberg/pull/8260.
>>
>> As a temporary fix, you can use the CTAS method to create a duplicate
>> copy of the table at the desired new path.
>>
>> Thanks,
>> Ajantha
>>
>> On Fri, Dec 1, 2023 at 10:01 PM Dongjun Hwang 
>> wrote:
>>
>>> Hello! My name is Dongjun Hwang.
>>>
>>> I recently performed distcp on the iceberg table in Hadoop.
>>>
>>> Data search was not possible because all file paths in the metadata
>>> directory were not changed.
>>>
>>> Is there a way to distcp the iceberg table?
>>>
>>> thank you!!
>>>
>>


Re: Community Meeting Minutes ?

2023-12-06 Thread Wing Yew Poon
The meeting minutes and a link to the recording used to be sent out to this
list regularly soon after the community sync. I have not been able to
attend the sync recently and I haven't seen the minutes for the last two
syncs. Can we please maintain the practice of sending the minutes and
recording out?
Thanks,
Wing Yew


On Fri, Oct 27, 2023 at 2:40 AM Jean-Baptiste Onofré 
wrote:

> Thanks Brian, much appreciated!
>
> Regards
> JB
>
> On Thu, Oct 26, 2023 at 10:29 PM Brian Olsen 
> wrote:
> >
> > Thanks for the reminder here JB. I just created a list to follow for
> this process so I don't forget. At some point, I'll add it to the
> documentation so that anyone can run this over time. I will share out the
> last few meeting minutes in their own threads now.
> >
> > On Thu, Oct 12, 2023 at 9:03 AM Jean-Baptiste Onofré 
> wrote:
> >>
> >> Hi guys,
> >>
> >> Thanks for the community meeting yesterday, it was super interesting
> >> and motivating :)
> >>
> >> As we say at Apache: "If it didn't happen on the mailing list, it
> >> never happened" :)
> >> In order to give a chance to anyone in the community to see the topics
> >> and participate, it would be great to share the meeting minutes on the
> >> mailing list.
> >>
> >> I know Brian did that in July. It would be great to do it
> "systematically".
> >>
> >> @Brian do you mind sharing the meeting minutes on the mailing list ?
> >> Do you need my help to complete/review ?
> >> Maybe we can add it on the website too ?
> >>
> >> Thanks !
> >> Regards
> >> JB
>


Re: Community Meeting Minutes ?

2023-12-08 Thread Wing Yew Poon
Brian,
Thanks for sending out the meeting minutes (the updated version looks
good!).
- Wing Yew


On Thu, Dec 7, 2023 at 2:07 PM Brian Olsen  wrote:

> Hey Wing Yew,
>
> Sorry about this. I am just about to publish the last two. Me and the
> other person that is responsible for these were hit by a series of family
> and medical issues so apologies. I will put some better backups into place
> in the unlikely event we are both out of commission.
>
>  Thanks for the push and stand by for the meeting minutes.
>
> On Wed, Dec 6, 2023 at 3:06 PM Wing Yew Poon 
> wrote:
>
>> The meeting minutes and a link to the recording used to be sent out to
>> this list regularly soon after the community sync. I have not been able to
>> attend the sync recently and I haven't seen the minutes for the last two
>> syncs. Can we please maintain the practice of sending the minutes and
>> recording out?
>> Thanks,
>> Wing Yew
>>
>>
>> On Fri, Oct 27, 2023 at 2:40 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Thanks Brian, much appreciated!
>>>
>>> Regards
>>> JB
>>>
>>> On Thu, Oct 26, 2023 at 10:29 PM Brian Olsen 
>>> wrote:
>>> >
>>> > Thanks for the reminder here JB. I just created a list to follow for
>>> this process so I don't forget. At some point, I'll add it to the
>>> documentation so that anyone can run this over time. I will share out the
>>> last few meeting minutes in their own threads now.
>>> >
>>> > On Thu, Oct 12, 2023 at 9:03 AM Jean-Baptiste Onofré 
>>> wrote:
>>> >>
>>> >> Hi guys,
>>> >>
>>> >> Thanks for the community meeting yesterday, it was super interesting
>>> >> and motivating :)
>>> >>
>>> >> As we say at Apache: "If it didn't happen on the mailing list, it
>>> >> never happened" :)
>>> >> In order to give a chance to anyone in the community to see the topics
>>> >> and participate, it would be great to share the meeting minutes on the
>>> >> mailing list.
>>> >>
>>> >> I know Brian did that in July. It would be great to do it
>>> "systematically".
>>> >>
>>> >> @Brian do you mind sharing the meeting minutes on the mailing list ?
>>> >> Do you need my help to complete/review ?
>>> >> Maybe we can add it on the website too ?
>>> >>
>>> >> Thanks !
>>> >> Regards
>>> >> JB
>>>
>>


spec question on equality deletes

2024-04-12 Thread Wing Yew Poon
Hi,

I have some questions on the current Iceberg spec regarding equality
deletes:
https://iceberg.apache.org/spec/#equality-delete-files
The spec says that for "a table with the following data:

 1: id | 2: category | 3: name
-------|-------------|---------
 1     | marsupial   | Koala
 2     | toy         | Teddy
 3     | NULL        | Grizzly
 4     | NULL        | Polar

The delete id = 3 could be written as either of the following equality
delete files:

 equality_ids=[1]

 1: id
---
 3

equality_ids=[1]

 1: id | 2: category | 3: name
-------|-------------|---------
 3     | NULL        | Grizzly

"

1. Are the options either (a) write only the column(s) listed in
equality_ids or (b) write all the columns? i.e, no in between.
2. If we write all the columns, are only columns listed in equality_ids
considered? What happens if a non-equality_id column does not match? e.g.,

equality_ids=[1]

 1: id | 2: category | 3: name
-------|-------------|---------
 3     | NULL        | Polar

Is that (a) invalid, or does that (b) still result in deleting id = 3, or
(c) result in deleting no rows?

The spec says "Each row of the delete file produces one equality predicate
that matches any row where the delete columns are equal. Multiple columns
can be thought of as an AND of equality predicates." That could be
interpreted to mean (c).
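
Spelled out as predicates (my wording, not the spec's), the two readings of
the delete row (3, NULL, Polar) with equality_ids=[1] would be:

 id = 3                                           -- only equality_ids columns, i.e. (b)
 id = 3 AND category IS NULL AND name = 'Polar'   -- all written columns, i.e. (c)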

Thanks,
Wing Yew


Re: spec question on equality deletes

2024-04-15 Thread Wing Yew Poon
Hi Renjie,
Thank you for your perspective.
On 1, I am inclined to the same view as you.
On 2, I feel that the spec should clearly define the expected behavior; it
should not be left to engines. At worst, the spec can say, e.g., that the
correct behavior is (b) but it is acceptable for an engine to throw an
error (a); or that the correct behavior is (c). We cannot have some engines
doing (b) and some doing (c), as (b) and (c) are basically opposite.
I'm interested in other perspectives.
- Wing Yew

Btw, I go by Wing Yew, not Wing.


On Sat, Apr 13, 2024 at 6:12 AM Renjie Liu  wrote:

> Hi, Wing:
>
>
>
> 1. Are the options either (a) write only the column(s) listed in
> equality_ids or (b) write all the columns? i.e, no in between.
>
>
>
> Yes, I think so.
>
>
>
> 2. If we write all the columns, are only columns listed in equality_ids
> considered? What happens if a non-equality_id column does not match? e.g.,
>
>
>
> equality_ids=[1]
>  1: id | 2: category | 3: name
> -------|-------------|---------
>  3     | NULL        | Polar
>
>
>
> Is that (a) invalid, or does that (b) still result in deleting id = 3, or
> (c) result in deleting no rows?
>
>
>
> Which columns are considered depends on context:
>
>- Only columns listed in eqality_ids are considered when applying
>deletions.
>- If other columns are filled, they are considered during planning,
>e.g. helps to prune equal deletion files that should be applied to data
>file.
>
>
>
> I think it’s considered as invalid since it may produce wrong results,
> e.g. pruning extra deletion file.
>
>
>
> The spec says "Each row of the delete file produces one equality
> predicate that matches any row where the delete columns are equal. Multiple
> columns can be thought of as an AND of equality predicates." That could
> be interpreted to mean (c).
>
>
>
> Whether it’s incorrect depends on how the compute engine works. If the
> compute engine doesn’t try to prune deletion files, then inconsistent
>  column data may  not affect the result. But in general it should be
> considered as incorrect data.
>
>
>
> *From: *Wing Yew Poon 
> *Date: *Saturday, April 13, 2024 at 02:16
> *To: *dev@iceberg.apache.org 
> *Subject: *spec question on equality deletes
>
> Hi,
>
>
>
> I have some questions on the current Iceberg spec regarding equality
> deletes:
>
> https://iceberg.apache.org/spec/#equality-delete-files
>
> The spec says that for "a table with the following data:
>
>  1: id | 2: category | 3: name
>
> ---|-|-
>
>  1 | marsupial   | Koala
>
>  2 | toy | Teddy
>
>  3 | NULL| Grizzly
>
>  4 | NULL| Polar
>
> The delete id = 3 could be written as either of the following equality
> delete files:
>
> equality_ids=[1]
>
>
>
>  1: id
>
> ---
>
>  3
>
> equality_ids=[1]
>
>
>
>  1: id | 2: category | 3: name
>
> ---|-|-
>
>  3 | NULL| Grizzly
>
> "
>
>
>
> 1. Are the options either (a) write only the column(s) listed in
> equality_ids or (b) write all the columns? i.e, no in between.
>
> 2. If we write all the columns, are only columns listed in equality_ids
> considered? What happens if a non-equality_id column does not match? e.g.,
>
>
>
> equality_ids=[1]
>  1: id | 2: category | 3: name
> -------|-------------|---------
>  3     | NULL        | Polar
>
>
>
> Is that (a) invalid, or does that (b) still result in deleting id = 3, or
> (c) result in deleting no rows?
>
>
>
> The spec says "Each row of the delete file produces one equality
> predicate that matches any row where the delete columns are equal. Multiple
> columns can be thought of as an AND of equality predicates." That could
> be interpreted to mean (c).
>
>
>
> Thanks,
>
> Wing Yew
>
>
>


Re: spec question on equality deletes

2024-04-15 Thread Wing Yew Poon
Hi Yufei,
Thank you for your response.
It sounds like on 2, your thinking is that (b) is the correct behavior.
Indeed, I have tried it out with Spark and afaict, it does (b). However,
that does not mean that it is the correct behavior. The spec should clearly
define it.
- Wing Yew


On Mon, Apr 15, 2024 at 5:25 PM Yufei Gu  wrote:

> Hi Wing Yew Poon,
>
> Here is my understanding, but not necessarily how an engine implements it.
> It should only consider the columns in equality_ids when we apply eq
> deletes. Also the engine should ignore the unrelated columns.
> It will still delete the row with id 3 in the following case you described
> even if the name doesn't match.
> equality_ids=[1]
>  1: id | 2: category | 3: name
> -------|-------------|---------
>  3     | NULL        | Polar
>
> To verify the behavior, we can check the test case
> like TestSparkReaderDeletes::testReadEqualityDeleteRows.
>
> Yufei
>
>
> On Fri, Apr 12, 2024 at 11:16 AM Wing Yew Poon 
> wrote:
>
>> Hi,
>>
>> I have some questions on the current Iceberg spec regarding equality
>> deletes:
>> https://iceberg.apache.org/spec/#equality-delete-files
>> The spec says that for "a table with the following data:
>>
>>  1: id | 2: category | 3: name
>> -------|-------------|---------
>>  1     | marsupial   | Koala
>>  2     | toy         | Teddy
>>  3     | NULL        | Grizzly
>>  4     | NULL        | Polar
>>
>> The delete id = 3 could be written as either of the following equality
>> delete files:
>>
>> equality_ids=[1]
>>
>>  1: id
>> ---
>>  3
>>
>> equality_ids=[1]
>>
>>  1: id | 2: category | 3: name
>> -------|-------------|---------
>>  3     | NULL        | Grizzly
>>
>> "
>>
>> 1. Are the options either (a) write only the column(s) listed in
>> equality_ids or (b) write all the columns? i.e, no in between.
>> 2. If we write all the columns, are only columns listed in equality_ids
>> considered? What happens if a non-equality_id column does not match? e.g.,
>>
>
>> equality_ids=[1]
>>
>>  1: id | 2: category | 3: name
>> -------|-------------|---------
>>  3     | NULL        | Polar
>>
>> Is that (a) invalid, or does that (b) still result in deleting id = 3,
>> or (c) result in deleting no rows?
>>
>> The spec says "Each row of the delete file produces one equality
>> predicate that matches any row where the delete columns are equal. Multiple
>> columns can be thought of as an AND of equality predicates." That could
>> be interpreted to mean (c).
>>
>> Thanks,
>> Wing Yew
>>
>>


Re: Iceberg Materialized View Meeting

2024-06-04 Thread Wing Yew Poon
Can you please record the meeting and make the recording available
afterwards?
Thanks,
Wing Yew


On Mon, Jun 3, 2024 at 11:32 PM Benny Chow  wrote:

> Thanks for organizing Jan.   I’ll be there!
>
> Benny
>
> On Jun 3, 2024, at 11:15 PM, Jan Kaul  wrote:
>
> 
>
> Hi all,
>
> we will have a video call to get together and discuss Iceberg Materialized
> Views. The call is on *Wednesday, 5 June 2024, 16:00:00 UTC (9:00 PDT)*
> and you can join the meeting with the following link:
>
> https://meet.google.com/ttr-xwnk-wiz
>
> On the agenda are:
>
>- Store "Storage table pointer" as view property or new metadata field?
>- Represent "Refresh information" (Lineage) as multiple properties or
>as single nested object?
>- Which fields are required for the "Refresh information"?
>
> It would be great if you could join the discussion. Looking forward to
> discussing with you.
>
> Regards,
>
> Jan
> 
>
>


Re: [DISCUSS] Enable the discussion tab for iceberg github repos

2024-07-09 Thread Wing Yew Poon
I am not familiar with the GitHub discussion feature and do not have an
opinion about using it.
I do think though that it would be useful to have a user list as well as a
dev list for Apache Iceberg. Many Apache projects have both. Discussions
about project work should continue to happen on the dev list. A user list
would be for users to ask questions about using Iceberg, to seek help, and
also to report problems (potential bugs), which if confirmed, could be
reported as GitHub issues. Oftentimes, in the absence of the user list,
users resort to opening a GitHub issue to ask a question. Of course, there
is Slack, but I think it wouldn't hurt to have a user list as another
channel.

- Wing Yew


On Tue, Jul 9, 2024 at 2:07 PM Piotr Findeisen 
wrote:

> Hi,
>
> I totally hear Ryan's concerns about further dividing the discussion. I
> had the same feeling when we opened discussions in Trino.
> The reality was more positive though. Discussions predominantly serve as a
> way to ask questions rather than drive decision-making in the project.
> Thinking from user perspective -- is it obvious that they can send an
> email to the dev list? or should they join slack?
> Discussions look very accessible to those that have a github account
> already, but we should probably make sure we don't move there content that
> belongs on the dev list.
>
> The other concern is whether the discussions will be getting attention
> from other project contributors. I.e. if someone is looking for help, are
> they getting it?
> We should at least informally monitor the situation here.
>
> Best
> Piotr
>
>
>
>
>
>
>
>
>
>
>
> On Tue, 9 Jul 2024 at 17:36, Jack Ye  wrote:
>
>> I am not familiar with the GitHub discussion feature, but could we start
>> with GitHub Issue tags + templates to distinguish between actual issues vs
>> this kind of questions? Why is that not sufficient?
>>
>> Also, if there are a lot of questions about the roadmap, I think we
>> should discuss and make good milestones for the project that are decoupled
>> from releases.
>>
>> I remember there was a similar question since we removed the roadmap page
>> in the website: https://github.com/apache/iceberg/issues/10390, maybe we
>> should reconsider adding at least a pointer in the website to the
>> milestones page.
>>
>> -Jack
>>
>> On Tue, Jul 9, 2024 at 8:32 AM Ryan Blue 
>> wrote:
>>
>>> My only concern about using this tool is that we may be
>>> further separating where discussion happens and not everyone will see
>>> what's happening. Usually, the dev list is the canonical place for
>>> discussions. Is that not a good solution? What differentiates what we would
>>> use github discussions for vs the dev list?
>>>
>>> On Tue, Jul 9, 2024 at 6:52 AM Eduard Tudenhöfner <
>>> etudenhoef...@apache.org> wrote:
>>>
 I think GH discussions would be great to have on the Iceberg repo(s),
 so +1 from my side on this.

 Eduard

 On Tue, Jul 9, 2024 at 8:14 AM Renjie Liu 
 wrote:

> Hi:
>
> It's also possible to create a user mailing list if it helps.
>
>
> I'm neutral to this option. Seems we are actually missing the user
> mail list.
>
> On Tue, Jul 9, 2024 at 1:50 PM Xuanwo  wrote:
>
>> Hi,
>>
>> > Regarding the discussion tab, it sounds good to me. It's pretty
>> straight forward to do by editing .asf.yaml.
>>
>> I tried this before. But the asf.yaml doesn't support controlling
>> discussions yet.
>> We need the help from infra team.
>>
>>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=git+-+.asf.yaml+features#Git.asf.yamlfeatures-GitHubDiscussions
>>
>>
>> On Tue, Jul 9, 2024, at 13:44, Jean-Baptiste Onofré wrote:
>> > Hi
>> >
>> > It's also possible to create a user mailing list if it helps.
>> >
>> > Regarding the discussion tab, it sounds good to me. It's pretty
>> > straight forward to do by editing .asf.yaml.
>> >
>> > Regards
>> > JB
>> >
>> > On Tue, Jul 9, 2024 at 5:18 AM Renjie Liu 
>> wrote:
>> >>
>> >> Hi:
>> >>
>> >> Recently we have observed more and more user interested in
>> iceberg-rust, and they have many questions about it, for example the
>> status, relationship with others such pyiceberg. Slack is a great place 
>> to
>> discussion, but is not friendly for long discussion and not easy to
>> comment. We can also encourage user to use github issue, but it's easy to
>> mix with true issues, e.g. feature tracking, bug tracking, etc.
>> >>
>> >> So I propose to enable the discussion tab for  repos of iceberg
>> and subprojects such as iceberg-rust, pyiceberg, iceberg-go.
>>
>> --
>> Xuanwo
>>
>> https://xuanwo.io/
>>
>
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>>>
>>


Re: allowing configs to be specified in SQLConf for Spark reads/writes

2024-07-09 Thread Wing Yew Poon
Hi Szehon,
Thanks for the update.
Can you please point me to the work on supporting DELETE/UPDATE/MERGE in
the DataFrame API?
Thanks,
Wing Yew


On Tue, Jul 9, 2024 at 10:05 PM Szehon Ho  wrote:

> Hi,
>
> Just FYI, good news, this change is merged on the Spark side :
> https://github.com/apache/spark/pull/46707 (its the third effort!).  In
> next version of Spark, we will be able to pass read properties via SQL to a
> particular Iceberg table such as
>
> SELECT * FROM iceberg.db.table1 WITH (`locality` = `true`)
>
> I will look at write options after this.
>
> There's also progress in supporting DELETE/UPDATE/MERGE from Dataframes as
> well, it should also be coming soon in Spark.
>
> Thanks,
> Szehon
>
>
>
> On Wed, Jul 26, 2023 at 12:46 PM Wing Yew Poon 
> wrote:
>
>> We are talking about DELETE/UPDATE/MERGE operations. There is only SQL
>> support for these operations. There is no DataFrame API support for them.*
>> Therefore write options are not applicable. Thus SQLConf is the only
>> available mechanism I can use to override the table property.
>> For reference, we currently support setting distribution mode using write
>> option, SQLConf and table property. It seems to me that
>> https://github.com/apache/iceberg/pull/6838/ is a precedent for what I'd
>> like to do.
>>
>> * It would be of interest to support performing DELETE/UPDATE/MERGE from
>> DataFrames, but that is a whole other topic.
>>
>>
>> On Wed, Jul 26, 2023 at 12:04 PM Ryan Blue  wrote:
>>
>>> I think we should aim to have the same behavior across properties that
>>> are set in SQL conf, table config, and write options. Having SQL conf
>>> override table config for this doesn't make sense to me. If the need is to
>>> override table configuration, then write options are the right way to do it.
>>>
>>> On Wed, Jul 26, 2023 at 10:10 AM Wing Yew Poon
>>>  wrote:
>>>
>>>> I was on vacation.
>>>> Currently, write modes (copy-on-write/merge-on-read) can only be set as
>>>> table properties, and default to copy-on-write. We have a customer who
>>>> wants to use copy-on-write for certain Spark jobs that write to some
>>>> Iceberg table and merge-on-read for other Spark jobs writing to the same
>>>> table, because of the write characteristics of those jobs. This seems like
>>>> a use case that should be supported. The only way they can do this
>>>> currently is to toggle the table property as needed before doing the
>>>> writes. This is not a sustainable workaround.
>>>> Hence, I think it would be useful to be able to configure the write
>>>> mode as a SQLConf. I also disagree that the table property should always
>>>> win. If this is the case, there is no way to override it. The existing
>>>> behavior in SparkConfParser is to use the option if set, else use the
>>>> session conf if set, else use the table property. This applies across the
>>>> board.
>>>> - Wing Yew
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, Jul 16, 2023 at 4:48 PM Ryan Blue  wrote:
>>>>
>>>>> Yes, I agree that there is value for administrators from having some
>>>>> things exposed as Spark SQL configuration. That gets much harder when you
>>>>> want to use the SQLConf for table-level settings, though. For example, the
>>>>> target split size is something that was an engine setting in the Hadoop
>>>>> world, even though it makes no sense to use the same setting across vastly
>>>>> different tables --- think about joining a fact table with a dimension
>>>>> table.
>>>>>
>>>>> Settings like write mode are table-level settings. It matters what is
>>>>> downstream of the table. You may want to set a *default* write mode, but
>>>>> the table-level setting should always win. Currently, there are limits to
>>>>> overriding the write mode in SQL. That's why we should add hints. For
>>>>> anything beyond that, I think we need to discuss what you're trying to do.
>>>>> If it's to override a table-level setting with a SQL global, then we 
>>>>> should
>>>>> understand the use case better.
>>>>>
>>>>> On Fri, Jul 14, 2023 at 6:09 PM Wing Yew Poon
>>>>>  wrote:
>>>>>
>>>>>> Also, in the case of write mode (I mean write.delete.mode,

Re: [DISCUSS][BYLAWS] Moving forward on the bylaws

2024-07-19 Thread Wing Yew Poon
Hi Owen,
Thanks for doing this.
Once you have the questions and choices, who gets to vote on them?
- Wing Yew


On Fri, Jul 19, 2024 at 10:07 AM Owen O'Malley 
wrote:

> All,
>Sorry for the long pause on bylaws discussion. It was a result of
> wanting to avoid the long US holiday week (July 4th) and my
> procrastination, which was furthered by a side conversation that asked me
> to consider how to move forward in an Apache way.
>   I'd like to thank Jack for moving this to this point. One concern that I
> had was there were lots of discussions and decisions that were being made
> off of our email lists, which isn't the way that Apache should work.
>   For finishing this off, I'd like to come up with a set of questions that
> should be answered by multiple choice questions and then use single
> transferable vote (STV) to resolve them. STV just means that each person
> lists their choices in a ranked order with a formal way to resolve how the
> votes work.
>   The questions that I have heard so far are:
>
>1. Should the PMC chair be term-limited and if so, what is the period? *In
>my experience, this isn't necessary in most projects and is often ignored.
>In Hadoop, Chris Douglas was a great chair and held it for 5 years in spite
>of the 1 year limit.*
>1. No term limit
>   2. 1 year
>   3. 2 year
>2. What should the minimum voting period be?* I'd suggest 3 days is
>far better as long as it isn't abused by holding important votes over
>holiday weekends.*
>1. 3 days (72 hours)
>   2. 7 days
>3. Should we keep the section on roles or just reference the Apache
>documentation . *I'd
>suggest that we reference the Apache documentation.*
>4. I'd like to include a couple sentences about the different hats at
>Apache and that votes should be for the benefit of the project and not our
>employers.
>5. I'd like to propose that we include text to formally include censor
>and potential removal for disclosing sensitive information from the private
>list.
>6. I'd like to propose branch committers. It has helped Hadoop a lot
>to enable people to work on development branches for large features before
>they are given general committership. It is better to have the branch work
>done at Apache and be visible than having large branches come in late in
>the project.
>7. Requirements for each topic (each could be consensus, lazy
>consensus, lazy majority, lazy 2/3's)
>1. Add committer
>   2. Remove committer
>   3. Add PMC
>   4. Remove PMC
>   5. Accept design proposal
>   6. Add subproject
>   7. Remove subproject
>   8. Release (can't be lazy consensus)
>   9. Modifying bylaws
>
> Thoughts? Missing questions?
>
> .. Owen
>


Re: Dropping JDK 8 support

2024-07-23 Thread Wing Yew Poon
I just wish to point out that when people started voting, the proposal was
"dropping JDK 8 support in Iceberg 2.0 release".
It's fine for people to propose dropping JDK8 support sooner than that (and
I'm not against that), but the proposal being voted on should not be
switched mid-vote.
- Wing Yew


On Tue, Jul 23, 2024 at 10:45 PM huaxin gao  wrote:

> I understand that transitioning from JDK 8 to JDK 11 requires some effort
> from the users. However, even if we wait until version 2.0, we
> still encounter the same problem. I don't see the need for more time to
> test the discontinuation of JDK 8 support. The configuration of Spark 3.5
> with JDK 11 and JDK 17 is very stable, and the majority of users are using
> this setting. Therefore, it seems to me that we don't need to wait more
> time to drop JDK 8 support.
>
> With that said, I don't have an extremely strong opinion on this matter.
> For Spark 4.0 support, I can change the spark-ci to only run Java 17 for
> Spark 4.0. However, I probably need to drop a couple of Java 8 CIs
> because they don't work with Spark 4.0.
>
> Thanks,
> Huaxin
>
> On Tue, Jul 23, 2024 at 8:11 PM Manu Zhang 
> wrote:
>
>> Yes, I'm asking for users who use JDK 8 with Spark 3.5. Users can
>> continue to use 1.6+ with Spark 3.5 and JDK 8 if we continue to support
>> them.
>> If we drop JDK 8 support after 1.6, then there might be issues for Spark
>> 3.5 with JDK 8 users.
>>
>> I'm +1 to drop JDK 8 support in 2.0. I think it's worth more discussion
>> and tests for dropping JDK 8 support in 1.6+ versions, which can be another
>> thread.
>>
>> On Wed, Jul 24, 2024 at 10:45 AM huaxin gao 
>> wrote:
>>
>>> Hi Manu,
>>> Thanks for the discussion. Is your concern about customers who use JDK 8
>>> with Spark 3.5? But we will face the same problem if we drop JDK 8 in
>>> Iceberg 2.0, unless we plan to drop Spark 3.5 support in 2.0.
>>>
>>> Huaxin
>>>
>>> On Tue, Jul 23, 2024 at 7:30 PM Renjie Liu 
>>> wrote:
>>>
 Hi, Manu:

 > If we drop JDK 8 support in 1.7, can Iceberg 1.7+ work seamlessly
 with Spark 3.5? Otherwise, users might get stuck in 1.6.

 I think spark 3.5 supports JDK 8/11/17 according to their doc. So users
 could still use iceberg 1.7+ after upgrading JDK.

 On Wed, Jul 24, 2024 at 9:40 AM Manu Zhang 
 wrote:

> Not sure about other engines but Spark has JDK 8 support till 3.5,
> which looks like a LTS version.
> If we drop JDK 8 support in 1.7, can Iceberg 1.7+ work seamlessly with
> Spark 3.5? Otherwise, users might get stuck in 1.6.
>
>>


clarification on changelog behavior for equality deletes

2024-08-20 Thread Wing Yew Poon
Hi,

I have a PR open to add changelog support for the case where delete files
are present (https://github.com/apache/iceberg/pull/10935). I have a
question about what the changelog should emit in the following scenario:

The table has a schema with a primary key/identifier column PK and
additional column V.
In snapshot 1, we write a data file DF1 with rows
PK1, V1
PK2, V2
etc.
In snapshot 2, we write an equality delete file ED1 with PK=PK1, and new
data file DF2 with rows
PK1, V1b
(possibly other rows)
In snapshot 3, we write an equality delete file ED2 with PK=PK1, and new
data file DF3 with rows
PK1, V1c
(possibly other rows)

Thus, in snapshot 2 and snapshot 3, we update the row identified by PK1
with new values by using an equality delete and writing new data for the
row.
These are the files present in snapshot 3:
DF1 (sequence number 1)
DF2 (sequence number 2)
DF3 (sequence number 3)
ED1 (sequence number 2)
ED2 (sequence number 3)

The question I have is what should the changelog emit for snapshot 3?
For snapshot 1, the changelog should emit a row for each row in DF1 as
INSERTED.
For snapshot 2, it should emit a row for PK1, V1 as DELETED; and a row for
PK1, V1b as INSERTED.
For snapshot 3, I see two possibilities:
(a)
PK1,V1b,DELETED
PK1,V1c,INSERTED

(b)
PK1,V1,DELETED
PK1,V1b,DELETED
PK1,V1c,INSERTED

The interpretation for (b) is that both ED1 and ED2 apply to DF1, with ED1
being an existing delete file and ED2 being an added delete file for it. We
discount ED1 and apply ED2 and get a DELETED row for PK1,V1.
ED2 also applies to DF2, from which we get a DELETED row for PK1, V1b.

The interpretation for (a) is that ED1 is an existing delete file for DF1,
so in snapshot 3 the row PK1,V1 already does not exist before the snapshot.
Thus we do not emit a DELETED row for it. (We can think of it as ED1 already
having been applied to DF1, so we only consider the additional rows that get
deleted when ED2 is applied.)

I lean towards (a), as I think it is more reflective of net changes.
I am interested to hear what folks think.
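
For anyone who wants to inspect the emitted rows, here is a rough sketch using
the create_changelog_view procedure (the procedure, parameter and metadata
column names should be checked against your Iceberg version's docs; the
catalog and table names are placeholders):

spark.sql("CALL spark_catalog.system.create_changelog_view(" +
    "table => 'db.tbl', changelog_view => 'tbl_changes')");
// each row carries _change_type, _change_ordinal and _commit_snapshot_id
spark.sql("SELECT * FROM tbl_changes " +
    "WHERE _commit_snapshot_id = <snapshot-3-id>").show();  // placeholder id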

Thank you,
Wing Yew


Re: Shall we start a regular community sync up?

2020-12-01 Thread Wing Yew Poon
I'd like to attend the community syncs as well. Can you please send me an
invite?
Thanks,
Wing Yew Poon

On Thu, Nov 19, 2020 at 9:25 PM Chitresh Kakwani 
wrote:

> Hi Ryan,
>
> Could you please add me to the invitation list as well ? New entrant.
> Interested in Iceberg's roadmap.
>
> Regards,
> Chitresh Kakwani
>
> On Thu, Nov 19, 2020 at 6:21 PM Vivekanand Vellanki 
> wrote:
>
>> Hi Ryan,
>>
>> I'd like to attend the regular community syncs. Can you send me an invite?
>>
>> Thanks
>> Vivek
>>
>> On Mon, Jun 15, 2020 at 11:16 PM Edgar Rodriguez
>>  wrote:
>>
>>> Hi Ryan,
>>>
>>> I'd like to attend the regular community syncs, could you send me
>>> an invite?
>>>
>>> Thanks!
>>>
>>> - Edgar
>>>
>>> On Wed, Mar 25, 2020 at 6:38 PM Ryan Blue 
>>> wrote:
>>>
>>>> Will do.
>>>>
>>>> On Wed, Mar 25, 2020 at 6:36 PM Jun Ma  wrote:
>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> Thanks for driving the sync up meeting. Could you please add Fan Diao(
>>>>> fan.dia...@gmail.com) and myself to the invitation list?
>>>>>
>>>>> Thanks,
>>>>> Jun Ma
>>>>>
>>>>> On Mon, Mar 23, 2020 at 9:57 PM OpenInx  wrote:
>>>>>
>>>>>> Hi Ryan
>>>>>>
>>>>>> I received your invitation. Some guys from our Flink teams also want
>>>>>> to join the hangouts  meeting. Do we need
>>>>>> also send an extra invitation to them ?  Or could them just join the
>>>>>> meeting with entering the meeting address[1] ?
>>>>>>
>>>>>> If need so, please let the following guys in:
>>>>>> 1. ykt...@gmail.com
>>>>>> 2. imj...@gmail.com
>>>>>> 3. yuzhao@gmail.com
>>>>>>
>>>>>> BTW,  I've written a draft to discuss in the meeting [2],  anyone
>>>>>> could enrich the topics want to discuss.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> [1]. https://meet.google.com/_meet/xdx-rknm-uvm
>>>>>> [2].
>>>>>> https://docs.google.com/document/d/1wXTHGYhc7sDhP5DxlByba0S5YguNLWwY98FAp6Tx2mw/edit#
>>>>>>
>>>>>> On Mon, Mar 23, 2020 at 5:35 AM Ryan Blue 
>>>>>> wrote:
>>>>>>
>>>>>>> I invited everyone that replied to this thread and the people that
>>>>>>> were on the last invite.
>>>>>>>
>>>>>>> If you have specific topics you'd like to put on the agenda, please
>>>>>>> send them to me!
>>>>>>>
>>>>>>> On Sun, Mar 22, 2020 at 2:28 PM Ryan Blue  wrote:
>>>>>>>
>>>>>>>> Let's go with Wednesday. I'll send out an invite.
>>>>>>>>
>>>>>>>> On Sun, Mar 22, 2020 at 1:36 PM John Zhuge 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> 5-5:30 pm work for me. Prefer Wednesdays.
>>>>>>>>>
>>>>>>>>> On Sun, Mar 22, 2020 at 1:33 PM Romin Parekh <
>>>>>>>>> rominpar...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi folks,
>>>>>>>>>>
>>>>>>>>>> Both times slots work for me next week. Can we confirm a day?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Romin
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>> > On Mar 20, 2020, at 11:38 PM, Jun H. 
>>>>>>>>>> wrote:
>>>>>>>>>> >
>>>>>>>>>> > The schedule works for me.
>>>>>>>>>> >
>>>>>>>>>> >> On Thu, Mar 19, 2020 at 6:55 PM Junjie Chen <
>>>>>>>>>> chenjunjied...@gmail.com> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >> The same time works for me as well.
>>>>>>>>>> >>

Re: About schema evolution with time travel.

2020-12-14 Thread Wing Yew Poon
Hi Tianyi,
The behavior you found is indeed the current behavior in Iceberg. I too
found it unexpected. I have a PR to address this:
https://github.com/apache/iceberg/pull/1508. Due to other work, I had not
followed up on this for a while, but I am returning to it now.
- Wing Yew


On Mon, Dec 14, 2020 at 6:27 AM Cap Kurmagati 
wrote:

> Hi,
>
> I have a question regarding the behavior of schema evolution with
> time-travel in Iceberg.
> When I do a time-travel query against a table with schema changes,
> I expect that the result is structured using the schema of that snapshot. But
> it turned out to be structured using the current schema.
>
> Is this an expected behavior?
> I think it would be nice to be able to query the data in its original
> shape. What do you think?
>
> Code snippet as follows. Environment: Iceberg 0.10.0, Spark 3.0.1
>
> sql("create table iceberg.test.schema_timetravel (id int, name string)
> using iceberg")
> sql("insert into table iceberg.test.schema_timetravel values(1, 'aaa')")
> sql("insert into table iceberg.test.schema_timetravel values(2, 'bbb')")
> sql("select * from iceberg.test.schema_timetravel").show()
> +---+---+
> | id|   name|
> +---+---+
> |  1|aaa|
> |  2|bbb|
> +---+---+
> sql("select * from iceberg.test.schema_timetravel.history").show()
>
> ++---+---+---+
> | made_current_at|snapshot_id|
>  parent_id|is_current_ancestor|
>
> ++---+---+---+
> |2020-12-14 22:44:...|2849000299888498484|   null|
>   true|
> |2020-12-14 22:44:...|5610242355805640211|2849000299888498484|
>   true|
>
> ++---+---+---+
> sql("alter table iceberg.test.schema_timetravel drop column name")
> sql("select * from iceberg.test.schema_timetravel").show()
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+
> spark.read.format("iceberg").option("snapshot-id",
> 2849000299888498484L).load("test.schema_timetravel").show()
> // Expect: show data in the previous schema: (1, aaa)
> // Result: show data in the current schema: (1)
> +---+
> | id|
> +---+
> |  1|
> +---+
>
> Best regards,
> Tianyi
>


Re: Welcoming Peter Vary as a new committer!

2021-01-25 Thread Wing Yew Poon
Congratulations Peter!


On Mon, Jan 25, 2021 at 10:35 AM Russell Spitzer 
wrote:

> Congratulations!
>
> On Jan 25, 2021, at 12:34 PM, Jacques Nadeau 
> wrote:
>
> Congrats Peter! Thanks for all your great work
>
> On Mon, Jan 25, 2021 at 10:24 AM Ryan Blue  wrote:
>
>> Hi everyone,
>>
>> I'd like to welcome Peter Vary as a new Iceberg committer.
>>
>> Thanks for all your contributions, Peter!
>>
>> rb
>>
>> --
>> Ryan Blue
>>
>
>


Re: Welcoming Yan Yan as a new committer!

2021-03-24 Thread Wing Yew Poon
Congratulations Yan!


On Wed, Mar 24, 2021 at 1:36 PM Ryan Murray  wrote:

> Congratulations!!
>
> On Wed, 24 Mar 2021, 11:39 Szehon Ho,  wrote:
>
>> Nice, congratulations!
>>
>> On 24 Mar 2021, at 11:37, Marton Bod  wrote:
>>
>> Congratulations, well done!
>>
>> On Wed, 24 Mar 2021 at 11:32, Peter Vary 
>> wrote:
>>
>>> Congratulations Yan!
>>>
>>> On Mar 24, 2021, at 05:43, Yufei Gu  wrote:
>>>
>>> Congratulations, Yan!
>>>
>>> Best,
>>>
>>> Yufei
>>>
>>> `This is not a contribution`
>>>
>>>
>>> On Tue, Mar 23, 2021 at 8:44 PM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
 Congratulations!

 On Mar 23, 2021, at 9:35 PM, OpenInx  wrote:

 Congrats Yan !   You deserve it.

 On Wed, Mar 24, 2021 at 7:18 AM Miao Wang 
 wrote:

> Congrats @Yan Yan !
>
>
>
> Miao
>
>
>
> *From: *Ryan Blue 
> *Reply-To: *"dev@iceberg.apache.org" 
> *Date: *Tuesday, March 23, 2021 at 3:43 PM
> *To: *Iceberg Dev List 
> *Subject: *Welcoming Yan Yan as a new committer!
>
>
>
> Hi everyone,
>
> I'd like to welcome Yan Yan as a new Iceberg committer.
>
> Thanks for all your contributions, Yan!
>
> rb
>
>
>
> --
>
> Ryan Blue
>


>>>
>>


Re: Compaction Sync - Monday

2021-04-19 Thread Wing Yew Poon
Russell,
Can you please add me too?
Thanks,
Wing Yew


On Mon, Apr 19, 2021 at 9:01 AM Russell Spitzer 
wrote:

> I officially moved the meeting to tonight 6PM (Pacific) or tomorrow
> morning 9AM (China ST) or 8PM (Central) -
> We all knew timezones were going to be the hard part of computer science :)
>
>
> Sorry for the late notice but I wanted to make sure that everyone who is
> interested can attend,
> Russ
>
> On Apr 19, 2021, at 9:13 AM, Jack Ye  wrote:
>
> Hi Russell, could you add me to the meeting with this email?
> yezhao...@gmail.com
>
> Thanks,
> Jack Ye
>
>
> On Sun, Apr 18, 2021, 8:14 PM Russell Spitzer 
> wrote:
>
>> Added everyone, I'm sorry this request is happening so late, I need to go
>> to bed so but i'll check to see what has happened overnight tomorrow
>> morning.
>>
>> On Apr 18, 2021, at 10:12 PM, Xinbin Huang  wrote:
>>
>> Hi Russell,
>>
>> Can you also add me to the meeting? I am available at either the original
>> or the new proposed time.
>>
>> Thanks
>> Bin
>>
>> On Sun, Apr 18, 2021 at 7:21 PM OpenInx  wrote:
>>
>>> Thanks for the  adjustment.  If 20:00 for you is too late,  we could
>>> move this earlier.  I mean 7:00 AM(Beijing),  8:00 AM(Beijing), 9:00
>>> AM(Beijing) , all those time works for me.
>>>
>>> On Mon, Apr 19, 2021 at 10:06 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
 If we could have a quick confirmation from everyone who wants to attend
 I would be glad to move it to 18:00 Pacific time although for me that's
 20:00 :) If everyone is ok with that I would be glad to swap, if not I can
 always be online then and have a second catchup meeting.

 Thanks for responding and I'll add you to the original event!
 Russ

 On Apr 18, 2021, at 9:00 PM, OpenInx  wrote:

 Thanks for pinging me.

 I'd like to attend that meeting,   but the time is April 20 at 00:00
 AM  Beijing time,  Junjie and I may need to get up in the middle of the
 night to attend this meeting.

 It would be better if we could adjust the time, but if everyone has
 recognized this point in time, we can also follow it (although it may
 interrupt sleep).

 btw, I think the availability time zone that matches Beijing time and
 Pacific time is:

 7:00 AM (beijing)  ->   16:00 PM (pacific)
 
 9:00 AM  (beijing) ->   18:00 PM (pacific)



 On Thu, Apr 15, 2021 at 9:58 PM Junjie Chen 
 wrote:

> I will try to attend this. Thanks for ping me.
>
> On Thu, Apr 15, 2021 at 10:32 AM Anton Okolnychyi <
> aokolnyc...@apple.com.invalid> wrote:
>
>> +1 for a meeting to discuss the compaction work.
>>
>> It would be great if Zheng and Junjie could make it. We can adjust
>> the time if needed.
>>
>> Thanks,
>> Anton
>>
>> On 14 Apr 2021, at 17:01, Russell Spitzer 
>> wrote:
>>
>> Hi Everybody!
>>
>> We've been spending a bunch of time recently finishing up our
>> compaction proposal and we were hoping to host a quick meetup on monday 
>> to
>> try to finish and gain consensus. We've made progress on the Spark Side 
>> and
>> have what we think is a reasonable way forward with the DSV2 api. We have
>> included sections on Delete Files as well which will mostly be leaving 
>> for
>> future work.
>>
>> I'll be finishing up our last set of internal notes on the Design Doc
>> tomorrow see
>>
>>
>> https://docs.google.com/document/d/1aXo1VzuXxSuqcTzMLSQdnivMVgtLExgDWUFMvWeXRxc/edit?ts=600b0432#
>>
>> I was planning on having this meeting on Monday April 19 at 9 AM
>> Pacific time (11 AM Central Time) but we can always change this if need 
>> be.
>>
>> Please let me know if you would like to attend and I will add you to
>> the event directly.
>>
>> Thanks so much for your time,
>> Russ
>>
>>
>>
>
> --
> Best Regards
>


>>
>


Re: Welcoming Jack Ye as a new committer!

2021-07-05 Thread Wing Yew Poon
Congratulations Jack!


On Mon, Jul 5, 2021 at 11:35 AM Ryan Blue  wrote:

> Hi everyone,
>
> I'd like to welcome Jack Ye as a new Iceberg committer.
>
> Thanks for all your contributions, Jack!
>
> Ryan
>
> --
> Ryan Blue
>


Re: [VOTE] Release Apache Iceberg 0.12.0 RC2

2021-08-07 Thread Wing Yew Poon
Sorry to bring this up so late, but this just came up: there is a Spark 3.1
(runtime) compatibility issue (not found by existing tests), which I have a
fix for in https://github.com/apache/iceberg/pull/2954. I think it would be
really helpful if it can go into 0.12.0.
- Wing Yew


On Fri, Aug 6, 2021 at 11:36 AM Jack Ye  wrote:

> +1 (non-binding)
>
> Verified release test and AWS integration test, issue found in test but
> not blocking for release (https://github.com/apache/iceberg/pull/2948)
>
> Verified Spark 3.1 and 3.0 operations and new SQL extensions and
> procedures on EMR.
>
> Thanks,
> Jack Ye
>
> On Fri, Aug 6, 2021 at 1:19 AM Kyle Bendickson 
> wrote:
>
>> +1 (binding)
>>
>> I verified:
>>  - KEYS signature & checksum
>>  - ./gradlew clean build (tests, etc)
>>  - Ran Spark jobs on Kubernetes after building from the tarball at
>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.0-rc2/
>>  - Spark 3.1.1 batch jobs against both Hadoop and Hive tables, using
>> HMS for Hive catalog
>>  - Verified default FileIO and S3FileIO
>>  - Basic read and writes
>>  - Jobs using Spark procedures (remove unreachable files)
>>  - Special mention: verified that Spark catalogs can override hadoop
>> configurations using configs prefixed with
>> "spark.sql.catalog.(catalog-name).hadoop."
>>  - one of my contributions to this release that has been asked about
>> by several customers internally
>>  - tested using `spark.sql.catalog.(catalog-name).hadoop.fs.s3a.impl`
>> for two catalogs, both values respected as opposed to the default globally
>> configured value
>>
>> Thank you Carl!
>>
>> - Kyle, Data OSS Dev @ Apple =)
>>
>> On Thu, Aug 5, 2021 at 11:49 PM Szehon Ho 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> * Verify Signature Keys
>>> * Verify Checksum
>>> * dev/check-license
>>> * Build
>>> * Run tests (though some timeout failures, on Hive MR test..)
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Thu, Aug 5, 2021 at 2:23 PM Daniel Weeks  wrote:
>>>
 +1 (binding)

 I verified sigs/sums, license, build, and test

 -Dan

 On Wed, Aug 4, 2021 at 2:53 PM Ryan Murray  wrote:

> After some wrestling w/ Spark I discovered that the problem was with
> my test. Some SparkSession apis changed. so all good here now.
>
> +1 (non-binding)
>
> On Wed, Aug 4, 2021 at 11:29 PM Ryan Murray  wrote:
>
>> Thanks for the help Carl, got it sorted out. The gpg check now works.
>> For those who were interested I used a canned wget command in my history
>> and it pulled the RC0 :-)
>>
>> Will have a PR to fix the Nessie Catalog soon.
>>
>> Best,
>> Ryan
>>
>> On Wed, Aug 4, 2021 at 9:21 PM Carl Steinbach 
>> wrote:
>>
>>> Hi Ryan,
>>>
>>> Can you please run the following command to see which keys in your
>>> public keyring are associated with my UID?
>>>
>>> % gpg  --list-keys c...@apache.org
>>> pub   rsa4096/5A5C7F6EB9542945 2021-07-01 [SC]
>>>   160F51BE45616B94103ED24D5A5C7F6EB9542945
>>> uid [ultimate] Carl W. Steinbach (CODE SIGNING KEY) <
>>> c...@apache.org>
>>> sub   rsa4096/4158EB8A4F03D2AA 2021-07-01 [E]
>>>
>>> Thanks.
>>>
>>> - Carl
>>>
>>> On Wed, Aug 4, 2021 at 11:12 AM Ryan Murray 
>>> wrote:
>>>
 Hi all,

 Unfortunately I have to give -1

 I had trouble w/ the keys:

 gpg: assuming signed data in 'apache-iceberg-0.12.0.tar.gz'
 gpg: Signature made Mon 02 Aug 2021 03:36:30 CEST
 gpg:using RSA key
 FAFEB6EAA60C95E2BB5E26F01FF0803CB78D539F
 gpg: Can't check signature: No public key

 And I have discovered a bug in NessieCatalog. It is unclear what is
 wrong but the NessieCatalog doesn't play nice w/ Spark3.1. I will 
 raise a
 patch ASAP to fix it. Very sorry for the inconvenience.

 Best,
 Ryan

 On Wed, Aug 4, 2021 at 3:20 AM Carl Steinbach 
 wrote:

> Hi everyone,
>
> I propose that we release RC2 as the official Apache Iceberg
> 0.12.0 release. Please note that RC0 and RC1 were DOA.
>
> The commit id for RC2 is 7c2fcfd893ab71bee41242b46e894e6187340070
> * This corresponds to the tag: apache-iceberg-0.12.0-rc2
> *
> https://github.com/apache/iceberg/commits/apache-iceberg-0.12.0-rc2
> *
> https://github.com/apache/iceberg/tree/7c2fcfd893ab71bee41242b46e894e6187340070
>
> The release tarball, signature, and checksums are here:
> *
> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.0-rc2/
>
> You can find the KEYS file here:
> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
> Convenience binary artifacts are sta

Re: [VOTE] Release Apache Iceberg 0.12.0 RC2

2021-08-09 Thread Wing Yew Poon
Ryan,
Thanks for the review. Let me look into implementing your refactoring
suggestion.
- Wing Yew


On Mon, Aug 9, 2021 at 8:41 AM Ryan Blue  wrote:

> Yeah, I agree. We should fix this for the 0.12.0 release. That said, I
> plan to continue testing this RC because it won't change that much since
> this affects the Spark extensions in 3.1. Other engines and Spark 3.0 or
> older should be fine.
>
> I left a comment on the PR. I think it looks good, but we should try to
> refactor to make sure we don't have more issues like this. I think when we
> update our extensions to be compatible with multiple Spark versions, we
> should introduce a factory method to create the Catalyst plan node and use
> that everywhere. That will hopefully cut down on the number of times this
> happens.
>
> Thank you, Wing Yew!
>
> On Sun, Aug 8, 2021 at 2:52 PM Carl Steinbach 
> wrote:
>
>> Hi Wing Yew,
>>
>> I will create a new RC once this patch is committed.
>>
>> Thanks.
>>
>> - Carl
>>
>> On Sat, Aug 7, 2021 at 4:29 PM Wing Yew Poon 
>> wrote:
>>
>>> Sorry to bring this up so late, but this just came up: there is a Spark
>>> 3.1 (runtime) compatibility issue (not found by existing tests), which I
>>> have a fix for in https://github.com/apache/iceberg/pull/2954. I think
>>> it would be really helpful if it can go into 0.12.0.
>>> - Wing Yew
>>>
>>>
>>> On Fri, Aug 6, 2021 at 11:36 AM Jack Ye  wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> Verified release test and AWS integration test, issue found in test but
>>>> not blocking for release (https://github.com/apache/iceberg/pull/2948)
>>>>
>>>> Verified Spark 3.1 and 3.0 operations and new SQL extensions and
>>>> procedures on EMR.
>>>>
>>>> Thanks,
>>>> Jack Ye
>>>>
>>>> On Fri, Aug 6, 2021 at 1:19 AM Kyle Bendickson 
>>>> wrote:
>>>>
>>>>> +1 (binding)
>>>>>
>>>>> I verified:
>>>>>  - KEYS signature & checksum
>>>>>  - ./gradlew clean build (tests, etc)
>>>>>  - Ran Spark jobs on Kubernetes after building from the tarball at
>>>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.0-rc2/
>>>>>  - Spark 3.1.1 batch jobs against both Hadoop and Hive tables,
>>>>> using HMS for Hive catalog
>>>>>  - Verified default FileIO and S3FileIO
>>>>>  - Basic read and writes
>>>>>  - Jobs using Spark procedures (remove unreachable files)
>>>>>  - Special mention: verified that Spark catalogs can override hadoop
>>>>> configurations using configs prefixed with
>>>>> "spark.sql.catalog.(catalog-name).hadoop."
>>>>>  - one of my contributions to this release that has been asked
>>>>> about by several customers internally
>>>>>  - tested using
>>>>> `spark.sql.catalog.(catalog-name).hadoop.fs.s3a.impl` for two catalogs,
>>>>> both values respected as opposed to the default globally configured value
>>>>>
>>>>> Thank you Carl!
>>>>>
>>>>> - Kyle, Data OSS Dev @ Apple =)
>>>>>
>>>>> On Thu, Aug 5, 2021 at 11:49 PM Szehon Ho 
>>>>> wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> * Verify Signature Keys
>>>>>> * Verify Checksum
>>>>>> * dev/check-license
>>>>>> * Build
>>>>>> * Run tests (though some timeout failures, on Hive MR test..)
>>>>>>
>>>>>> Thanks
>>>>>> Szehon
>>>>>>
>>>>>> On Thu, Aug 5, 2021 at 2:23 PM Daniel Weeks 
>>>>>> wrote:
>>>>>>
>>>>>>> +1 (binding)
>>>>>>>
>>>>>>> I verified sigs/sums, license, build, and test
>>>>>>>
>>>>>>> -Dan
>>>>>>>
>>>>>>> On Wed, Aug 4, 2021 at 2:53 PM Ryan Murray  wrote:
>>>>>>>
>>>>>>>> After some wrestling w/ Spark I discovered that the problem was
>>>>>>>> with my test. Some SparkSession apis changed. so all good here now.
>>>>>>>>
>>>>>>>> +1 (non-binding)
>>>>>>>>
>>>&

Re: [VOTE] Release Apache Iceberg 0.12.0 RC2

2021-08-09 Thread Wing Yew Poon
https://github.com/apache/iceberg/pull/2954 should be ready to merge. The
CI passed.


On Mon, Aug 9, 2021 at 9:08 AM Wing Yew Poon  wrote:

> Ryan,
> Thanks for the review. Let me look into implementing your refactoring
> suggestion.
> - Wing Yew
>
>
> On Mon, Aug 9, 2021 at 8:41 AM Ryan Blue  wrote:
>
>> Yeah, I agree. We should fix this for the 0.12.0 release. That said, I
>> plan to continue testing this RC because it won't change that much since
>> this affects the Spark extensions in 3.1. Other engines and Spark 3.0 or
>> older should be fine.
>>
>> I left a comment on the PR. I think it looks good, but we should try to
>> refactor to make sure we don't have more issues like this. I think when we
>> update our extensions to be compatible with multiple Spark versions, we
>> should introduce a factory method to create the Catalyst plan node and use
>> that everywhere. That will hopefully cut down on the number of times this
>> happens.
>>
>> Thank you, Wing Yew!
>>
>> On Sun, Aug 8, 2021 at 2:52 PM Carl Steinbach 
>> wrote:
>>
>>> Hi Wing Yew,
>>>
>>> I will create a new RC once this patch is committed.
>>>
>>> Thanks.
>>>
>>> - Carl
>>>
>>> On Sat, Aug 7, 2021 at 4:29 PM Wing Yew Poon 
>>> wrote:
>>>
>>>> Sorry to bring this up so late, but this just came up: there is a Spark
>>>> 3.1 (runtime) compatibility issue (not found by existing tests), which I
>>>> have a fix for in https://github.com/apache/iceberg/pull/2954. I think
>>>> it would be really helpful if it can go into 0.12.0.
>>>> - Wing Yew
>>>>
>>>>
>>>> On Fri, Aug 6, 2021 at 11:36 AM Jack Ye  wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> Verified release test and AWS integration test, issue found in test
>>>>> but not blocking for release (
>>>>> https://github.com/apache/iceberg/pull/2948)
>>>>>
>>>>> Verified Spark 3.1 and 3.0 operations and new SQL extensions and
>>>>> procedures on EMR.
>>>>>
>>>>> Thanks,
>>>>> Jack Ye
>>>>>
>>>>> On Fri, Aug 6, 2021 at 1:19 AM Kyle Bendickson 
>>>>> wrote:
>>>>>
>>>>>> +1 (binding)
>>>>>>
>>>>>> I verified:
>>>>>>  - KEYS signature & checksum
>>>>>>  - ./gradlew clean build (tests, etc)
>>>>>>  - Ran Spark jobs on Kubernetes after building from the tarball at
>>>>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.0-rc2/
>>>>>>  - Spark 3.1.1 batch jobs against both Hadoop and Hive tables,
>>>>>> using HMS for Hive catalog
>>>>>>  - Verified default FileIO and S3FileIO
>>>>>>  - Basic read and writes
>>>>>>  - Jobs using Spark procedures (remove unreachable files)
>>>>>>  - Special mention: verified that Spark catalogs can override hadoop
>>>>>> configurations using configs prefixed with
>>>>>> "spark.sql.catalog.(catalog-name).hadoop."
>>>>>>  - one of my contributions to this release that has been asked
>>>>>> about by several customers internally
>>>>>>  - tested using
>>>>>> `spark.sql.catalog.(catalog-name).hadoop.fs.s3a.impl` for two catalogs,
>>>>>> both values respected as opposed to the default globally configured value
>>>>>>
>>>>>> Thank you Carl!
>>>>>>
>>>>>> - Kyle, Data OSS Dev @ Apple =)
>>>>>>
>>>>>> On Thu, Aug 5, 2021 at 11:49 PM Szehon Ho 
>>>>>> wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> * Verify Signature Keys
>>>>>>> * Verify Checksum
>>>>>>> * dev/check-license
>>>>>>> * Build
>>>>>>> * Run tests (though some timeout failures, on Hive MR test..)
>>>>>>>
>>>>>>> Thanks
>>>>>>> Szehon
>>>>>>>
>>>>>>> On Thu, Aug 5, 2021 at 2:23 PM Daniel Weeks 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 (binding)
>>>>>>>>
>>>>>>&

Re: [DISCUSS] Spark version support strategy

2021-09-14 Thread Wing Yew Poon
I understand and sympathize with the desire to use new DSv2 features in
Spark 3.2. I agree that Option 1 is the easiest for developers, but I don't
think it considers the interests of users. I do not think that most users
will upgrade to Spark 3.2 as soon as it is released. It is a "minor
version" upgrade in name from 3.1 (or from 3.0), but I think we all know
that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1
and from 3.1 to 3.2. I think there are even a lot of users running Spark
2.4 and not even on Spark 3 yet. Do we also plan to stop supporting Spark
2.4?

Please correct me if I'm mistaken, but the folks who have spoken out in
favor of Option 1 all work for the same organization, don't they? And they
don't have a problem with making their users, all internal, simply upgrade
to Spark 3.2, do they? (Or they are already running an internal fork that
is close to 3.2.)

I work for an organization with customers running different versions of
Spark. It is true that we can backport new features to older versions if we
wanted to. I suppose the people contributing to Iceberg work for some
organization or other that either use Iceberg in-house, or provide software
(possibly in the form of a service) to customers, and either way, the
organizations have the ability to backport features and fixes to internal
versions. Are there any users out there who simply use Apache Iceberg and
depend on the community version?

There may be features that are broadly useful that do not depend on Spark
3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?

I am not in favor of Option 2. I do not oppose Option 1, but I would
consider Option 3 too. Anton, you said 5 modules are required; what are the
modules you're thinking of?

- Wing Yew





On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu  wrote:

> Option 1 sounds good to me. Here are my reasons:
>
> 1. Both 2 and 3 will slow down the development. Considering the limited
> resources in the open source community, the upsides of option 2 and 3 are
> probably not worthy.
> 2. Both 2 and 3 assume the use cases may not exist. It's hard to predict
> anything, but even if these use cases are legit, users can still get the
> new feature by backporting it to an older version in case upgrading to a
> newer version isn't an option.
>
> Best,
>
> Yufei
>
> `This is not a contribution`
>
>
> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi
>  wrote:
>
>> To sum up what we have so far:
>>
>>
>> *Option 1 (support just the most recent minor Spark 3 version)*
>>
>> The easiest option for us devs, forces the user to upgrade to the most
>> recent minor Spark version to consume any new Iceberg features.
>>
>> *Option 2 (a separate project under Iceberg)*
>>
>> Can support as many Spark versions as needed and the codebase is still
>> separate as we can use separate branches.
>> Impossible to consume any unreleased changes in core, may slow down the
>> development.
>>
>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>
>> Introduce more modules in the same project.
>> Can consume unreleased changes but it will require at least 5 modules to
>> support 2.4, 3.1 and 3.2, making the build and testing complicated.
>>
>>
>> Are there any users for whom upgrading the minor Spark version (e.g., 3.1 to
>> 3.2) to consume new features is a blocker?
>> We follow Option 1 internally at the moment but I would like to hear what
>> other people think/need.
>>
>> - Anton
>>
>>
>> On 14 Sep 2021, at 09:44, Russell Spitzer 
>> wrote:
>>
>> I think we should go for option 1. I already am not a big fan of having
>> runtime errors for unsupported things based on versions and I don't think
>> minor version upgrades are a large issue for users.  I'm especially not
>> looking forward to supporting interfaces that only exist in Spark 3.2 in a
>> multiple Spark version support future.
>>
>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>> aokolnyc...@apple.com.INVALID> wrote:
>>
>> First of all, is option 2 a viable option? We discussed separating the
>> python module outside of the project a few weeks ago, and decided to not do
>> that because it's beneficial for code cross reference and more intuitive
>> for new developers to see everything in the same repository. I would expect
>> the same argument to also hold here.
>>
>>
>> That’s exactly the concern I have about Option 2 at this moment.
>>
>> Overall I would personally prefer us to not support all the minor
>> versions, but instead support maybe just 2-3 latest versions in a major
>> version.
>>
>>
>> This is when it gets a bit complicated. If we want to support both Spark
>> 3.1 and Spark 3.2 with a single module, it means we have to compile against
>> 3.1. The problem is that we rely on DSv2 that is being actively developed.
>> 3.2 and 3.1 have substantial differences. On top of that, we have our
>> extensions that are extremely low-level and may break not only between
>> minor versions but also between patch releases.
>>
>> f there ar

Re: [DISCUSS] Spark version support strategy

2021-09-15 Thread Wing Yew Poon
st
> trying to build a library.
>
> Ryan
>
> On Wed, Sep 15, 2021 at 2:58 AM OpenInx  wrote:
>
>> Thanks for bringing this up,  Anton.
>>
>> Everyone has great pros/cons to support their preferences.  Before giving
>> my preference, let me raise one question:what's the top priority thing
>> for apache iceberg project at this point in time ?  This question will help
>> us to answer the following question: Should we support more engine versions
>> more robustly or be a bit more aggressive and concentrate on getting the
>> new features that users need most in order to keep the project more
>> competitive ?
>>
>> If people watch the apache iceberg project and check the issues &
>> PR frequently,  I guess more than 90% people will answer the priority
>> question:   There is no doubt for making the whole v2 story to be
>> production-ready.   The current roadmap discussion also proves this:
>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>> .
>>
>> In order to ensure the highest priority at this point in time, I will
>> prefer option-1 to reduce the cost of engine maintenance, so as to free up
>> resources to make v2 production-ready.
>>
>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao 
>> wrote:
>>
>>> From Dev's point, it has less burden to always support the latest
>>> version of Spark (for example). But from user's point, especially for us
>>> who maintain Spark internally, it is not easy to upgrade the Spark version
>>> for the first time (since we have many customizations internally), and
>>> we're still promoting to upgrade to 3.1.2. If the community ditches the
>>> support of old version of Spark3, users have to maintain it themselves
>>> unavoidably.
>>>
>>> So I'm inclined to make this support in community, not by users
>>> themselves, as for Option 2 or 3, I'm fine with either. And to relieve the
>>> burden, we could support limited versions of Spark (for example 2 versions).
>>>
>>> Just my two cents.
>>>
>>> -Saisai
>>>
>>>
>>> Jack Ye  于2021年9月15日周三 下午1:35写道:
>>>
>>>> Hi Wing Yew,
>>>>
>>>> I think 2.4 is a different story, we will continue to support Spark
>>>> 2.4, but as you can see it will continue to have very limited
>>>> functionalities comparing to Spark 3. I believe we discussed about option 3
>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the
>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>> consistent strategy around this, let's take this chance to make a good
>>>> community guideline for all future engine versions, especially for Spark,
>>>> Flink and Hive that are in the same repository.
>>>>
>>>> I can totally understand your point of view Wing, in fact, speaking
>>>> from the perspective of AWS EMR, we have to support over 40 versions of the
>>>> software because there are people who are still using Spark 1.4, believe it
>>>> or not. After all, keep backporting changes will become a liability not
>>>> only on the user side, but also on the service provider side, so I believe
>>>> it's not a bad practice to push for user upgrade, as it will make the life
>>>> of both parties easier in the end. New feature is definitely one of the
>>>> best incentives to promote an upgrade on user side.
>>>>
>>>> I think the biggest issue of option 3 is about its scalability, because
>>>> we will have an unbounded list of packages to add and compile in the
>>>> future, and we probably cannot drop support of that package once created.
>>>> If we go with option 1, I think we can still publish a few patch versions
>>>> for old Iceberg releases, and committers can control the amount of patch
>>>> versions to guard people from abusing the power of patching. I see this as
>>>> a consistent strategy also for Flink and Hive. With this strategy, we can
>>>> truly have a compatibility matrix for engine versions against Iceberg
>>>> versions.
>>>>
>>>> -Jack
>>>>
>>>>
>>>>
>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon
>>>>  wrote:
>>>>
>>>>> I understand and sympathize with the desire to use new DSv2 features
>>>>> in Spark 3.2. I agre

Re: [DISCUSS] Spark version support strategy

2021-09-28 Thread Wing Yew Poon
  /runtime/...
>>>>   /3.1/core/...
>>>> /extension/...
>>>> /runtime/...
>>>>
>>>> The gradle build script in the root is configured to build against the
>>>> latest version of Spark by default, unless otherwise specified by the user.
>>>>
>>>> Intellij can also be configured to only index files of specific
>>>> versions based on the same config used in build.
>>>>
>>>> In this way, I imagine the CI setup to be much easier to do things like
>>>> testing version compatibility for a feature or running only a
>>>> specific subset of Spark version builds based on the Spark version
>>>> directories touched.
>>>>
>>>> And the biggest benefit is that we don't have the same difficulty as
>>>> option 2 of developing a feature when it's both in core and Spark.
>>>>
>>>> We can then develop a mechanism to vote to stop support of certain
>>>> versions, and archive the corresponding directory to avoid accumulating too
>>>> many versions in the long term.
>>>>
>>>> -Jack Ye
>>>>
>>>>
>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue  wrote:
>>>>
>>>>> Sorry, I was thinking about CI integration between Iceberg Java and
>>>>> Iceberg Spark, I just didn't mention it and I see how that's a big thing 
>>>>> to
>>>>> leave out!
>>>>>
>>>>> I would definitely want to test the projects together. One thing we
>>>>> could do is have a nightly build like Russell suggests. I'm also wondering
>>>>> if we could have some tighter integration where the Iceberg Spark build 
>>>>> can
>>>>> be included in the Iceberg Java build using properties. Maybe the github
>>>>> action could checkout Iceberg, then checkout the Spark integration's 
>>>>> latest
>>>>> branch, and then run the gradle build with a property that makes Spark a
>>>>> subproject in the build. That way we can continue to have Spark CI run
>>>>> regularly.
>>>>>
>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> I agree that Option 2 is considerably more difficult for development
>>>>>> when core API changes need to be picked up by the external Spark module. 
>>>>>> I
>>>>>> also think a monthly release would probably still be prohibitive to
>>>>>> actually implementing new features that appear in the API, I would hope 
>>>>>> we
>>>>>> have a much faster process or maybe just have snapshot artifacts 
>>>>>> published
>>>>>> nightly?
>>>>>>
>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <
>>>>>> wyp...@cloudera.com.INVALID> wrote:
>>>>>>
>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a
>>>>>> separate repo (subproject of Iceberg). Would we have branches such as
>>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be
>>>>>> supported in all versions or all Spark 3 versions, then we would need to
>>>>>> commit the changes to all applicable branches. Basically we are trading
>>>>>> more work to commit to multiple branches for simplified build and CI
>>>>>> time per branch, which might be an acceptable trade-off. However, the
>>>>>> biggest downside is that changes may need to be made in core Iceberg as
>>>>>> well as in the engine (in this case Spark) support, and we need to wait 
>>>>>> for
>>>>>> a release of core Iceberg to consume the changes in the subproject. In 
>>>>>> this
>>>>>> case, maybe we should have a monthly release of core Iceberg (no matter 
>>>>>> how
>>>>>> many changes go in, as long as it is non-zero) so that the subproject can
>>>>>> consume changes fairly quickly?
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue  wrote:
>>>>>>
>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the set of
>>>>

Re: [DISCUSS] Spark version support strategy

2021-09-28 Thread Wing Yew Poon
Hi OpenInx,
I'm sorry I misunderstood the thinking of the Flink community. Thanks for
the clarification.
- Wing Yew


On Tue, Sep 28, 2021 at 7:15 PM OpenInx  wrote:

> Hi Wing
>
> As we discussed above, we community prefer to choose option.2 or
> option.3.  So in fact, when we planned to upgrade the flink version from
> 1.12 to 1.13,  we are doing our best to guarantee the master iceberg repo
> could work fine for both flink1.12 & flink1.13. More context please see
> [1], [2], [3]
>
> [1] https://github.com/apache/iceberg/pull/3116
> [2] https://github.com/apache/iceberg/issues/3183
> [3]
> https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E
>
>
> On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon 
> wrote:
>
>> In the last community sync, we spent a little time on this topic. For
>> Spark support, there are currently two options under consideration:
>>
>> Option 2: Separate repo for the Spark support. Use branches for
>> supporting different Spark versions. Main branch for the latest Spark
>> version (3.2 to begin with).
>> Tooling needs to be built for producing regular snapshots of core Iceberg
>> in a consumable way for this repo. Unclear if commits to core Iceberg will
>> be tested pre-commit against Spark support; my impression is that they will
>> not be, and the Spark support build can be broken by changes to core.
>>
>> A variant of option 3 (which we will simply call Option 3 going forward):
>> Single repo, separate module (subdirectory) for each Spark version to be
>> supported. Code duplication in each Spark module (no attempt to refactor
>> out common code). Each module built against the specific version of Spark
>> to be supported, producing a runtime jar built against that version. CI
>> will test all modules. Support can be provided for only building the
>> modules a developer cares about.
>>
>> More input was sought and people are encouraged to voice their preference.
>> I lean towards Option 3.
>>
>> - Wing Yew
>>
>> ps. In the sync, as Steven Wu wrote, the question was raised if the same
>> multi-version support strategy can be adopted across engines. Based on what
>> Steven wrote, currently the Flink developer community's bandwidth makes
>> supporting only a single Flink version (and focusing resources on
>> developing new features on that version) the preferred choice. If so, then
>> no multi-version support strategy for Flink is needed at this time.
>>
>>
>> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu  wrote:
>>
>>> During the sync meeting, people talked about if and how we can have the
>>> same version support model across engines like Flink and Spark. I can
>>> provide some input from the Flink side.
>>>
>>> Flink only supports two minor versions. E.g., right now Flink 1.13 is
>>> the latest released version. That means only Flink 1.12 and 1.13 are
>>> supported. Feature changes or bug fixes will only be backported to 1.12 and
>>> 1.13, unless it is a serious bug (like security). With that context,
>>> personally I like option 1 (with one actively supported Flink version in
>>> master branch) for the iceberg-flink module.
>>>
>>> We discussed the idea of supporting multiple Flink versions via shim
>>> layer and multiple modules. While it may be a little better to support
>>> multiple Flink versions, I don't know if there is enough support and
>>> resources from the community to pull it off. Also the ongoing maintenance
>>> burden for each minor version release from Flink, which happens roughly
>>> every 4 months.
>>>
>>>
>>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary 
>>> wrote:
>>>
>>>> Since you mentioned Hive, I chime in with what we do there. You might
>>>> find it useful:
>>>> - metastore module - only small differences - DynConstructor solves for
>>>> us
>>>> - mr module - some bigger differences, but still manageable for Hive
>>>> 2-3. Need some new classes, but most of the code is reused - extra module
>>>> for Hive 3. For Hive 4 we use a different repo as we moved to the Hive
>>>> codebase.
>>>>
>>>> My thoughts based on the above experience:
>>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly
>>>> have problems with backporting changes between repos and we are slacking
>>>> behind which hurts both projects
>>>> - Hive 2-3 model is working better by f

Re: Help improve Iceberg community meeting experience

2021-10-22 Thread Wing Yew Poon
I have no concerns with Tabular hosting and recording the meetings. I'm in
favor of having the meetings recorded and the recordings available.
- Wing Yew


On Fri, Oct 22, 2021 at 1:59 PM John Zhuge  wrote:

> +1
>
> It will be great to catch up on the meetings missed.
>
> On Fri, Oct 22, 2021 at 12:16 PM Yufei Gu  wrote:
>
>> +1 for recording the meetings. It's especially valuable for the design
>> discussions. I'd suggest adding something like "this meeting will be
>> recorded'' to the event when people send out the invitation.
>>
>> Best,
>>
>> Yufei
>>
>> `This is not a contribution`
>>
>>
>> On Fri, Oct 22, 2021 at 12:05 PM Sam Redai  wrote:
>>
>>> Thanks for raising this Jack! If it's ok with everyone, we can host and
>>> record the google meets via our Tabular account. I'll volunteer to set up
>>> and maintain this as well as uploading the recordings.
>>>
>>> -Sam
>>>
>>> On Fri, Oct 22, 2021 at 11:54 AM Jack Ye  wrote:
>>>
 Hi everyone,

 Recently we have been hosting an increasing number of meetings for
 design discussions and community syncs as more people are getting
 interested in Iceberg and start to contribute exciting features. Right now
 we are relying on individuals to send out meeting invites using free apps,
 but we are restricted by the limitations of those apps.

 I wonder if there is a way for us to leverage a paid remote meeting
 service through Apache foundation or any other organization who is willing
 to sponsor such community meetings. This would allow us to have the
 following benefits:

 1. We are no longer restricted by the 1 hour time limit for Google Meet
 (40 min for Zoom).

 2. We can record the entire meeting and publish it to sites like
 YouTube, so that people who cannot join the meeting can review the entire
 content instead of just meeting note summary.

 This would be very beneficial given the fact that we have pretty big
 communities in US, Europe and Asia time zones, and most meetings can only
 satisfy 2 time zones at best.

 I have asked AWS internally but we can only offer free use of AWS
 Chime, which is not a very popular choice and would probably result in
 fewer people joining the meetings.

 Any thoughts around this area?

 Best,
 Jack Ye




>
> --
> John Zhuge
>


Re: Meeting Minutes from 10/20 Iceberg Sync

2021-10-25 Thread Wing Yew Poon
>
> Adding v3.2 to Spark Build Refactoring
>
> - Russell and Anton will coordinate on dropping in a Spark 3.2 module
> - We currently have 3.1 in the `spark3` module. We’ll move that out to
> its own module and mirror what we do with the 3.2 module. (This will enable
> cleaning up some mixed 3.0/3.1 code)
>
Hi,
I'm sorry I missed the last sync and only have these meeting minutes to go
by.
A Spark 3.2 module has now been added. Is the plan still to add a Spark 3.1
module. Will we have v3.0, v3.1 and v3.2 subdirectories under spark/ ?
I think when we first started discussing the issue for Spark 3 support and
how to organize the code, the proposal was to support two versions?
IMO, for maintainability, we should only support two versions of Spark 3.
However, in this transition period, I can see two approaches:
1. Create a v3.1 subdirectory, remove the reflection workarounds for its
code, add explicit 3.1-specific modules, and build and test against 3.1. We
then have 3 Spark 3 versions. At the next release, deprecate Spark 3.0
support and remove the v3.0 directory and its modules.
2. Support Spark 3.1 and 3.0 from the common 3.0-based code. At the next
release, deprecate Spark 3.0 support, rename v3.0 to v3.1, and update its
code to remove the reflection workarounds.
As I said, I missed the meeting. Perhaps 1 is the plan that was decided?
(If it is, I'm willing to take on the work. I just need to know the plan.)
Thanks,
Wing Yew


Re: Meeting Minutes from 10/20 Iceberg Sync

2021-10-26 Thread Wing Yew Poon
Thanks Sam. Was there also agreement to deprecate Spark 3.0 support and go
with supporting the latest 2 versions of Spark 3?


On Tue, Oct 26, 2021 at 11:36 AM Sam Redai  wrote:

> If I remember correctly, we landed on option 1, creating a v3.1 without
> the extra reflection logic and then just deprecating 3.0 when the time
> comes. If everyone agrees with that I can amend the notes to describe that
> more explicitly.
>
> -Sam
>
> On Mon, Oct 25, 2021 at 11:30 AM Wing Yew Poon 
> wrote:
>
>> Adding v3.2 to Spark Build Refactoring
>>>
>>>-
>>>
>>>Russell and Anton will coordinate on dropping in a Spark 3.2 module
>>>-
>>>
>>>We currently have 3.1 in the `spark3` module. We’ll move that out to
>>>its own module and mirror what we do with the 3.2 module. (This will 
>>> enable
>>>cleaning up some mixed 3.0/3.1 code)
>>>
>>> Hi,
>> I'm sorry I missed the last sync and only have these meeting minutes to
>> go by.
>> A Spark 3.2 module has now been added. Is the plan still to add a Spark
>> 3.1 module. Will we have v3.0, v3.1 and v3.2 subdirectories under spark/ ?
>> I think when we first started discussing the issue for Spark 3 support
>> and how to organize the code, the proposal was to support two versions?
>> IMO, for maintainability, we should only support two versions of Spark 3.
>> However, in this transition period, I can see two approaches:
>> 1. Create a v3.1 subdirectory, remove the reflection workarounds for its
>> code, add explicit 3.1-specific modules, and build and test against 3.1. We
>> then have 3 Spark 3 versions. At the next release, deprecate Spark 3.0
>> support and remove the v3.0 directory and its modules.
>> 2. Support Spark 3.1 and 3.0 from the common 3.0-based code. At the next
>> release, deprecate Spark 3.0 support, rename v3.0 to v3.1, and update its
>> code to remove the reflection workarounds.
>> As I said, I missed the meeting. Perhaps 1 is the plan that was decided?
>> (If it is, I'm willing to take on the work. I just need to know the plan.)
>> Thanks,
>> Wing Yew
>>
>>
>>


Re: Meeting Minutes from 10/20 Iceberg Sync

2021-10-26 Thread Wing Yew Poon
My impression came from when Anton proposed the following in the earlier
thread on Spark version support strategy:

*Option 3 (separate modules for Spark 3.1/3.2)*
> Introduce more modules in the same project.
> Can consume unreleased changes but it will require at least 5 modules to
> support 2.4, 3.1 and 3.2, making the build and testing complicated.


I think the concern was the number of modules that would be needed.
I understand the point about supporting older versions as long as there is
sufficient user base on those versions. Nevertheless having 3 versions of
Spark 3 code to check changes into is a maintenance burden.


On Tue, Oct 26, 2021 at 12:37 PM Ryan Blue  wrote:

> I don't recall there being a consensus to deprecate all but 2 versions of
> Spark. I think the confusion may be because that's what Flink versions are
> supported in that community. For Spark, I think we will need to support
> older versions until most people are able to move off of them, which can
> take a long time. But as versions age, we should definitely try to spend
> less time maintaining them! Hopefully our new structure helps us get to
> that point.
>
> On Tue, Oct 26, 2021 at 12:08 PM Wing Yew Poon 
> wrote:
>
>> Thanks Sam. Was there also agreement to deprecate Spark 3.0 support and
>> go with supporting the latest 2 versions of Spark 3?
>>
>>
>> On Tue, Oct 26, 2021 at 11:36 AM Sam Redai  wrote:
>>
>>> If I remember correctly, we landed on option 1, creating a v3.1 without
>>> the extra reflection logic and then just deprecating 3.0 when the time
>>> comes. If everyone agrees with that I can amend the notes to describe that
>>> more explicitly.
>>>
>>> -Sam
>>>
>>> On Mon, Oct 25, 2021 at 11:30 AM Wing Yew Poon
>>>  wrote:
>>>
>>>> Adding v3.2 to Spark Build Refactoring
>>>>>
>>>>>-
>>>>>
>>>>>Russell and Anton will coordinate on dropping in a Spark 3.2 module
>>>>>-
>>>>>
>>>>>We currently have 3.1 in the `spark3` module. We’ll move that out
>>>>>to its own module and mirror what we do with the 3.2 module. (This will
>>>>>enable cleaning up some mixed 3.0/3.1 code)
>>>>>
>>>>> Hi,
>>>> I'm sorry I missed the last sync and only have these meeting minutes to
>>>> go by.
>>>> A Spark 3.2 module has now been added. Is the plan still to add a Spark
>>>> 3.1 module. Will we have v3.0, v3.1 and v3.2 subdirectories under spark/ ?
>>>> I think when we first started discussing the issue for Spark 3 support
>>>> and how to organize the code, the proposal was to support two versions?
>>>> IMO, for maintainability, we should only support two versions of Spark
>>>> 3. However, in this transition period, I can see two approaches:
>>>> 1. Create a v3.1 subdirectory, remove the reflection workarounds for
>>>> its code, add explicit 3.1-specific modules, and build and test against
>>>> 3.1. We then have 3 Spark 3 versions. At the next release, deprecate Spark
>>>> 3.0 support and remove the v3.0 directory and its modules.
>>>> 2. Support Spark 3.1 and 3.0 from the common 3.0-based code. At the
>>>> next release, deprecate Spark 3.0 support, rename v3.0 to v3.1, and update
>>>> its code to remove the reflection workarounds.
>>>> As I said, I missed the meeting. Perhaps 1 is the plan that was
>>>> decided? (If it is, I'm willing to take on the work. I just need to know
>>>> the plan.)
>>>> Thanks,
>>>> Wing Yew
>>>>
>>>>
>>>>
>
> --
> Ryan Blue
> Tabular
>


Re: Standard practices around PRs against multiple Spark versions

2021-11-03 Thread Wing Yew Poon
I wasn't aware that we were standardizing on such a practice. I don't have
a strong opinion on making changes one Spark version at a time or all at
once. I think committers who do reviews regularly should decide. My only
concern with making changes one version at a time is follow-through on the
part of the contributor. We want to ensure that a change/fix applicable to
multiple versions gets into all of them. Reviewer bandwidth is incurred
either way. (For simple changes/fixes, perhaps all at once does save
reviewer bandwidth, so we may want to be flexible.)

On Wed, Nov 3, 2021 at 5:52 PM Jack Ye  wrote:

> Thanks for bringing this up Kyle!
>
> My personal view is the following:
> 1. For new features, it should be very clear that we always implement them
> against the latest version. At the same time, I suggest we create an issue
> to track backport, so that if anyone is interested in backport he/she can
> work on it separately. We can tag these issues based on title names (e.g.
> "Backport: xxx" as title), and these are also good issues for new
> contributors to work on because there is already reference implementation
> in a newer version.
> 2. For bug fixes, I understand sometimes it's just a one line fix and
> people will try to just fix across versions. My take is that we should try
> to advocate for fixing 1 version and open an issue for other versions
> although it does not really need to be enforced as strictly. Sometimes even
> a 1 line fix has serious implications for different versions and might
> break stuff unintentionally. it's better that people that have production
> dependency on the specific version carefully review and test changes before
> merging.
>
> About enforcement strategy, I suggest we start to create a template for
> Github Issues and PRs
> ,
> where we state the guidelines related to engine versions, as well as
> Iceberg's preferred code style, naming convention, title convention, etc.
> to make new contributors a bit easier to submit changes without too much
> rewrite. Currently I observe that every time there is a new contributor, we
> need to state all the guidelines through PR review, which causes quite a
> lot of time spent on rewriting the code and also reduces the motivation for
> people to continue work on the PR.
>
> Best,
> Jack Ye
>
>
>
>
>
> On Wed, Nov 3, 2021 at 4:13 PM Kyle Bendickson  wrote:
>
>> I submitted a PR to fix a Spark bug today, applying the same changes to
>> all eligible Spark versions.
>>
>> Jack mentioned that he thought the practice going forward was to fix /
>> apply changes on the latest Spark version in one PR, and then open a second
>> PR to backport the fixes (presumably to minimize review overhead).
>>
>> Do we have a standard / preference on that? Jack mentioned he wasn't
>> certain, so I thought I'd ask here.
>>
>> Seems like a good practice but hoping to get some clarification :)
>>
>> --
>> Best,
>> Kyle Bendickson
>> Github: @kbendick
>>
>


Re: Identifying the schema of an Iceberg snapshot

2021-11-08 Thread Wing Yew Poon
I am surprised that schema-id is optional for a v2 snapshot.
I believe that the implementation now already writes a schema-id for both
v1 and v2 snapshots. Of course, snapshots written before schema-id was
added do not have it.
I am working on using the appropriate schema when reading a
snapshot in Spark. It is implemented for Spark 2. It is as you understand
it -- get the schema-id for the snapshot, and look up the schema by
schema-id from the schemas. It will be implemented for Spark 3 too, but
there are some technical complications that need to be resolved first. I
also had a fallback -- if the schema-id is null, then we will look through
the history to find the metadata for the snapshot and read the schema from
there. The fallback was removed from my original PR but will be submitted
as a separate change.
The current behavior (and the behavior in Spark 2 before my change) is to
use the current schema when reading any snapshot.
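
For illustration, here is a rough sketch of that lookup against the Iceberg
Table API. This is not the code in my PR; the helper name is made up, and the
else branch simply stands in for the history-walking fallback described above:

import org.apache.iceberg.{Schema, Table}

// `table` can come from any Iceberg catalog, e.g. catalog.loadTable(...)
def schemaForSnapshot(table: Table, snapshotId: Long): Schema = {
  val snapshot = table.snapshot(snapshotId)
  // schemaId() may be null for snapshots written before schema-id was tracked
  val schemaId = snapshot.schemaId()
  if (schemaId != null) {
    table.schemas().get(schemaId)  // schema recorded when the snapshot was created
  } else {
    table.schema()  // no schema-id recorded; fall back (here, to the current schema)
  }
}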




On Sun, Nov 7, 2021 at 10:01 PM Vivekanand Vellanki 
wrote:

> Hi,
>
> I am trying to understand how to identify the schema for an Iceberg
> snapshot.
>
> Looking at the spec, I see the following:
> Snapshots
>
> A snapshot consists of the following fields:
> v1v2FieldDescription
> *required* *required* snapshot-id A unique long ID
> *optional* *optional* parent-snapshot-id The snapshot ID of the
> snapshot’s parent. Omitted for any snapshot with no parent
> *required* sequence-number A monotonically increasing long that tracks
> the order of changes to a table
> *required* *required* timestamp-ms A timestamp when the snapshot was
> created, used for garbage collection and table inspection
> *optional* *required* manifest-list The location of a manifest list for
> this snapshot that tracks manifest files with additional meadata
> *optional* manifests A list of manifest file locations. Must be omitted
> if manifest-list is present
> *optional* *required* summary A string map that summarizes the snapshot
> changes, including operation (see below)
> *optional* *optional* schema-id ID of the table’s current schema when the
> snapshot was createdAlso the table metadata portion of the spec says the
> following:
> v1v2FieldDescription
> *optional* *required* schemas A list of schemas, stored as objects with
> schema-id.
> For a v2 Iceberg table, my understanding is that the reader needs to do
> the following to figure out the schema of a snapshot:
>
>- Read the schema-id for the snapshot
>- Use the schemas field from the table metadata and find the schema
>corresponding to the snapshot's schema-id
>
> Since schema-id is optional in V2 for a given snapshot, is this the
> correct approach? How does this work, if the schema-id field is missing?
>
> For a V1 Iceberg table, how do we determine the schema of a particular
> snapshot?
>
> Thanks
> Vivek
>
>


Re: Identifying the schema of an Iceberg snapshot

2021-11-08 Thread Wing Yew Poon
There is logic needed in both core Iceberg (in BaseTableScan and
DataTableScan) and in each engine.


On Mon, Nov 8, 2021 at 9:17 AM Vivekanand Vellanki  wrote:

> I am surprised that the logic of obtaining the schema for a snapshot is
> implemented in Spark2 and Spark3. Shouldn't this be part of Iceberg APIs?
> Basically, the Snapshot object has an API that returns the schema of the
> snapshot.
>
> On Mon, Nov 8, 2021 at 10:24 PM Wing Yew Poon 
> wrote:
>
>> I am surprised that schema-id is optional for a v2 snapshot.
>> I believe that the implementation now already writes a schema-id for both
>> v1 and v2 snapshots. Of course, snapshots written before schema-id was
>> added do not have it.
>> I am working on implementing using the appropriate schema when reading a
>> snapshot in Spark. It is implemented for Spark 2. It is as you understand
>> it -- get the schema-id for the snapshot, and look up the schema by
>> schema-id from the schemas. It will be implemented for Spark 3 too, but
>> there are some technical complications that need to be resolved first. I
>> also had a fallback -- if the schema-id is null, then we will look through
>> the history to find the metadata for the snapshot and read the schema from
>> there. The fallback was removed from my original PR but will be submitted
>> as a separate change.
>> The current behavior (and the behavior in Spark 2 before my change) is to
>> use the.current schema when reading any snapshot.
>>
>>
>>
>>
>> On Sun, Nov 7, 2021 at 10:01 PM Vivekanand Vellanki 
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to understand how to identify the schema for an Iceberg
>>> snapshot.
>>>
>>> Looking at the spec, I see the following:
>>> Snapshots
>>>
>>> A snapshot consists of the following fields:
>>> v1v2FieldDescription
>>> *required* *required* snapshot-id A unique long ID
>>> *optional* *optional* parent-snapshot-id The snapshot ID of the
>>> snapshot’s parent. Omitted for any snapshot with no parent
>>> *required* sequence-number A monotonically increasing long that tracks
>>> the order of changes to a table
>>> *required* *required* timestamp-ms A timestamp when the snapshot was
>>> created, used for garbage collection and table inspection
>>> *optional* *required* manifest-list The location of a manifest list for
>>> this snapshot that tracks manifest files with additional meadata
>>> *optional* manifests A list of manifest file locations. Must be omitted
>>> if manifest-list is present
>>> *optional* *required* summary A string map that summarizes the snapshot
>>> changes, including operation (see below)
>>> *optional* *optional* schema-id ID of the table’s current schema when
>>> the snapshot was createdAlso the table metadata portion of the spec
>>> says the following:
>>> v1v2FieldDescription
>>> *optional* *required* schemas A list of schemas, stored as objects with
>>> schema-id.
>>> For a v2 Iceberg table, my understanding is that the reader needs to do
>>> the following to figure out the schema of a snapshot:
>>>
>>>- Read the schema-id for the snapshot
>>>- Use the schemas field from the table metadata and find the schema
>>>corresponding to the snapshot's schema-id
>>>
>>> Since schema-id is optional in V2 for a given snapshot, is this the
>>> correct approach? How does this work, if the schema-id field is missing?
>>>
>>> For a V1 Iceberg table, how do we determine the schema of a particular
>>> snapshot?
>>>
>>> Thanks
>>> Vivek
>>>
>>>


Re: Identifying the schema of an Iceberg snapshot

2021-11-08 Thread Wing Yew Poon
The fallback logic I mentioned will be in core Iceberg.


On Mon, Nov 8, 2021 at 9:35 AM Wing Yew Poon  wrote:

> There is logic needed in both core Iceberg (in BaseTableScan and
> DataTableScan) and in each engine.
>
>
> On Mon, Nov 8, 2021 at 9:17 AM Vivekanand Vellanki 
> wrote:
>
>> I am surprised that the logic of obtaining the schema for a snapshot is
>> implemented in Spark2 and Spark3. Shouldn't this be part of Iceberg APIs?
>> Basically, the Snapshot object has an API that returns the schema of the
>> snapshot.
>>
>> On Mon, Nov 8, 2021 at 10:24 PM Wing Yew Poon 
>> wrote:
>>
>>> I am surprised that schema-id is optional for a v2 snapshot.
>>> I believe that the implementation now already writes a schema-id for
>>> both v1 and v2 snapshots. Of course, snapshots written before schema-id was
>>> added do not have it.
>>> I am working on implementing using the appropriate schema when reading a
>>> snapshot in Spark. It is implemented for Spark 2. It is as you understand
>>> it -- get the schema-id for the snapshot, and look up the schema by
>>> schema-id from the schemas. It will be implemented for Spark 3 too, but
>>> there are some technical complications that need to be resolved first. I
>>> also had a fallback -- if the schema-id is null, then we will look through
>>> the history to find the metadata for the snapshot and read the schema from
>>> there. The fallback was removed from my original PR but will be submitted
>>> as a separate change.
>>> The current behavior (and the behavior in Spark 2 before my change) is
>>> to use the.current schema when reading any snapshot.
>>>
>>>
>>>
>>>
>>> On Sun, Nov 7, 2021 at 10:01 PM Vivekanand Vellanki 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to understand how to identify the schema for an Iceberg
>>>> snapshot.
>>>>
>>>> Looking at the spec, I see the following:
>>>> Snapshots
>>>>
>>>> A snapshot consists of the following fields:
>>>> v1v2FieldDescription
>>>> *required* *required* snapshot-id A unique long ID
>>>> *optional* *optional* parent-snapshot-id The snapshot ID of the
>>>> snapshot’s parent. Omitted for any snapshot with no parent
>>>> *required* sequence-number A monotonically increasing long that tracks
>>>> the order of changes to a table
>>>> *required* *required* timestamp-ms A timestamp when the snapshot was
>>>> created, used for garbage collection and table inspection
>>>> *optional* *required* manifest-list The location of a manifest list
>>>> for this snapshot that tracks manifest files with additional meadata
>>>> *optional* manifests A list of manifest file locations. Must be
>>>> omitted if manifest-list is present
>>>> *optional* *required* summary A string map that summarizes the
>>>> snapshot changes, including operation (see below)
>>>> *optional* *optional* schema-id ID of the table’s current schema when
>>>> the snapshot was createdAlso the table metadata portion of the spec
>>>> says the following:
>>>> v1v2FieldDescription
>>>> *optional* *required* schemas A list of schemas, stored as objects
>>>> with schema-id.
>>>> For a v2 Iceberg table, my understanding is that the reader needs to do
>>>> the following to figure out the schema of a snapshot:
>>>>
>>>>- Read the schema-id for the snapshot
>>>>- Use the schemas field from the table metadata and find the schema
>>>>corresponding to the snapshot's schema-id
>>>>
>>>> Since schema-id is optional in V2 for a given snapshot, is this the
>>>> correct approach? How does this work, if the schema-id field is missing?
>>>>
>>>> For a V1 Iceberg table, how do we determine the schema of a particular
>>>> snapshot?
>>>>
>>>> Thanks
>>>> Vivek
>>>>
>>>>


publish snapshot to maven workflow

2021-11-08 Thread Wing Yew Poon
Hi,
I know that there is a github workflow to publish snapshot to maven. This
workflow fails in my fork of the Iceberg repo (I imagine because I don't
have permissions). How are folks dealing with this? I just don't need to
receive daily emails that the workflow failed.
Thanks,
Wing Yew


Re: Welcome new PMC members!

2021-11-17 Thread Wing Yew Poon
Congratulations Jack and Russell! Well done, and well deserved.
- Wing Yew


On Wed, Nov 17, 2021 at 4:13 PM Kyle Bendickson  wrote:

> Congratulations to both Jack and Russell!
>
> Very we deserved indeed :)
>
> On Wed, Nov 17, 2021 at 4:12 PM Ryan Blue  wrote:
>
>> Hi everyone, I want to welcome Jack Ye and Russell Spitzer to the Iceberg
>> PMC. They've both been amazing at reviewing and helping people in the
>> community and the PMC has decided to invite them to join. Congratulations,
>> Jack and Russell! Thank you for all your hard work and support for the
>> project.
>>
>> Ryan
>>
>> --
>> Ryan Blue
>>
>


Re: Welcome Szehon Ho as a committer!

2022-03-11 Thread Wing Yew Poon
Congratulations Szehon!


On Fri, Mar 11, 2022 at 3:42 PM Sam Redai  wrote:

> Congrats Szehon!
>
> On Fri, Mar 11, 2022 at 6:41 PM Yufei Gu  wrote:
>
>> Congratulations Szehon!
>> Best,
>>
>> Yufei
>>
>> `This is not a contribution`
>>
>>
>> On Fri, Mar 11, 2022 at 3:36 PM Ryan Blue  wrote:
>>
>>> Congratulations Szehon!
>>>
>>> Sorry I accidentally preempted this announcement with the board report!
>>>
>>> On Fri, Mar 11, 2022 at 3:32 PM Anton Okolnychyi
>>>  wrote:
>>>
 Hey everyone,

 I would like to welcome Szehon Ho as a new committer to the project!

 Thanks for all your work, Szehon!

 - Anton

>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>


Re: Hive 4.0.0-alpha-1 release is available with Iceberg integration

2022-04-07 Thread Wing Yew Poon
Congratulations on the release, making available this functionality for
Hive users!


On Thu, Apr 7, 2022 at 9:11 AM Peter Vary 
wrote:

> Hi Team,
>
> I would like to let you know that the Hive team released Hive
> 4.0.0-alpha-1.
>
> Using this release it is possible to create, read, write Iceberg V1 tables
> with Hive. There are some rough edges there but most of the queries,
> functions should be working.
>
> Just some examples:
>
> CREATE EXTERNAL TABLE ice_t (s STRING, i INT, j INT) PARTITIONED BY SPEC
>   (TRUNCATE(1, s)) STORED BY ICEBERG STORED AS ORC;
> INSERT INTO ice_t VALUES ('hive', 4, 4);
> ALTER TABLE ice_t SET PARTITION SPEC (s);
> INSERT INTO ice_t VALUES ('impala', 5, 5);
> SELECT * FROM default.ice_t.entries;
> SELECT * FROM default.ice_t.files;
> SELECT * FROM default.ice_t.history;
> SELECT * FROM ice_t FOR SYSTEM_TIME AS OF '2022-02-14 12:41:50';
> SELECT * FROM ice_t FOR SYSTEM_VERSION AS OF 5074093329028989995;
>
>
> If you have time feel free to check this out, and any feedback is welcome.
>
> Thanks,
> Peter
>
>


Re: Problem with partitioned table creation in scala

2022-05-27 Thread Wing Yew Poon
That is a typo in the sample code. The doc itself (
https://iceberg.apache.org/docs/latest/spark-writes/#creating-tables) says:
"Create and replace operations support table configuration methods, like
partitionedBy and tableProperty"
You could also have looked up the API in Spark documentation:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameWriterV2.html
There you would have found that the method is partitionedBy, not
partitionBy.
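
Something like this should compile (a quick sketch, assuming Spark 3.x; the
days and col functions come from org.apache.spark.sql.functions, which also
explains the "not found: value days" error you saw):

import org.apache.spark.sql.functions.{col, days}

// Same as your snippet, with the method name corrected to partitionedBy.
df_c.writeTo(output_table)
  .partitionedBy(days(col("last_updated")))
  .createOrReplace()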

- Wing Yew


On Fri, May 27, 2022 at 4:32 AM Saulius Pakalka
 wrote:

> Hi,
>
> I am trying to create partitioned iceberg table using scala code below
> based on example in docs.
>
> df_c.writeTo(output_table)
>   .partitionBy(days(col("last_updated")))
>   .createOrReplace()
>
> However, this code does not compile and throws two errors:
>
> value partitionBy is not a member of
> org.apache.spark.sql.DataFrameWriterV2[org.apache.spark.sql.Row]
> [error] possible cause: maybe a semicolon is missing before `value
> partitionBy'?
> [error]   .partitionBy(days(col("last_updated")))
> [error]^
> [error]  not found: value days
> [error]   .partitionBy(days(col("last_updated")))
> [error]^
> [error] two errors found
>
> Not sure where to look for a problem. Any advice appreciated.
>
> Best regards,
>
> Saulius Pakalka
>
>


Re: Problem with partitioned table creation in scala

2022-05-27 Thread Wing Yew Poon
One other note:
When creating the table, you need `using("iceberg")`. The example should
read

data.writeTo("prod.db.table")
.using("iceberg")
.tableProperty("write.format.default", "orc")
.partitionedBy($"level", days($"ts"))
.createOrReplace()

- Wing Yew


On Fri, May 27, 2022 at 11:29 AM Wing Yew Poon  wrote:

> That is a typo in the sample code. The doc itself (
> https://iceberg.apache.org/docs/latest/spark-writes/#creating-tables)
> says:
> "Create and replace operations support table configuration methods, like
> partitionedBy and tableProperty"
> You could also have looked up the API in Spark documentation:
>
> https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameWriterV2.html
> There you would have found that the method is partitionedBy, not
> partitionBy.
>
> - Wing Yew
>
>
> On Fri, May 27, 2022 at 4:32 AM Saulius Pakalka
>  wrote:
>
>> Hi,
>>
>> I am trying to create partitioned iceberg table using scala code below
>> based on example in docs.
>>
>> df_c.writeTo(output_table)
>>   .partitionBy(days(col("last_updated")))
>>   .createOrReplace()
>>
>> However, this code does not compile and throws two errors:
>>
>> value partitionBy is not a member of
>> org.apache.spark.sql.DataFrameWriterV2[org.apache.spark.sql.Row]
>> [error] possible cause: maybe a semicolon is missing before `value
>> partitionBy'?
>> [error]   .partitionBy(days(col("last_updated")))
>> [error]^
>> [error]  not found: value days
>> [error]   .partitionBy(days(col("last_updated")))
>> [error]^
>> [error] two errors found
>>
>> Not sure where to look for a problem. Any advice appreciated.
>>
>> Best regards,
>>
>> Saulius Pakalka
>>
>>


Re: Problem with partitioned table creation in scala

2022-05-27 Thread Wing Yew Poon
The partitionedBy typo in the doc is already fixed in the master branch of
the Iceberg repo.
I filed a PR to add `using("iceberg")` to the `writeTo` examples for
creating a table (if you want to create an *Iceberg* table).

On Fri, May 27, 2022 at 12:58 PM Wing Yew Poon  wrote:

> One other note:
> When creating the table, you need `using("iceberg")`. The example should
> read
>
> data.writeTo("prod.db.table")
> .using("iceberg")
> .tableProperty("write.format.default", "orc")
> .partitionedBy($"level", days($"ts"))
> .createOrReplace()
>
> - Wing Yew
>
>
> On Fri, May 27, 2022 at 11:29 AM Wing Yew Poon 
> wrote:
>
>> That is a typo in the sample code. The doc itself (
>> https://iceberg.apache.org/docs/latest/spark-writes/#creating-tables)
>> says:
>> "Create and replace operations support table configuration methods, like
>> partitionedBy and tableProperty"
>> You could also have looked up the API in Spark documentation:
>>
>> https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameWriterV2.html
>> There you would have found that the method is partitionedBy, not
>> partitionBy.
>>
>> - Wing Yew
>>
>>
>> On Fri, May 27, 2022 at 4:32 AM Saulius Pakalka
>>  wrote:
>>
>>> Hi,
>>>
>>> I am trying to create partitioned iceberg table using scala code below
>>> based on example in docs.
>>>
>>> df_c.writeTo(output_table)
>>>   .partitionBy(days(col("last_updated")))
>>>   .createOrReplace()
>>>
>>> However, this code does not compile and throws two errors:
>>>
>>> value partitionBy is not a member of
>>> org.apache.spark.sql.DataFrameWriterV2[org.apache.spark.sql.Row]
>>> [error] possible cause: maybe a semicolon is missing before `value
>>> partitionBy'?
>>> [error]   .partitionBy(days(col("last_updated")))
>>> [error]^
>>> [error]  not found: value days
>>> [error]   .partitionBy(days(col("last_updated")))
>>> [error]^
>>> [error] two errors found
>>>
>>> Not sure where to look for a problem. Any advice appreciated.
>>>
>>> Best regards,
>>>
>>> Saulius Pakalka
>>>
>>>


Re: Welcome Yufei Gu as a committer

2022-08-25 Thread Wing Yew Poon
Congratulations, Yufei!


On Thu, Aug 25, 2022 at 4:23 PM Sam Redai  wrote:

> Congrats Yufei! 🎉
>
> On Thu, Aug 25, 2022 at 7:20 PM Anton Okolnychyi
>  wrote:
>
>> I’d like to welcome Yufei Gu as a committer to the project.
>>
>> Thanks for all your hard work, Yufei!
>>
>> - Anton
>
> --
>
> Sam Redai 
>
> Developer Advocate  |  Tabular 
>


Re: Proposal - Priority based commit ordering on partitions

2022-10-03 Thread Wing Yew Poon
Hi Prashant,
I am very interested in this proposal and would like to attend this
meeting. Friday October 7 is fine with me; I can do 9 pm Pacific Time if
that is what works for you (I don't know what time zone you're in),
although any time between 2 and 6 pm would be more convenient.
Thanks,
Wing Yew


On Mon, Oct 3, 2022 at 11:58 AM Prashant Singh 
wrote:

> Thanks Ryan,
>
> Should I go ahead and schedule this somewhere around 10/7 9:00 PM PST,
> will it work ?
>
> Regards,
> Prashant Singh
>
> On Fri, Sep 30, 2022 at 9:21 PM Ryan Blue  wrote:
>
>> Prashant, great to see the PR for rollback on conflict! I'll take a look
>> at that one. Friday 10/7 after 1:30 PM works for me. Looking forward to the
>> discussion!
>>
>> On Fri, Sep 30, 2022 at 6:38 AM Prashant Singh 
>> wrote:
>>
>>> Hello folks,
>>>
>>> I was planning to host a discussion on this proposal
>>> 
>>> somewhere around late next week.
>>>
>>> Please let me know your availability if you are interested in attending
>>> the same, will schedule the meeting (online) accordingly.
>>>
>>> Meanwhile I have a PR  out
>>> as well, to rollback compaction on conflict detection (an approach that
>>> came up as an alternative to the proposal in sync). Appreciate your
>>> feedback here as well.
>>>
>>> Regards,
>>> Prashant Singh
>>>
>>> On Wed, Aug 17, 2022 at 6:25 PM Prashant Singh 
>>> wrote:
>>>
 Hello all,

 We have been working on a proposal [link
 ]
 to determine the precedence between two or more concurrently running jobs,
 in case of conflicts.

 Please take some time to review the proposal.

 We would appreciate any feedback on this from the community!

 Thanks,
 Prashant Singh

>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>


Re: Proposal - Priority based commit ordering on partitions

2022-10-03 Thread Wing Yew Poon
Prashant, just saw Jack's post mentioning that you're in India Time.
Obviously daytime Pacific is not convenient for you. I'm fine with 9 pm
Pacific.


On Mon, Oct 3, 2022 at 12:09 PM Wing Yew Poon  wrote:

> Hi Prashant,
> I am very interested in this proposal and would like to attend this
> meeting. Friday October 7 is fine with me; I can do 9 pm Pacific Time if
> that is what works for you (I don't know what time zone you're in),
> although any time between 2 and 6 pm would be more convenient.
> Thanks,
> Wing Yew
>
>
> On Mon, Oct 3, 2022 at 11:58 AM Prashant Singh 
> wrote:
>
>> Thanks Ryan,
>>
>> Should I go ahead and schedule this somewhere around 10/7 9:00 PM PST,
>> will it work ?
>>
>> Regards,
>> Prashant Singh
>>
>> On Fri, Sep 30, 2022 at 9:21 PM Ryan Blue  wrote:
>>
>>> Prashant, great to see the PR for rollback on conflict! I'll take a look
>>> at that one. Friday 10/7 after 1:30 PM works for me. Looking forward to the
>>> discussion!
>>>
>>> On Fri, Sep 30, 2022 at 6:38 AM Prashant Singh 
>>> wrote:
>>>
>>>> Hello folks,
>>>>
>>>> I was planning to host a discussion on this proposal
>>>> <https://docs.google.com/document/d/1pSqxf5A59J062j9VFF5rcCpbW9vdTbBKTmjps80D-B0/edit>
>>>> somewhere around late next week.
>>>>
>>>> Please let me know your availability if you are interested in attending
>>>> the same, will schedule the meeting (online) accordingly.
>>>>
>>>> Meanwhile I have a PR <https://github.com/apache/iceberg/pull/5888>
>>>> out as well, to rollback compaction on conflict detection (an approach that
>>>> came up as an alternative to the proposal in sync). Appreciate your
>>>> feedback here as well.
>>>>
>>>> Regards,
>>>> Prashant Singh
>>>>
>>>> On Wed, Aug 17, 2022 at 6:25 PM Prashant Singh <
>>>> prashant010...@gmail.com> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> We have been working on a proposal [link
>>>>> <https://docs.google.com/document/d/1pSqxf5A59J062j9VFF5rcCpbW9vdTbBKTmjps80D-B0/edit#>]
>>>>> to determine the precedence between two or more concurrently running jobs,
>>>>> in case of conflicts.
>>>>>
>>>>> Please take some time to review the proposal.
>>>>>
>>>>> We would appreciate any feedback on this from the community!
>>>>>
>>>>> Thanks,
>>>>> Prashant Singh
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>


Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Wing Yew Poon
First, thank you all for your responses to my question.

For Peter's question, I believe that (b) is the correct behavior. It is
also the current behavior when using copy-on-write (deletes and updates are
still supported but not using delete files). A changelog scan is an
incremental scan over multiple snapshots. It should emit changes for each
snapshot in the requested range. Spark provides additional functionality on
top of the changelog scan, to produce net changes for the requested range.
See
https://iceberg.apache.org/docs/latest/spark-procedures/#create_changelog_view.
Basically the create_changelog_view procedure uses a changelog scan (read
the changelog table, i.e., .changes) to get a DataFrame which is
saved to a temporary Spark view which can then be queried; if net_changes
is true, only the net changes are produced for this temporary view. This
functionality uses ChangelogIterator.removeNetCarryovers (which is in
Spark).
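
As a rough illustration (untested sketch; the catalog, table name and snapshot
IDs below are placeholders, and if I recall the doc correctly the view name
defaults to <table_name>_changes, i.e., tbl_changes here):

// Create a changelog view over a snapshot range, keeping only net changes.
spark.sql(
  """CALL spark_catalog.system.create_changelog_view(
    |  table => 'db.tbl',
    |  options => map('start-snapshot-id', '1', 'end-snapshot-id', '3'),
    |  net_changes => true
    |)""".stripMargin)

// Query the resulting temporary view.
spark.sql("SELECT * FROM tbl_changes ORDER BY _change_ordinal").show()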


On Thu, Aug 22, 2024 at 7:51 AM Steven Wu  wrote:

> Peter, good question. In this case, (b) is the complete change history.
> (a) is the squashed version.
>
> I would probably check how other changelog systems deal with this scenario.
>
> On Thu, Aug 22, 2024 at 3:49 AM Péter Váry 
> wrote:
>
>> Technically different, but somewhat similar question:
>>
>> What is the expected behaviour when the `IncrementalScan` is created for
>> not a single snapshot, but for multiple snapshots?
>> S1 added PK1-V1
>> S2 updated PK1-V1 to PK1-V1b (removed PK1-V1 and added PK1-V1b)
>> S3 updated PK1-V1b to PK1-V1c (removed PK1-V1b and added PK1-V1c)
>>
>> Let's say we have
>> *IncrementalScan.fromSnapshotInclusive(S1).toSnapshot(S3)*.
>> Or we need to return:
>> (a)
>> - PK1,V1c,INSERTED
>>
>> Or is it ok, to return:
>> (b)
>> - PK1,V1,INSERTED
>> - PK1,V1,DELETED
>> - PK1,V1b,INSERTED
>> - PK1,V1b,DELETED
>> - PK1,V1c,INSERTED
>>
>> I think the (a) is the correct behaviour.
>>
>> Thanks,
>> Peter
>>
>> Steven Wu  ezt írta (időpont: 2024. aug. 21., Sze,
>> 22:27):
>>
>>> Agree with everyone that option (a) is the correct behavior.
>>>
>>> On Wed, Aug 21, 2024 at 11:57 AM Steve Zhang
>>>  wrote:
>>>
>>>> I agree that option (a) is what user expects for row level changes.
>>>>
>>>> I feel the added deletes in given snapshots provides a PK of DELETED
>>>> entry, existing deletes are used to read together with data files to find
>>>> DELETED value (V1b) and result of columns.
>>>>
>>>> Thanks,
>>>> Steve Zhang
>>>>
>>>>
>>>>
>>>> On Aug 20, 2024, at 6:06 PM, Wing Yew Poon 
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have a PR open to add changelog support for the case where delete
>>>> files are present (https://github.com/apache/iceberg/pull/10935). I
>>>> have a question about what the changelog should emit in the following
>>>> scenario:
>>>>
>>>> The table has a schema with a primary key/identifier column PK and
>>>> additional column V.
>>>> In snapshot 1, we write a data file DF1 with rows
>>>> PK1, V1
>>>> PK2, V2
>>>> etc.
>>>> In snapshot 2, we write an equality delete file ED1 with PK=PK1, and
>>>> new data file DF2 with rows
>>>> PK1, V1b
>>>> (possibly other rows)
>>>> In snapshot 3, we write an equality delete file ED2 with PK=PK1, and
>>>> new data file DF3 with rows
>>>> PK1, V1c
>>>> (possibly other rows)
>>>>
>>>> Thus, in snapshot 2 and snapshot 3, we update the row identified by PK1
>>>> with new values by using an equality delete and writing new data for the
>>>> row.
>>>> These are the files present in snapshot 3:
>>>> DF1 (sequence number 1)
>>>> DF2 (sequence number 2)
>>>> DF3 (sequence number 3)
>>>> ED1 (sequence number 2)
>>>> ED2 (sequence number 3)
>>>>
>>>> The question I have is what should the changelog emit for snapshot 3?
>>>> For snapshot 1, the changelog should emit a row for each row in DF1 as
>>>> INSERTED.
>>>> For snapshot 2, it should emit a row for PK1, V1 as DELETED; and a row
>>>> for PK1, V1b as INSERTED.
>>>> For snapshot 3, I see two possibilities:
>>>> (a)
>>>> PK1,V1b,DELETED
>>>> PK1,V1c,INSERTED
>>>>
>>>> (b)
>>>> PK1,V1,DELETED
>>>> PK1,V1b,DELETED
>>>> PK1,V1c,INSERTED
>>>>
>>>> The interpretation for (b) is that both ED1 and ED2 apply to DF1, with
>>>> ED1 being an existing delete file and ED2 being an added delete file for
>>>> it. We discount ED1 and apply ED2 and get a DELETED row for PK1,V1.
>>>> ED2 also applies to DF2, from which we get a DELETED row for PK1, V1b.
>>>>
>>>> The interpretation for (a) is that ED1 is an existing delete file for
>>>> DF1 and in snapshot 3, the row PK1,V1 already does not exist before the
>>>> snapshot. Thus we do not emit a row for it. (We can think of it as ED1 is
>>>> already applied to DF1, and we only consider any additional rows that get
>>>> deleted when ED2 is applied.)
>>>>
>>>> I lean towards (a), as I think it is more reflective of net changes.
>>>> I am interested to hear what folks think.
>>>>
>>>> Thank you,
>>>> Wing Yew
>>>>
>>>>
>>>>
>>>>


Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Wing Yew Poon
Peter,

The Spark procedure is implemented by CreateChangelogViewProcedure.java
<https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/CreateChangelogViewProcedure.java>.
This was already added by Yufei in Iceberg 1.2.0.
ChangelogIterator
<https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/ChangelogIterator.java>
is
a base class that contains static methods such as the removeNetCarryovers I
mentioned; RemoveNetCarryoverIterator
<https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/RemoveNetCarryoverIterator.java>
is
a subclass that computes the net changes.
These are Spark specific as they work with iterators of
org.apache.spark.sql.Row.

BaseIncrementalChangelogScan
<https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/BaseIncrementalChangelogScan.java>
is
a common building block that can be used by other engines than Spark; it
powers the Spark ChangelogRowReader. In the engines, whether Spark or Flink
or some other, choices can be made or made available for what records to
show. However, as a building block, I think we need the generation of all
the changes for each snapshot in the requested range. If you have ideas for
expanding the API of BaseIncrementalChangelogScan so that refinements of
what records to emit can be pushed down to it, I'd be interested in hearing
them. (They will be beyond the scope of my current PR, I think.)

- Wing Yew


On Thu, Aug 22, 2024 at 11:51 AM Péter Váry 
wrote:

> That's good info. I didn't know that we already have the Spark procedure
> at hand.
> How does Spark calculate the `changelog_view`? Do we already have an
> implementation at hand somewhere? Could it be reused?
>
> Anyways, if we want to reuse the new changelogscan for the changelog_view
> as well, then I agree that we need to provide a solution for (b). I think
> that (a)/net_changes is also important as streaming readers for the table
> are often not interested in the intermediate states, just in the final
> changes. And (a) could result in far fewer records which means better
> performance, lower resource usage.
>
> Steve Zhang  ezt írta (időpont: 2024.
> aug. 22., Cs, 19:47):
>
>> Yeah agree on this, I think for changelogscan to convert per snapshot
>> scan to tasks the option b with complete history is the right way. While
>> there shall be an option to configure if net/squashed changes are desired.
>>
>> Also, In spark create_catalog_view, the net changes and compute update
>> cannot be used together.
>>
>> Thanks,
>> Steve Zhang
>>
>>
>>
>> On Aug 22, 2024, at 8:50 AM, Steven Wu  wrote:
>>
>> >  It should emit changes for each snapshot in the requested range.
>>
>> Wing Yew has a good point here. +1
>>
>> On Thu, Aug 22, 2024 at 8:46 AM Wing Yew Poon 
>> wrote:
>>
>>> First, thank you all for your responses to my question.
>>>
>>> For Peter's question, I believe that (b) is the correct behavior. It is
>>> also the current behavior when using copy-on-write (deletes and updates are
>>> still supported but not using delete files). A changelog scan is an
>>> incremental scan over multiple snapshots. It should emit changes for each
>>> snapshot in the requested range. Spark provides additional functionality on
>>> top of the changelog scan, to produce net changes for the requested range.
>>> See
>>> https://iceberg.apache.org/docs/latest/spark-procedures/#create_changelog_view.
>>> Basically the create_changelog_view procedure uses a changelog scan (read
>>> the changelog table, i.e., .changes) to get a DataFrame which is
>>> saved to a temporary Spark view which can then be queried; if net_changes
>>> is true, only the net changes are produced for this temporary view. This
>>> functionality uses ChangelogIterator.removeNetCarryovers (which is in
>>> Spark).
>>>
>>>
>>> On Thu, Aug 22, 2024 at 7:51 AM Steven Wu  wrote:
>>>
>>>> Peter, good question. In this case, (b) is the complete change history.
>>>> (a) is the squashed version.
>>>>
>>>> I would probably check how other changelog systems deal with this
>>>> scenario.
>>>>
>>>> On Thu, Aug 22, 2024 at 3:49 AM Péter Váry 
>>>> wrote:
>>>>
>>>>> Technically different, but somewhat similar question:
>>>>>
>>>>> What is the expected behaviour when the `IncrementalScan` is created
>>>>> for not a single snapshot, but 

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Wing Yew Poon
Just a note that the functionality to compute net changes was added by
Yufei only in Iceberg 1.4.0, in #7326
<https://github.com/apache/iceberg/pull/7326>.

On Thu, Aug 22, 2024 at 12:48 PM Wing Yew Poon  wrote:

> Peter,
>
> The Spark procedure is implemented by CreateChangelogViewProcedure.java
> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/CreateChangelogViewProcedure.java>.
> This was already added by Yufei in Iceberg 1.2.0.
> ChangelogIterator
> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/ChangelogIterator.java>
>  is
> a base class that contains static methods such as the removeNetCarryovers I
> mentioned; RemoveNetCarryoverIterator
> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/RemoveNetCarryoverIterator.java>
>  is
> a subclass that computes the net changes.
> These are Spark specific as they work with iterators of
> org.apache.spark.sql.Row.
>
> BaseIncrementalChangelogScan
> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/BaseIncrementalChangelogScan.java>
>  is
> a common building block that can be used by other engines than Spark; it
> powers the Spark ChangelogRowReader. In the engines, whether Spark or Flink
> or some other, choices can be made or made available for what records to
> show. However, as a building block, I think we need the generation of all
> the changes for each snapshot in the requested range. If you have ideas for
> expanding the API of BaseIncrementalChangelogScan so that refinements of
> what records to emit can be pushed down to it, I'd be interested in hearing
> them. (They will be beyond the scope of my current PR, I think.)
>
> - Wing Yew
>
>
> On Thu, Aug 22, 2024 at 11:51 AM Péter Váry 
> wrote:
>
>> That's good info. I didn't know that we already have the Spark procedure
>> at hand.
>> How does Spark calculate the `changelog_view`? Do we already have an
>> implementation at hand somewhere? Could it be reused?
>>
>> Anyways, if we want to reuse the new changelogscan for the changelog_view
>> as well, then I agree that we need to provide a solution for (b). I think
>> that (a)/net_changes is also important as streaming readers for the table
>> are often not interested in the intermediate states, just in the final
>> changes. And (a) could result in far fewer records which means better
>> performance, lower resource usage.
>>
>> Steve Zhang  ezt írta (időpont: 2024.
>> aug. 22., Cs, 19:47):
>>
>>> Yeah agree on this, I think for changelogscan to convert per snapshot
>>> scan to tasks the option b with complete history is the right way. While
>>> there shall be an option to configure if net/squashed changes are desired.
>>>
>>> Also, In spark create_catalog_view, the net changes and compute update
>>> cannot be used together.
>>>
>>> Thanks,
>>> Steve Zhang
>>>
>>>
>>>
>>> On Aug 22, 2024, at 8:50 AM, Steven Wu  wrote:
>>>
>>> >  It should emit changes for each snapshot in the requested range.
>>>
>>> Wing Yew has a good point here. +1
>>>
>>> On Thu, Aug 22, 2024 at 8:46 AM Wing Yew Poon
>>>  wrote:
>>>
>>>> First, thank you all for your responses to my question.
>>>>
>>>> For Peter's question, I believe that (b) is the correct behavior. It is
>>>> also the current behavior when using copy-on-write (deletes and updates are
>>>> still supported but not using delete files). A changelog scan is an
>>>> incremental scan over multiple snapshots. It should emit changes for each
>>>> snapshot in the requested range. Spark provides additional functionality on
>>>> top of the changelog scan, to produce net changes for the requested range.
>>>> See
>>>> https://iceberg.apache.org/docs/latest/spark-procedures/#create_changelog_view.
>>>> Basically the create_changelog_view procedure uses a changelog scan (read
>>>> the changelog table, i.e., .changes) to get a DataFrame which is
>>>> saved to a temporary Spark view which can then be queried; if net_changes
>>>> is true, only the net changes are produced for this temporary view. This
>>>> functionality uses ChangelogIterator.removeNetCarryovers (which is in
>>>> Spark).
>>>>
>>>>
>>>> On Thu, Aug 22, 2024 at 7:51 AM Steven Wu  wrote:
>>>>
>>&

Re: [DISCUSS] Apache Iceberg 1.7.0 Release Cutoff

2024-10-21 Thread Wing Yew Poon
Hi Russell,
There is a data correctness issue (
https://github.com/apache/iceberg/issues/11221) that I have a fix for (
https://github.com/apache/iceberg/pull/11247). This is a serious issue, and
I'd like to see the fix go into 1.7.0.
Eduard has already approved the PR, but he asked if you or Amogh would take
a look as well.
Thanks,
Wing Yew


On Mon, Oct 21, 2024 at 8:56 AM Russell Spitzer 
wrote:

> That's still my current plan
>
> On Mon, Oct 21, 2024 at 10:52 AM Rodrigo Meneses 
> wrote:
>
>> Hi, team. Are we still targeting to cut off on October 25th and release
>> by Oct 31the, for the 1.7.0 release?
>> Thanks
>> -Rodrigo
>>
>>
>> On Thu, Oct 3, 2024 at 9:03 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi Russ
>>>
>>> As discussed during the community sync, I agree with this plan.
>>>
>>> In the meantime (as you saw on row lineage doc), I'm working on V3
>>> spec proposals (reviews, PRs, ...).
>>>
>>> If needed, I can be volunteer as release manager for the 1.7.0 release.
>>>
>>> Regards
>>> JB
>>>
>>> On Fri, Oct 4, 2024 at 5:55 AM Russell Spitzer
>>>  wrote:
>>> >
>>> > Hi y'all!
>>> >
>>> > As discussed at the community sync on Wednesday, October has begun and
>>> we are beginning to flesh out the 1.7.0 release as well as the V3 Table
>>> Spec. Since we are a little worried that we won't have all of the Spec
>>> items we want by the end of October,  we discussed that we may want to just
>>> do a release with what we have at the end of the month.
>>> >
>>> > It was noted that we have a lot of exciting things already in and we
>>> may be able to get just the beginning of V3 support as well.
>>> >
>>> > To that end it was proposed that we do the next Iceberg release at the
>>> end of the month (Oct 31st) , and have the cutoff a week before (Oct 25th).
>>> Does anyone have objections or statements of support for this plan?
>>> >
>>> > With this in mind please also start marking any remaining PR's or
>>> Issues that you have with 1.7.0 so we can prioritize them for the cutoff
>>> date.
>>> >
>>> >
>>> > Thanks everyone for your time,
>>> > Russ
>>>
>>


Re: [ANNOUNCE] Apache Iceberg release 1.6.1

2024-09-25 Thread Wing Yew Poon
I do not see release notes for 1.6.1.
Shouldn't https://iceberg.apache.org/releases/ have a section for 1.6.1 and
highlights of the changes? (And for 1.6.1 to show up in the Table of
contents on the right?)


On Wed, Aug 28, 2024 at 8:34 AM Carl Steinbach  wrote:

> I'm pleased to announce the release of Apache Iceberg 1.6.1!
>
> Apache Iceberg is an open table format for huge analytic datasets. Iceberg
> delivers high query performance for tables with tens of petabytes of data,
> along with atomic commits, concurrent writes, and SQL-compatible table
> evolution.
>
> This release can be downloaded from:
> https://dlcdn.apache.org/iceberg/apache-iceberg-1.6.1/apache-iceberg-1.6.1.tar.gz
>
> Release notes: https://iceberg.apache.org/releases/#1.6.1-release
>
> Java artifacts are available from Maven Central.
>
> Thanks to everyone for contributing!
>
>


Re: [DISCUSS] Hive Support

2024-11-25 Thread Wing Yew Poon
For the Hive runtime, would it be feasible for the Hive community to
contribute a suite of tests to the Iceberg repo that can be run with
dependencies from the latest Hive release (Hive 4.x), and then update them
from time to time as appropriate? The purpose of this suite would be to
test integration of Iceberg core with the Hive runtime. Perhaps the
existing tests in the mr and hive3 modules could be a starting point, or
you might decide on different tests altogether.
The development of the Hive runtime would then continue as now in the Hive
repo, but you gain better assurance of compatibility with ongoing Iceberg
development, with a relatively small maintenance burden in Iceberg.



On Mon, Nov 25, 2024 at 11:56 AM Ayush Saxena  wrote:

> Hi Peter,
>
> Thanks for bringing this to our attention.
>
> From my side, I have a say only on the code that resides in the Hive
> repository. I am okay with the first approach, as we are already
> following it for the most part. Whether Iceberg keeps or drops the
> code shouldn’t have much impact on us. (I don't think I have a say on
> that either) That said, it would be helpful if they continue running
> tests against the latest stable Hive releases to ensure that any
> changes don’t unintentionally break something for Hive, which would be
> beyond our control.
>
> Regarding having a separate code repository for the connectors, I
> believe the challenges would outweigh the benefits. As mentioned, the
> initial workload would be significant, but more importantly,
> maintaining a regular cadence of releases would be even more
> difficult. I don’t see a large pool of contributors specifically
> focused on this area who could take ownership and drive releases for a
> single repository. Additionally, the ASF doesn’t officially allow
> repo-level committers or PMC members who could be recruited solely to
> manage one repository. Given these constraints, I suggest dropping
> this idea for now.
>
> Best,
> Ayush
>
> On Tue, 26 Nov 2024 at 01:05, Denys Kuzmenko  wrote:
> >
> > Hi Peter,
> >
> > Thanks for bringing it up!
> >
> > I think that option 1 is the only viable solution here (remove the
> hive-runtime from the iceberg repo). Main reason: lack of reviewers for
> things other than Spark.
> >
> > Note: need to double check, but I am pretty sure there is no difference
> between Hive `iceberg-catalog` and iceberg's `hive-metastore`, so we could
> potentially drop it from Hive repo and maybe rename to `hive-catalog` in
> iceberg?
> >
> > Supporting one more connector repo seems like an overhead: need to setup
> infra, CI, have active contributors/release managers. Later probably is the
> reason why we still haven't moved HMS into a separate repo.
> >
> > Having iceberg connector in Hive gives us more flexibility and ownership
> of that component, doesn't block an active development.
> > We try to be up-to-date with latest iceberg, but it usually takes some
> time.
> >
> > I'd be glad to hear other opinions.
> >
> > Thanks,
> > Denys
>


Re: [DISCUSS] Hive Support

2024-11-20 Thread Wing Yew Poon
Also to clarify --
It is my understanding that removing the hive-metastore module is NOT under
consideration; is that correct?
We still need a Hive version to depend on for the hive-metastore module. In
https://github.com/apache/iceberg/pull/10996, this is Hive 3. Does this
present any problem?


On Tue, Nov 19, 2024 at 10:26 PM Manu Zhang  wrote:

> To clarify, the changes discussed here don't affect hive connectors in
> engines, which either use the built-in hive version (Spark) or can be
> upgraded to hive 3 (Flink).
>
> On Wed, Nov 20, 2024 at 2:19 PM Manu Zhang 
> wrote:
>
>> Okay, let me add this option
>>
>> D. Drop Hive 2 & 3 support and suggest to use built-in Iceberg support of
>> Hive 4
>>
>> On Wed, Nov 20, 2024 at 2:00 PM Cheng Pan  wrote:
>>
>>> Hive 4 brings built-in support for Iceberg format, duplicated
>>> implementation in both sides look a redundant stuff.
>>>
>>> As Hive 2 and 3 do not support Java 11+, and Iceberg 1.8 requires Java
>>> 11+, the combination is invalid. How about simply dropping support for Hive
>>> 2&3 and suggesting the Hive user upgrade Hive 4 to gain the built-in
>>> Iceberg support?
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>>
>>> On Nov 20, 2024, at 12:47, Manu Zhang  wrote:
>>>
>>> Hi all,
>>>
>>> We previously reached consensus[1] to deprecate Hive 2 in 1.7 and drop
>>> in 1.8. However, when working on the removal PR[2], multiple tests failed
>>> in Hive 3 due to not supporting JDK11[3]. The fix has been back-ported to
>>> branch-3.1[4] but not released yet. As announced on Hive website, Hive 3.x
>>> is declared as End of Life so there will be no more Hive 3 release.
>>> Peter(@pvary) suggested upgrading to Hive 4 instead. On the other hand,
>>> iceberg-hive3 tests are already broken after we dropped JDK 8 support. It's
>>> not caught previously due to tests not running[6].
>>>
>>> Based on the current situation, here are the options I can think of to
>>> move forward
>>>
>>> A. Continue to remove Hive 2 in the current PR and upgrade to Hive 4 in
>>> a separate PR.
>>> B. Hold on removing Hive 2 until we upgrade to Hive 4
>>> C. Add source dependency[7] on Hive branch-3.1 or make a Hive 3.1
>>> release from a forked repo.
>>>
>>> 1. https://lists.apache.org/thread/zg14b8cor4lnbyd3t4n1297y2bwb1fsg
>>> 2. https://github.com/apache/iceberg/pull/10996
>>> 3. https://issues.apache.org/jira/browse/HIVE-21584
>>> 4. https://github.com/apache/hive/commits/branch-3.1/
>>> 5. https://hive.apache.org/general/downloads/
>>> 6. https://github.com/apache/iceberg/pull/11584
>>> 7. https://blog.gradle.org/introducing-source-dependencies
>>>
>>> Which option do you prefer? Any better alternative?
>>>
>>> Thanks,
>>> Manu
>>>
>>>
>>>


Re: [DISCUSS] Hive Support

2025-01-06 Thread Wing Yew Poon
FYI --
It looks like the built-in Hive version in the master branch of Apache
Spark is 2.3.10 (https://issues.apache.org/jira/browse/SPARK-47018), and
https://issues.apache.org/jira/browse/SPARK-44114 (upgrade built-in Hive to
3+) is an open issue.


On Mon, Jan 6, 2025 at 1:07 PM Wing Yew Poon  wrote:

> Hi Peter,
> In Spark, you can specify the Hive version of the metastore that you want
> to use. There is a configuration, spark.sql.hive.metastore.version, which
> currently (as of Spark 3.5) defaults to 2.3.9, and the jars supporting this
> default version are shipped with Spark as built-in. You can specify a
> different version and then specify spark.sql.hive.metastore.jars=path (the
> default is built-in) and spark.sql.hive.metastore.jars.path to point to
> jars for the Hive metastore version you want to use. What
> https://issues.apache.org/jira/browse/SPARK-45265 does is to allow 4.0.x
> to be supported as a spark.sql.hive.metastore.version. I haven't been
> following Spark 4, but I suspect that the built-in version is not changing
> to Hive 4.0. The built-in version is also used for other things that Spark
> may use from Hive (aside from interaction with HMS), such as Hive SerDes.
> See https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
> .
> - Wing Yew
>
>
> On Mon, Jan 6, 2025 at 2:04 AM Péter Váry 
> wrote:
>
>> Hi Manu,
>>
>> > Spark has only added hive 4.0 metastore support recently for Spark
>> 4.0[1] and there will be conflicts
>>
>> Does this mean that Spark 4.0 will always use Hive 4 code? Or it will use
>> Hive 2 when it is present on the classpath, but if older Hive versions are
>> not on the classpath then it will use the embedded Hive 4 code?
>>
>> > Firstly, upgrading from Hive 2 to Hive 4 is a huge change
>>
>> Is this a huge change even after we remove the Hive runtime module?
>>
>> After removing the Hive runtime module, we have 2 remaining Hive
>> dependencies:
>>
>>- HMS Client
>>   - The Thrift API should not be changed between the Hive versions,
>>   so unless we start to use specific Hive 4 features we should be fine 
>> here -
>>   so whatever version of Hive we use, it should work
>>   - Java API changes. We found that in Hive 2, and Hive 3 the
>>   HMSClient classes used different constructors so we ended up using
>>   DynMethods to use the appropriate constructors - if we use a strict 
>> Hive
>>   version here, then we won't need the DynMethods anymore
>>   - Based on our experience, even if Hive 3 itself doesn't support
>>   Java 11, the HMS Client for Hive 3 doesn't have any issues when used 
>> with
>>   Java 11
>>- Testing infrastructure
>>   - TestHiveMetastore creates and starts a HMS instance. This could
>>   be highly dependent on the version of Hive we are using. Since this is 
>> only
>>   a testing code I expect that only our tests are interacting with this
>>
>> *@Manu*: You know more of the details here. Do we have HMSClient issues
>> when we use Hive 4 code? If I miss something in the listing above, please
>> correct me.
>>
>> Based on this, in an ideal world:
>>
>>- Hive would provide a HMS client jar which only contains java code
>>which is needed to connect and communicate using Thrift with a HMS 
>> instance
>>(no internal HMS server code etc). We could use it as a dependency for our
>>iceberg-hive-metastore module. Either setting a minimal version, or using 
>> a
>>shaded embedded version. *@Hive* folks - is this a valid option? What
>>are the reasons that there is no metastore-client jar provided currently?
>>Would it be possible to generate one in some of the future Hive releases.
>>Seems like a worthy feature for me.
>>- We would create our version dependent HMS infrastructure if we want
>>to support Spark versions which support older Hive versions.
>>
>> As a result of this, we could have:
>>
>>- Clean definition of which Hive version is supported
>>- Testing for the supported Hive versions
>>- Java 11 support
>>
>> As an alternative we can create a testing matrix where some tests are run
>> with both Hive 3 and Hive 4, and some tests are run with only Hive3 (older
>> Spark versions which does not support Hive 4)
>>
>> Thanks Manu for driving this!
>> Peter
>>
>> Manu Zhang  ezt írta (időpont: 2025. jan. 5.,
>> V, 5:18):
>>
>>> This basically means that we need to support every exact Hive versions
>>

Re: [DISCUSS] Hive Support

2025-01-06 Thread Wing Yew Poon
Hi Peter,
In Spark, you can specify the Hive version of the metastore that you want
to use. There is a configuration, spark.sql.hive.metastore.version, which
currently (as of Spark 3.5) defaults to 2.3.9, and the jars supporting this
default version are shipped with Spark as built-in. You can specify a
different version and then specify spark.sql.hive.metastore.jars=path (the
default is built-in) and spark.sql.hive.metastore.jars.path to point to
jars for the Hive metastore version you want to use. What
https://issues.apache.org/jira/browse/SPARK-45265 does is to allow 4.0.x to
be supported as a spark.sql.hive.metastore.version. I haven't been
following Spark 4, but I suspect that the built-in version is not changing
to Hive 4.0. The built-in version is also used for other things that Spark
may use from Hive (aside from interaction with HMS), such as Hive SerDes.
See https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html.
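
As a concrete (purely illustrative) sketch of what that looks like in a job --
the version and jar location below are placeholders, not a recommendation:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  // Use a Hive 3 metastore client instead of the built-in 2.3.9 one.
  .config("spark.sql.hive.metastore.version", "3.1.3")
  .config("spark.sql.hive.metastore.jars", "path")
  .config("spark.sql.hive.metastore.jars.path", "file:///opt/hive/lib/*.jar")
  .getOrCreate()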
- Wing Yew


On Mon, Jan 6, 2025 at 2:04 AM Péter Váry 
wrote:

> Hi Manu,
>
> > Spark has only added hive 4.0 metastore support recently for Spark
> 4.0[1] and there will be conflicts
>
> Does this mean that Spark 4.0 will always use Hive 4 code? Or it will use
> Hive 2 when it is present on the classpath, but if older Hive versions are
> not on the classpath then it will use the embedded Hive 4 code?
>
> > Firstly, upgrading from Hive 2 to Hive 4 is a huge change
>
> Is this a huge change even after we remove the Hive runtime module?
>
> After removing the Hive runtime module, we have 2 remaining Hive
> dependencies:
>
>- HMS Client
>   - The Thrift API should not be changed between the Hive versions,
>   so unless we start to use specific Hive 4 features we should be fine 
> here -
>   so whatever version of Hive we use, it should work
>   - Java API changes. We found that in Hive 2, and Hive 3 the
>   HMSClient classes used different constructors so we ended up using
>   DynMethods to use the appropriate constructors - if we use a strict Hive
>   version here, then we won't need the DynMethods anymore
>   - Based on our experience, even if Hive 3 itself doesn't support
>   Java 11, the HMS Client for Hive 3 doesn't have any issues when used 
> with
>   Java 11
>- Testing infrastructure
>   - TestHiveMetastore creates and starts a HMS instance. This could
>   be highly dependent on the version of Hive we are using. Since this is 
> only
>   a testing code I expect that only our tests are interacting with this
>
> *@Manu*: You know more of the details here. Do we have HMSClient issues
> when we use Hive 4 code? If I miss something in the listing above, please
> correct me.
>
> Based on this, in an ideal world:
>
>- Hive would provide a HMS client jar which only contains java code
>which is needed to connect and communicate using Thrift with a HMS instance
>(no internal HMS server code etc). We could use it as a dependency for our
>iceberg-hive-metastore module. Either setting a minimal version, or using a
>shaded embedded version. *@Hive* folks - is this a valid option? What
>are the reasons that there is no metastore-client jar provided currently?
>Would it be possible to generate one in some of the future Hive releases.
>Seems like a worthy feature for me.
>- We would create our version dependent HMS infrastructure if we want
>to support Spark versions which support older Hive versions.
>
> As a result of this, we could have:
>
>- Clean definition of which Hive version is supported
>- Testing for the supported Hive versions
>- Java 11 support
>
> As an alternative we can create a testing matrix where some tests are run
> with both Hive 3 and Hive 4, and some tests are run with only Hive3 (older
> Spark versions which does not support Hive 4)
>
> Thanks Manu for driving this!
> Peter
>
> Manu Zhang  ezt írta (időpont: 2025. jan. 5., V,
> 5:18):
>
>> This basically means that we need to support every exact Hive versions
>>> which are used by Spark, and we need to exclude our own Hive version from
>>> the Spark runtime.
>>
>>
>> Firstly, upgrading from Hive 2 to Hive 4 is a huge change, and I expect
>> compatibility to be much better once Iceberg and Spark are both on Hive 4.
>>
>> Secondly, the coupling can be loosed if we are moving toward the REST
>> catalog.
>>
>> On Fri, Jan 3, 2025 at 7:26 PM Péter Váry 
>> wrote:
>>
>>> That sounds really interesting in a bad way :) :(
>>>
>>> This basically means that we need to support every exact Hive versions
>>> which are used by Spark, and we need to exclude our own Hive version from
>>> the Spark runtime.
>>>
>>> On Thu, Dec 19, 2024, 04:00 Manu Zhang  wrote:
>>>
 Hi Peter,

> I think we should make sure that the Iceberg Hive version is
> independent from the version used by Spark

  I'm afraid that is not how it works currently. When Spark is deployed
 with hive libraries (I suppose thi

Re: Welcome Huaxin Gao as a committer!

2025-02-06 Thread Wing Yew Poon
Congratulations Huaxin! Awesome!


On Thu, Feb 6, 2025 at 9:27 AM Yufei Gu  wrote:

> Congrats Huaxin!
>
> Yufei
>
>
> On Thu, Feb 6, 2025 at 9:09 AM Steve Zhang 
> wrote:
>
>> Congratulations Huaxin, well deserved!
>>
>> Thanks,
>> Steve Zhang
>>
>>
>>
>> On Feb 6, 2025, at 8:16 AM, Xingyuan Lin 
>> wrote:
>>
>> Congrats Huaxin!
>>
>> On Thu, Feb 6, 2025 at 11:11 AM Denny Lee  wrote:
>>
>>> Congratulations Huaxin!!!
>>>
>>> On Thu, Feb 6, 2025 at 7:47 AM Amogh Jahagirdar <2am...@gmail.com>
>>> wrote:
>>>
 Congratulations Huaxin!

 On Thu, Feb 6, 2025 at 8:41 AM Kevin Liu  wrote:

> Congratulations Huaxin!! Looking forward to working together 🎉
> 
>
> Best,
> Kevin Liu
>
> On Thu, Feb 6, 2025 at 7:30 AM Prashant Singh <
> prashant010...@gmail.com> wrote:
>
>> Congratulations Huaxin !
>>
>> Best,
>> Prashant Singh
>>
>>
>> On Thu, Feb 6, 2025 at 7:25 AM himadri pal  wrote:
>>
>>> Congratulations Huaxin.
>>>
>>> On Thu, Feb 6, 2025 at 6:45 AM Sung Yun  wrote:
>>>
 That's fantastic news Huaxin. Congratulations!

 On 2025/02/06 13:40:09 Rodrigo Meneses wrote:
 > Congrats and best wishes !!!
 >
 > On Thu, Feb 6, 2025 at 5:04 AM Gidon Gershinsky 
 wrote:
 >
 > > Congrats Huaxin!
 > >
 > > Cheers, Gidon
 > >
 > >
 > > On Thu, Feb 6, 2025 at 2:46 PM Tushar Choudhary <
 > > tushar.choudhary...@gmail.com> wrote:
 > >
 > >> Congratulations Husain!
 > >>
 > >> Cheers,
 > >> Tushar Choudhary
 > >>
 > >>
 > >> On Thu, 6 Feb 2025 at 6:15 PM, xianjin 
 wrote:
 > >>
 > >>> Congrats huaxin!
 > >>> Sent from my iPhone
 > >>>
 > >>> On Feb 6, 2025, at 7:35 PM, Fokko Driesprong <
 fo...@apache.org> wrote:
 > >>>
 > >>> 
 > >>>
 > >>> Congratulations Huaxin!
 > >>>
 > >>> Op do 6 feb 2025 om 12:21 schreef Russell Spitzer <
 > >>> russell.spit...@gmail.com>:
 > >>>
 >  Congratulations!
 > 
 >  On Thu, Feb 6, 2025 at 11:35 AM Péter Váry <
 peter.vary.apa...@gmail.com>
 >  wrote:
 > 
 > > Congratulations!
 > >
 > > Matt Topol  ezt írta (időpont:
 2025. febr.
 > > 6., Cs, 10:40):
 > >
 > >> Congrats! Welcome!
 > >>
 > >> On Thu, Feb 6, 2025, 10:19 AM Raúl Cumplido <
 rau...@apache.org>
 > >> wrote:
 > >>
 > >>> Congrats Huaxin!
 > >>>
 > >>> El jue, 6 feb 2025 a las 10:16, Gang Wu (<
 ust...@gmail.com>)
 > >>> escribió:
 > >>>
 >  Congrats Huaxin!
 > 
 >  Best,
 >  Gang
 > 
 >  On Thu, Feb 6, 2025 at 5:10 PM Szehon Ho <
 szehon.apa...@gmail.com>
 >  wrote:
 > 
 > > Hi everyone,
 > >
 > > The Project Management Committee (PMC) for Apache
 Iceberg has
 > > invited Huaxin Gao to become a committer, and I am
 happy to
 > > announce that she has accepted.  Huaxin has done a lot
 > > of impressive work in areas such as Iceberg-Spark
 integration and recently
 > > Iceberg-Comet integrations.  Thanks Huaxin for all your
 hard work!
 > >
 > > Please join us in welcoming her!
 > >
 > > Thanks,
 > > Szehon
 > > On behalf of the Iceberg PMC
 > >
 > 
 >

>>>
>>>
>>> --
>>> Regards,
>>> Himadri Pal
>>>
>>
>>


Re: Changelog scan for table with delete files

2025-02-10 Thread Wing Yew Poon
Hi Anton,

Thank you for looking at https://github.com/apache/iceberg/pull/10935. I
think we are in agreement on the behavior, but you have concerns about the
performance of the scan, which I agree are justified. It has been some
months now. Do you have any suggestions for improving the performance? How
can we move forward with this? Can we get a working implementation in first
and optimize it later?

- Wing Yew


On Sat, Oct 5, 2024 at 10:53 PM Anton Okolnychyi 
wrote:

> I will take a look next week!
>
> субота, 5 жовтня 2024 р. Péter Váry  пише:
>
>> Hi Team,
>>
>> Gentle reminder, that the PR for the changelog planning (
>> https://github.com/apache/iceberg/pull/10935) is still waiting for
>> expert reviews.
>>
>> Thanks, Peter
>>
>> On Tue, Oct 1, 2024, 06:46 Yufei Gu  wrote:
>>
>>> Thanks, Peter and Wing Yew Poon, for tackling these! I’ve been eager to
>>> review, but this week has been hectic. I plan to check out PR #10935 next
>>> week, though I’d be happy if someone beats me to it.
>>>
>>> Yufei
>>>
>>>
>>> On Mon, Sep 30, 2024 at 3:02 AM Péter Váry 
>>> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> The Changelog scan Java API interfaces were created a long time ago by
>>>> Anton, but it has not been implemented until yet. There is a Spark
>>>> specific SQL implementation for the feature, but the feature is not
>>>> available on the Java API.
>>>>
>>>> The Flink CDC streaming read is one of the often required features [1]
>>>> [2]. Flink needs the Java API to provide streaming reads for tables with
>>>> deletes.
>>>>
>>>> Wing Yew Poon implemented the Java API [3]. I did my best reviewing
>>>> the PR, but I am not an expert on this part of the code. I would like to
>>>> ask some of the planning experts (or anyone else for that matter), to take
>>>> a look and validate too.
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> [1] - https://github.com/apache/iceberg/issues/5623
>>>> [2] -
>>>> https://github.com/apache/iceberg/issues/5803#issuecomment-1259759074
>>>> [3] - https://github.com/apache/iceberg/pull/10935
>>>>
>>>


missing files in an Iceberg table

2025-01-27 Thread Wing Yew Poon
Hi,
A surprising number of our customers have inadvertently deleted files that
are part of their Iceberg tables (from storage), both data and metadata.
This has caused their Iceberg tables to be unreadable (or unloadable in the
case of missing metadata).
In the case of missing data files, we have provided code to the customer to
"repair" the table to make it readable again without the missing files
(where they are not able to recover the files at all). I have put up a PR,
https://github.com/apache/iceberg/pull/12106, for a Spark action that removes
references to missing data files and delete files from table metadata. Perhaps
this would be useful to others.
I have kept the action simple. Removing a data file may result in dangling
deletes, but the action does not do anything about that. However, running
rewrite_position_delete_files or rewrite_data_files afterwards would clean
them up.
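
For example (untested; the catalog and table names are placeholders), the
follow-up cleanup could be as simple as:

// Rewrite position delete files; this also drops dangling deletes.
spark.sql("CALL spark_catalog.system.rewrite_position_delete_files(table => 'db.sample')")
// Alternatively, compact the data files.
spark.sql("CALL spark_catalog.system.rewrite_data_files(table => 'db.sample')")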
Repairing a table with missing metadata is more difficult and depends on
what metadata files are missing.
- Wing Yew


Re: missing files in an Iceberg table

2025-01-28 Thread Wing Yew Poon
Dan,
Thanks for the pointers. Let me look into that work.
- Wing Yew


On Tue, Jan 28, 2025 at 8:49 AM Daniel Weeks  wrote:

> Hey Wing Yew,
>
> I would agree that this is a common problem and we need a way to get
> tables back into a good state when something unexpected happens.  Amogh and
> Matt have a PR (API: Define RepairManifests action interface
> <https://github.com/apache/iceberg/pull/10784#top>
> #10784) that was originally intended to address this and was part of some
> other changes (here <https://github.com/apache/iceberg/pull/10711> and
> here <https://github.com/apache/iceberg/pull/10721>), to provide
> mechanisms to recover files where possible (e.g. versioned buckets or HDFS
> trash).
>
> I think this lost a little momentum over the holidays, but it would be
> great if you could work with them to come finalize this work,
>
> -Dan
>
> On Tue, Jan 28, 2025 at 7:16 AM Zach Dischner 
> wrote:
>
>> Hi Wing,
>>
>> Thank you for bringing this up. We run into this all the time,
>> particularly when the underlying storage has data management settings
>> outside of Iceberg's ownership (I.E. s3 retention policies). It is probably
>> a weekly occurrence, and one of the biggest pain points for new builders.
>> Thanks for kicking this off!
>>
>> Zach
>>
>> On Tue, Jan 28, 2025 at 5:36 AM Gabor Kaszab 
>> wrote:
>>
>>> Hi,
>>>
>>> I can also confirm that there are a number of users who find themselves
>>> unintentionally deleting some files and not being able to use their Iceberg
>>> tables anymore. The number of these incidents is surprisingly high for some
>>> reason. There was also a question on Iceberg Slack around this problem the
>>> other day. So I think it's reasonable to provide some recovery mechanisms
>>> in the Iceberg lib in some form to the users.
>>>
>>> I went through the PR for my own education and left some comments,
>>> mostly around the introduced table API for this. Please let me know if any
>>> of this makes sense.
>>>
>>> Cheers,
>>> Gabor
>>>
>>> On Mon, Jan 27, 2025 at 6:10 PM Wing Yew Poon
>>>  wrote:
>>>
>>>> Hi,
>>>> A surprising number of our customers have inadvertently deleted files
>>>> that are part of their Iceberg tables (from storage), both data and
>>>> metadata. This has caused their Iceberg tables to be unreadable (or
>>>> unloadable in the case of missing metadata).
>>>> In the case of missing data files, we have provided code to the
>>>> customer to "repair" the table to make it readable again without the
>>>> missing files (where they are not able to recover the files at all). I have
>>>> put up a PR, https://github.com/apache/iceberg/pull/12106, for a Spark
>>>> action to remove missing data and delete files from table metadata. Perhaps
>>>> this would be useful to others.
>>>> I have kept the action simple. Removing a data file may result in
>>>> dangling deletes but the action does not do anything about that. However,
>>>> running rewrite_position_deletes_files or rewrite_data_files subsequently
>>>> would clean them up.
>>>> Repairing a table with missing metadata is more difficult and depends
>>>> on what metadata files are missing.
>>>> - Wing Yew
>>>>
>>>>
>>
>> --
>> Zach Dischner
>> 303-919-1364 | zach.disch...@gmail.com
>> Senior Software Development Engineer | Amazon Advertising
>> zachdischner.com <http://www.zachdischner.com/> | Flickr
>> <http://www.flickr.com/photos/zachd1_618/> | Smugmug
>> <http://zachdischner.smugmug.com/> | 2manventure
>> <http://2manventure.wordpress.com/>
>>
>


Re: Changelog scan for table with delete files

2025-02-14 Thread Wing Yew Poon
Ok Anton. Please let me know.


On Thu, Feb 13, 2025 at 9:28 PM Anton Okolnychyi 
wrote:

> Hey Wing Yew, I am planning to focus on this after we get partition stats
> readers/writers into main. I actually have ideas on how to implement
> changelog scans for V2 tables efficiently.
>
> - Anton
>
> пн, 10 лют. 2025 р. о 21:11 Wing Yew Poon 
> пише:
>
>> Hi Anton,
>>
>> Thank you for looking at https://github.com/apache/iceberg/pull/10935. I
>> think we are in agreement on the behavior, but you have concerns about the
>> performance of the scan, which I agree is justified. It has been some
>> months now. Do you have any suggestions for improving the performance? How
>> can we move forward with this? Can we get a working implementation in first
>> and optimize it later?
>>
>> - Wing Yew
>>
>>
>> On Sat, Oct 5, 2024 at 10:53 PM Anton Okolnychyi 
>> wrote:
>>
>>> I will take a look next week!
>>>
>>> субота, 5 жовтня 2024 р. Péter Váry  пише:
>>>
>>>> Hi Team,
>>>>
>>>> Gentle reminder, that the PR for the changelog planning (
>>>> https://github.com/apache/iceberg/pull/10935) is still waiting for
>>>> expert reviews.
>>>>
>>>> Thanks, Peter
>>>>
>>>> On Tue, Oct 1, 2024, 06:46 Yufei Gu  wrote:
>>>>
>>>>> Thanks, Peter and Wing Yew Poon, for tackling these! I’ve been eager
>>>>> to review, but this week has been hectic. I plan to check out PR #10935
>>>>> next week, though I’d be happy if someone beats me to it.
>>>>>
>>>>> Yufei
>>>>>
>>>>>
>>>>> On Mon, Sep 30, 2024 at 3:02 AM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>>> Hi Team,
>>>>>>
>>>>>> The Changelog scan Java API interfaces were created a long time ago
>>>>>> by Anton, but it has not been implemented until yet. There is a Spark
>>>>>> specific SQL implementation for the feature, but the feature is not
>>>>>> available on the Java API.
>>>>>>
>>>>>> The Flink CDC streaming read is one of the often required features
>>>>>> [1] [2]. Flink needs the Java API to provide streaming reads for tables
>>>>>> with deletes.
>>>>>>
>>>>>> Wing Yew Poon implemented the Java API [3]. I did my best reviewing
>>>>>> the PR, but I am not an expert on this part of the code. I would like to
>>>>>> ask some of the planning experts (or anyone else for that matter), to 
>>>>>> take
>>>>>> a look and validate too.
>>>>>>
>>>>>> Thanks,
>>>>>> Peter
>>>>>>
>>>>>> [1] - https://github.com/apache/iceberg/issues/5623
>>>>>> [2] -
>>>>>> https://github.com/apache/iceberg/issues/5803#issuecomment-1259759074
>>>>>> [3] - https://github.com/apache/iceberg/pull/10935
>>>>>>
>>>>>


Re: [VOTE] Simplify multi-arg table metadata

2025-02-11 Thread Wing Yew Poon
+1 (non-binding)


On Mon, Feb 10, 2025 at 10:26 AM Yufei Gu  wrote:

> +1
> Yufei
>
>
> On Mon, Feb 10, 2025 at 9:48 AM Steve Zhang
>  wrote:
>
>> +1 (non-binding).
>>
>> Thanks,
>> Steve Zhang
>>
>>
>>
>> On Feb 9, 2025, at 1:01 AM, Fokko Driesprong  wrote:
>>
>> (Second attempt, the cat  ran over the keyboard)
>>
>> Hey everyone,
>>
>> After the positive responses on the devlist
>> , I
>> would like to raise a vote to simplify the multi-argument transforms
>> metadata and make it exclusive for V3+ tables. The corresponding PR can be
>> found here .
>>
>> This vote will be open for at least 72 hours.
>>
>> [ ] +1 Update the metadata to remove multi-arg transforms for V1 and V2
>> tables
>> [ ] +0
>> [ ] -1 I have questions and/or concerns
>>
>> Kind regards,
>> Fokko
>>
>> Op zo 9 feb 2025 om 09:57 schreef Driesprong, Fokko > >:
>>
>>> Hey everyone,
>>>
>>> After the positive responses on the devlist, I would like to raise a vote
>>> to simplify the multi-argument transforms metadata, and make it exclusive
>>>
>>> A vote to simplify the
>>>
>>
>>


Re: [DISCUSS] Cleanup unreferenced statistics files through DropTableData

2025-03-07 Thread Wing Yew Poon
Gabor kindly pointed out to me in direct communication that I was mistaken
to assert that "any files that already appear as `orphan` in current
metadata.json are safe to remove." At the time a new metadata.json that adds
a file to an `orphan` list is committed, a reader could be performing a read
using the previous metadata.json and planning to read the now-orphaned
statistics file. If a process then reads the new current metadata.json and
deletes the orphan file soon after, it could delete the file out from under
that reader.
I therefore do not think removing the previous orphan statistics file for a
snapshot as part of updating existing statistics for that snapshot is a
good idea. However, I think removing the orphan files as part of snapshot
expiration is fine. There is always the potential to remove files from
under a reader with snapshot expiration, but this is generic to all files
associated with a snapshot and we live with this "unsafeness".


On Wed, Mar 5, 2025 at 7:01 PM Wing Yew Poon  wrote:

> Hi Gabor,
>
> I agree that with the use of table and partition statistics (and possibly
> other auxiliary files in the future), this problem of orphan files due to
> recomputation of existing statistics (replacing existing files without
> deleting them) will grow. I agree that while remove_orphan_files would
> delete such files, it would be good to have some other mechanism for
> cleaning them up.
>
> Your proposal is interesting and I think it is feasible. It would require
> a spec change. Can we introduce a change for this in v3? If so, I'd
> suggest, for handling the existing cases of table statistics and partition
> statistics, to introduce two fields in table metadata, `orphan-statistics`
> and `orphan-partition-statistics`, which will be a list of table statistics
> and a list of partition statistics respectively. If we want to be more
> general, maybe we can have `orphan-files` instead, which will also be a
> list. The (table) `statistics` and `partition-statistics` structs already
> contain `snapshot-id` fields, so I don't think we need a map of snapshot-id
> to file. For future use cases, where a map keyed by snapshot-id could be
> useful, you are already assuming the files used correspond to snapshots, so
> it would also make sense for the struct representing them to contain
> snapshot-id.
>
> When table statistics or partition statistics for a snapshot are updated,
> if there are existing statistics for that snapshot, the existing file needs
> to be written into this `orphan-*` list. I don't think we need to use the
> mechanism of 'write.statistics.delete-after-commit.enabled' and
> 'write.statistics.previous-versions-max'. I think that if we require the
> orphan files to be cleaned up (the list trimmed in metadata and the files
> deleted) during snapshot expiration, that might be enough, if snapshot
> expiration is run frequently enough. If we want, as an additional/optional
> way to clean up these orphan files, when table statistics or partition
> statistics for a snapshot are updated, in addition to writing an existing
> file into the `orphan-*` list, any file in the `orphan-*` list for the same
> snapshot needs to be deleted and removed from the list as well. Note that
> any files that already appear as `orphan` in current metadata.json are safe
> to remove. (We still need snapshot expiration to remove all referenced
> orphan files for old snapshots, but this would potentially keep the lists
> shorter.) However, I think this is extra.
>
> What do folks think?
>
> - Wing Yew
>
> ps. I also found that the `statistics` and `partition-statistics` fields in
> table metadata are lists, with the unwritten expectation (that is to say,
> not written into the spec) that for each snapshot, there is at most one
> file in the list. I also thought about the idea that we could just add
> updated statistics to the list without removing the existing statistics
> (this would be allowed by the spec) and ensuring that the first one (or
> last one) for a snapshot is the latest one and thus the one to use. This
> way, we don't need a spec change, but much existing implementation would
> need to change and I think it is too complicated anyway.
>
>
> On Thu, Feb 27, 2025 at 5:31 AM Gabor Kaszab 
> wrote:
>
>> Thanks for the discussion on this topic during the community sync!
>> Let me sum up what we discussed and also follow-up with some additional
>> thoughts.
>>
>> *Summary:*
>> As long as the table is there users can run orphan file cleanup to remove
>> the orphaned stat files.
>> If you drop the table the orphaned stat files will remain on disk
>> unfortunately. This is however a catalog matter for the location ownership
>

Re: [DISCUSS] Cleanup unreferenced statistics files through DropTableData

2025-03-05 Thread Wing Yew Poon
Hi Gabor,

I agree that with the use of table and partition statistics (and possibly
other auxiliary files in the future), this problem of orphan files due to
recomputation of existing statistics (replacing existing files without
deleting them) will grow. I agree that while remove_orphan_files would
delete such files, it would be good to have some other mechanism for
cleaning them up.
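
(For concreteness, the existing cleanup would be the delete orphan files
action, e.g. the Spark action below; `spark` and `table` stand in for an
active SparkSession and a loaded Iceberg Table, and the 3-day cutoff is only
an example.)

    import java.util.concurrent.TimeUnit;
    import org.apache.iceberg.spark.actions.SparkActions;

    // Deletes files under the table location that are not referenced by any
    // table metadata and are older than the given timestamp.
    SparkActions.get(spark)
        .deleteOrphanFiles(table)
        .olderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3))
        .execute();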

Your proposal is interesting and I think it is feasible. It would require a
spec change. Can we introduce a change for this in v3? If so, I'd suggest,
for handling the existing cases of table statistics and partition
statistics, to introduce two fields in table metadata, `orphan-statistics`
and `orphan-partition-statistics`, which will be a list of table statistics
and a list of partition statistics respectively. If we want to be more
general, maybe we can have `orphan-files` instead, which will also be a
list. The (table) `statistics` and `partition-statistics` structs already
contain `snapshot-id` fields, so I don't think we need a map of snapshot-id
to file. For future use cases, where a map keyed by snapshot-id could be
useful, you are already assuming the files used correspond to snapshots, so
it would also make sense for the struct representing them to contain
snapshot-id.
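
(To make the shape concrete, something like the following in metadata.json is
what I have in mind. The `orphan-statistics` field name is just the proposal
above, not anything in the spec today, and I am only showing a subset of the
fields the existing statistics struct carries; the values are placeholders.

    "orphan-statistics": [
      {
        "snapshot-id": 3055729675574597004,
        "statistics-path": "s3://bucket/warehouse/db/tbl/metadata/old-stats.puffin",
        "file-size-in-bytes": 413
      }
    ]

`orphan-partition-statistics` would follow the same pattern for the partition
statistics struct.)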

When table statistics or partition statistics for a snapshot are updated,
if there are existing statistics for that snapshot, the existing file needs
to be written into this `orphan-*` list. I don't think we need to use the
mechanism of 'write.statistics.delete-after-commit.enabled' and
'write.statistics.previous-versions-max'. I think that if we require the
orphan files to be cleaned up (the list trimmed in metadata and the files
deleted) during snapshot expiration, that might be enough, if snapshot
expiration is run frequently enough. If we want an additional/optional way
to clean up these orphan files, then when table statistics or partition
statistics for a snapshot are updated, in addition to writing the existing
file into the `orphan-*` list, any file already in the `orphan-*` list for
that snapshot needs to be deleted and removed from the list as well. Note that
any files that already appear as `orphan` in current metadata.json are safe
to remove. (We still need snapshot expiration to remove all referenced
orphan files for old snapshots, but this would potentially keep the lists
shorter.) However, I think this is extra.

What do folks think?

- Wing Yew

ps. I also found that the `statistics` and `partition-statistics` fields in
table metadata are lists, with the unwritten expectation (that is to say,
not written into the spec) that for each snapshot, there is at most one
file in the list. I also thought about the idea that we could just add
updated statistics to the list without removing the existing statistics
(this would be allowed by the spec) and ensuring that the first one (or
last one) for a snapshot is the latest one and thus the one to use. This
way, we don't need a spec change, but much existing implementation would
need to change and I think it is too complicated anyway.


On Thu, Feb 27, 2025 at 5:31 AM Gabor Kaszab  wrote:

> Thanks for the discussion on this topic during the community sync!
> Let me sum up what we discussed and also follow up with some additional
> thoughts.
>
> *Summary:*
> As long as the table is there, users can run orphan file cleanup to remove
> the orphaned stat files.
> If you drop the table, the orphaned stat files will unfortunately remain on
> disk. This is, however, a catalog matter related to the ownership of the
> table's location.
>
> *My follow-up thoughts:*
> - Orphan file cleanup is not always feasible, e.g. when tables share
> locations.
> - Orphan files are expected when something goes wrong. With stat files,
> even successful queries can now create orphan files.
> - With time it seems that there are more and more new ways of creating
> orphan files even in a successful use case. Table stats, (soon-to-come)
> partition stats, and who knows what else (col stats? indexes?). The
> situation might not be that severe now but could get worse over time.
> - Users seem to complain even about the /data and /metadata folders
> remaining on storage after a table is dropped. Remaining stat files could also be
> a reason for recurring complaints.
>
> I think that even though orphan file removal (if feasible) could be a
> solution for the symptom here, the table format should offer a way to take
> care of the root cause (files becoming unreferenced when they are updated)
> too.
>
> *Proposal:*
> What I have in mind to tackle the root cause is to keep track of not only
> the current stat files but also the historical ones in the table metadata.
> This could increase the size of the metadata for sure, but could be kept at
> a manageable size with a similar mechanism to what we do with the
> historical metadata.jsons. In practice we could have flags like:
> 'write.statistics.delete-after-commit.enabled'
> 'write.statistics.previous-versions-max