from:"Brian Olsen"

👋 Intro and question for the community

2023-05-18 Thread Brian Olsen

Hey all,
My name is Brian and I'm the new Head of Developer Relations at Tabular.
I'd like to set up Common Room <https://www.commonroom.io/> so that we can
keep a pulse on the community. I would like to see if the community is
interested in enabling read-only
<https://docs.commonroom.io/get-started/integrations/github#required-permissions>
permissions on the apache/iceberg and apache/iceberg-docs repositories for
the GitHub integration. Here's how the information would be used:

   - Triage issues and PRs
   - Learn ways to improve developer/contributor experience in the community
   - Understand which PRs and issues are not getting attention and why
   - Set alerts and notifications for the Developer Relations team to
   follow up on issues to help drive changes in Iceberg
   - Metrics reporting to showcase Iceberg usage to drive further adoption
   and interest in Iceberg
   - Gaining a better understanding of the ways people use Iceberg and the
   features they are interested in
   - Showcase the diversity of contributions to the Iceberg project

Is everyone okay with me setting this up so I can help the community with
things like roadmap updates and making sure we follow up on reviews?

Who owns the @ApacheIceberg Twitter account?

2023-05-19 Thread Brian Olsen

Hey all,

Does anyone here own or know who owns the @ApacheIceberg Twitter account?

https://twitter.com/ApacheIceberg

Re: 👋 Intro and question for the community

2023-05-30 Thread Brian Olsen

Great question!

I asked the same questions to Common Room and this is what they responded
with:

> So with the app, we can pull deltas. With using the method with our own
> auth, we don’t. I’m not sure if it’s a limitation in how it was written,
> since our auth was written years ago, before we supported all the activity
> types. But I do understand that activities like stars would have to be
> repulled every time with our method, so we opt not to do that.

I looked into how their competitor, Orbit, does it
<https://orbit.love/docs/all/github-integration#aed6c486fcd746098531d5dda92a641c>
and they do the same. I'm not sure why. I imagine scraping is much more
difficult and limited, as the website is always subject to change. We used
Orbit in the Trino community and none of the data they scraped required
private access. So my best guess is that GitHub doesn't allow incremental
updates and that there is no way of scraping the site without pulling
everything all at once, which is what they do when you run a proof of
concept with them.
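
To make the distinction in the Common Room reply concrete, here is a minimal sketch of a full pull versus a delta pull. It runs against a made-up in-memory activity log; the function names, the `ACTIVITIES` data, and the timestamp cursor are assumptions for illustration only, not how Common Room, Orbit, or GitHub actually implement this.

```python
from datetime import datetime, timezone


def full_pull(activities):
    """Re-fetch every activity, regardless of what was seen before."""
    return list(activities)


def delta_pull(activities, cursor):
    """Fetch only activities newer than the cursor and advance the cursor."""
    fresh = [a for a in activities if a["at"] > cursor]
    # If nothing is new, the cursor stays where it was.
    new_cursor = max((a["at"] for a in fresh), default=cursor)
    return fresh, new_cursor


def utc(day):
    """Helper for building made-up May 2023 timestamps."""
    return datetime(2023, 5, day, tzinfo=timezone.utc)


# Hypothetical activity log standing in for a repository's public activity.
ACTIVITIES = [
    {"kind": "issue", "at": utc(1)},
    {"kind": "star", "at": utc(10)},
    {"kind": "pull_request", "at": utc(20)},
]
```

A delta pull only works when each activity type can be filtered by a cursor like this. If an activity type cannot be filtered that way, the only option left is the full pull on every sync, which is the cost the reply describes for stars.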

Other Apache communities have connected with them, so a next step could be
to reach out to some of those communities. I'd also be happy to bring
anyone who is interested onto a call with Common Room to ask any other
questions.

Let me know what you think.

On Tue, May 30, 2023 at 10:54 AM Russell Spitzer 
wrote:

> Could you please elaborate on what Common Room really is and why it needs
> special permissions? I would have thought just generic public access
> would be enough to check PRs, issues, and such?
>
> On Tue, May 23, 2023 at 7:17 PM Anton Okolnychyi
>  wrote:
>
>> Seems valuable to me.
>>
>> - Anton
>>
>> On May 18, 2023, at 2:44 PM, Brian Olsen  wrote:
>>
>> Hey all,
>> My name is Brian and I'm the new Head of Developer Relations working at
>> Tabular. I'd like to set up Common <https://www.commonroom.io/> Room
>> <https://www.commonroom.io/> for us to have a bit of a pulse on the
>> community. I would like to see if the community is interested in enabling
>> read-only
>> <https://docs.commonroom.io/get-started/integrations/github#required-permissions>
>> permissions for the apache/iceberg and apache/iceberg-docs for the GitHub
>> integration. Here's how the information would be used:
>>
>>    - Triage issues and PRs
>>    - Learn ways to improve developer/contributor experience in the
>>    community
>>    - Understand which PRs and issues are not getting attention and why
>>    - Set alerts and notifications for the Developer Relations team to
>>    follow up on issues to help drive changes in Iceberg
>>    - Metrics reporting to showcase Iceberg usage to drive further
>>    adoption and interest in Iceberg
>>    - Gaining a better understanding of the ways people use Iceberg and
>>    the features they are interested in
>>    - Showcase the diversity of contributions to the Iceberg project
>>
>> Is everyone okay with me setting this up so I can help the community with
>> things like roadmap updates and making sure we follow up on reviews?
>>
>>
>>

Re: 👋 Intro and question for the community

2023-05-30 Thread Brian Olsen

I’ve spoken to Ryan, and the understanding is that anyone in the community
would have access upon request, assuming they’re only getting the read
roles to public repos that these apps request anyway.

Tools like Orbit have a free open source tier that can be used, but it is
limited to 3-5 users. We are paying for Common Room at Tabular, and
Confluent/Kafka and Imply/Druid have this integration set up as well.

I’d be happy to discuss getting one of these set up for PMC usage, but I
would mostly be using the one at Tabular. So the PMC one would be a shared
way to manage a view of the community for PMC work vs. DevRel work.

Does that make sense?

On Tue, May 30, 2023 at 11:57 AM Jack Ye  wrote:

> Seems like a valuable and interesting product to use!
>
> Are there any restrictions on Apache side to use such product integration?
> Is it a free product for us to use?
>
> Best,
> Jack Ye
>

Re: 👋 Intro and question for the community

2023-06-06 Thread Brian Olsen

Hi Jean-Baptiste,

Common Room (https://www.commonroom.io/) is an application used to
comprehensively understand activities happening across a community, so that
a team focusing on developer relations can better respond to issues,
understand where bottlenecks exist, and pursue many other potential
applications around optimizing releases and developer experience.

Three PMC members have already replied to this list, and after I talked
about it with Ryan Blue, he said this discussion would make more sense in a
public forum for all to see. That said, I’m more than happy if the PMC
would like to hold a formal vote on adding this capability, if any of them
feel that is necessary.

On Tue, Jun 6, 2023 at 4:30 AM Jean-Baptiste Onofré  wrote:

> Hi Brian,
>
> Can you please describe a bit what you mean by Common Room ?
>
> At first glance, it looks like a good idea. However, from Apache
> standpoint, it has to be approved by the PMC members. Did you request
> so on the private mailing list ?
>
> Regards
> JB
>
> On Thu, May 18, 2023 at 11:44 PM Brian Olsen 
> wrote:
> >
> > Hey all,
> > My name is Brian and I'm the new Head of Developer Relations working at
> Tabular. I'd like to set up Common Room for us to have a bit of a pulse on
> the community. I would like to see if the community is interested in
> enabling read-only permissions for the apache/iceberg and
> apache/iceberg-docs for the GitHub integration. Here's how the information
> would be used:
> >
> > Triage issues and PRs
> > Learn ways to improve developer/contributor experience in the community
> > Understand which PRs and issues are not getting attention and why
> > Set alerts and notifications for the Developer Relations team to follow
> up on issues to help drive changes in Iceberg
> > Metrics reporting to showcase Iceberg usage to drive further adoption
> and interest in Iceberg
> > Gaining a better understanding of the ways people use Iceberg and the
> features they are interested in
> > Showcase the diversity of contributions to the Iceberg project
> >
> > Is everyone okay with me setting this up so I can help the community
> with things like roadmap updates and making sure we follow up on reviews?
>

Meeting Minutes from 2023-06-07 Iceberg Sync

2023-06-09 Thread Brian Olsen

Hi Iceberg Community,
Here are the minutes and recording from our Iceberg Sync. They will now be
posted to the new Apache Iceberg YouTube channel
<https://www.youtube.com/playlist?list=PLkifVhhWtccwcQrNnjEPxbUPX9Q2eCAPO>.

Always remember, anyone can join the discussion so feel free to share the
Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with
anyone seeking an invite.

The notes and the agenda are posted in the Iceberg Sync YouTube description
and in this document
<https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>,
which is also attached to the meeting invitation. It's an excellent place
to add items as you see fit so we can discuss them in the following
community sync.


Meeting Recording ⭕ <https://www.youtube.com/watch?v=2rOm5TOafxU>
Meeting Transcript, can be found here in the video
<https://youtu.be/1lm4Wlpy2wU?t=28>

Attendees:
Alex Merced, Ashish Paliwal, Bijan Houle, Brian Olsen, Bryan Keller, Daniel
Weeks, Dennis Huo, Dmitri Bourlatchkov, Fokko Driesprong, Jack Ye,
Jacqueline Yeung, Jiao Yizheng, Jonas Jiang, Namratha Mysore
Keshavaprakash, Rajasekhar Konda, Ryan Blue, Shawn Gordon, Steen
Gundersborg, Steve Z, Vicky Bukta, Wing Yew Poon, mohan vamsi.

Highlights:
- Apache Iceberg 1.3.0 has been released :tada: :partying_face:
- Added encryption key and AAD to Parquet write builders (Thanks, Gidon!)
- Spark 3.4 supports timestamp_ntz (Thanks, Fokko!)
- Rebuilt Spark MERGE file handling (Thanks, Anton!)

Releases:
- PyIceberg 0.4.0 release
- Apache Iceberg 1.3.0

Discussion:
- Adaptive split planning in core and Spark (
https://github.com/apache/iceberg/pul...,
https://github.com/apache/iceberg/pul...)
- Multi-table transactions Catalog API (
https://github.com/apache/iceberg/pul...)
- Incremental scan API (https://github.com/apache/iceberg/pul...)
- Views status update
- Partition stats (https://github.com/apache/iceberg/pul...)
- Metadata deletion and gc.enabled

AI-generated chapter summaries:
0:00 Chapter 1
Brian, Daniel, Dmitri, and Jack discussed various updates and improvements
to different tools and systems, including the release of Apache Iceberg
1.3, progress on encryption, the addition of UID and timestamp support in
Spark, and an overhaul of file handling in Spark's merge plan. They also
mentioned the need to discuss these updates further in the train-out
community.

5:33 Chapter 2
Jack suggests sharing their data preparation process with the Spark
community.

5:55 Chapter 3
The group discussed different approaches to split planning for parallelism
in Spark, including adapting split sizes based on the amount of data and
creating larger splits for larger scans. They also considered the right
amount of parallelism and balancing the number of workers with the number
of parents to achieve cost savings.
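
As a rough illustration of the "adapt split sizes to the amount of data" idea summarized above, the sketch below picks a split size that aims for a target number of splits and clamps it to a minimum and maximum. The function names, the 16 MiB and 512 MiB bounds, and the target-parallelism input are all assumptions made for this example; this is not the implementation discussed in the linked PRs.

```python
MIB = 1024 * 1024


def adaptive_split_size(total_scan_bytes, target_parallelism,
                        min_split_bytes=16 * MIB,
                        max_split_bytes=512 * MIB):
    """Pick a split size that yields roughly target_parallelism splits.

    The result is clamped to [min_split_bytes, max_split_bytes], so small
    scans do not produce tiny splits and large scans get larger splits.
    All defaults here are hypothetical.
    """
    if target_parallelism <= 0:
        raise ValueError("target_parallelism must be positive")
    # Ceiling division so the splits cover the whole scan.
    ideal = -(-total_scan_bytes // target_parallelism)
    return max(min_split_bytes, min(max_split_bytes, ideal))


def planned_split_count(total_scan_bytes, split_bytes):
    """Number of splits needed to cover the scan at the given split size."""
    return -(-total_scan_bytes // split_bytes)
```

With these assumed bounds, a 1 GiB scan planned for 8 tasks gets 128 MiB splits, a 10 MiB scan is held at the 16 MiB floor, and a very large scan is capped at 512 MiB per split, so its split count grows with the data instead of the split size.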

16:41 Chapter 4
The group discussed various topics including adaptive split planning and
multi-table transactions with extensions to the catalog API. They
considered different approaches and methods for implementation, with Fokko
expressing satisfaction with the new multi-table API.

27:00 Chapter 5
Jack, Fokko, and others discussed various topics such as multi-table swaps,
incremental change log scan with deletes, partition stats, and publishing
stats. They explored the possibility of putting additional stats like NDVs
in the manifest level, but faced challenges in tracking sketches and
inflating metadata size. They considered options like partition level stats
and table level NDVs, and suggested calculating NDVs for individual
partitions to weigh off in the estimate.

37:35 Chapter 6
Jack and Daniel discussed the importance of tracking statistics like NDV
and in-memory column size at the partition level. They also talked about
the GC-enabled property and how it applies to tables that don't own their
own files.

47:17 Chapter 7
The group discussed the issue of metadata file deletion and whether it
should be controlled by GC-enabled. They considered various options,
including leaving all garbage files and having a flag to stop any cleanup,
but ultimately decided that changing the behavior of the library for
something that violates the spec would be problematic.

Thanks! See you all at the next sync!

Iceberg Kafka Connect Sink

2023-06-21 Thread Brian Olsen

Hey Iceberg Devs,

Just wanted to make you all aware of a project Tabular has been building: a
new Kafka Connect sink for Iceberg. There were a few use cases we wanted to
address for customers and Iceberg users alike that just weren't being
enabled, namely exactly-once semantics in the sink, commit coordination,
and multi-table fanout. There are links below if you want to learn more. We
started this under the Tabular org to iterate faster, and we'd like to gauge
the community's interest in us contributing it once it has reached full
maturity.

Blog: https://tabular.io/blog/intro-kafka-connect/
Repo: https://github.com/tabular-io/iceberg-kafka-connect

Stay classy folks!
Bits

Meeting Minutes from 2023-06-28 Iceberg Sync

2023-07-02 Thread Brian Olsen

Hey Iceberg Nation!

Here are the minutes and recording from our Iceberg Sync. As a reminder,
anyone can join the discussion so feel free to share the Iceberg-Sync.

*NOTE:* Due to technical difficulties with folks not receiving invitations
from the Iceberg Sync Google Group, we will move to sharing the link
through a public calendar posted on the Apache Iceberg site
<https://iceberg.apache.org/community/#iceberg-community-events>. Please
test that you can find the longstanding Google meeting on the calendar
before the next meeting. I'll be starting a separate thread on the dev
list, so please discuss this or pose any questions or concerns there.

The notes and the agenda are posted in the Iceberg Sync YouTube description
and are also maintained in the meeting minutes notes
<https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>,
which are linked on the calendar invite.

Meeting Recording ⭕ / Meeting Transcript, can be found here in the video
<https://youtu.be/1lm4Wlpy2wU?t=28>

Highlights:
- OOM fix caused by Avro decoder caching (Coney Liu)
- Multiple shuffle partitions per file (Anton)
- Python: Add positional deletes (Fokko)
- Python: alter table in transactions (Fokko)
- View metadata implementation (Eduard)
- View support for InMemoryCatalog (Eduard)
- API for multi-table commits (Eduard)
  - Catalog Transaction API is still in draft
- Flink: Split ordering based on Sequence Number (Peter)
- Adding new Iceberg Events Calendar including this sync. (Brian)
  - Any PMC that wants access to edit this calendar reach out to me on
    Slack/Dev list. Will announce to dev list.

Releases:
- Python 0.4 vote is out! (Please verify)

Discussion:
- bloom/cuckoo/other filters in manifest
- Priority-based commit
- Documentation efforts #documentation (Brian)
  - Proposed updates to the docs site

AI-generated chapter summaries:

- 0:00 Chapter 1 Daniel, Brian, and the team discussed various updates
and progress made in the past few weeks, including fixes for memory issues,
improvements in shuffle partitions for file compaction, positional delete
support in Python, and the implementation of view metadata. They also
mentioned plans for engine integrations and the need for further
discussions on multi-table transactions.
- 10:29 Chapter 2 The team discussed the progress and next steps for
implementing multi-table commits and improving support for ordered read
from Iceberg tables. They also discussed plans to make the Iceberg
community meetings more accessible by providing a public link instead of
requiring a subscription to the dev list.
- 21:03 Chapter 3 The team discussed the new features and improvements
included in the ODOP release, such as positional leads, SQL style filters,
and performance enhancements. They also explored the possibility of adding
Bloom filters to the manifest file to improve point query performance.
- 31:54 Chapter 4 The discussion revolved around the potential use of
Parquet Bloom filters to improve performance and reduce costs. They also
discussed the possibility of introducing priority-based commit and the
challenges associated with it.
- 42:00 Chapter 5 The team discussed the possibility of implementing a
rollback feature in the catalog and snapshot producer. They also talked
about prioritizing documentation updates and planning to add examples and
tutorials in the future.

Thanks! See you all at the next sync!

Moving from the Iceberg Sync Google group to Community Calendar.

2023-07-02 Thread Brian Olsen

Hey Iceberg Nation!

Due to technical difficulties with folks not receiving invitations from the
Iceberg Sync Google Group, we will move to sharing the link through a
public calendar posted on the Apache Iceberg site
<https://iceberg.apache.org/community/#iceberg-community-events>.

As mentioned on the website, there are two calendars you can subscribe
to. Iceberg Dev Events is the calendar that contains events such as the
triweekly Iceberg sync. It aims to discuss the project roadmap and how to
implement features. The Iceberg Community Events calendar contains events
such as conferences and meetups, aimed to educate and inspire Iceberg
users. The former is aimed at contributors while the latter is aimed at
users. Feel free to subscribe to both.

Please test that you can find the longstanding Google meeting on the
calendar before the next meeting and post any questions or concerns you
have to this thread.

Bits, over and out

Rust support support

2023-07-06 Thread Brian Olsen

Hey Iceberg Nation,

From Reddit:
https://www.reddit.com/r/dataengineering/comments/14rcaj9/iceberg_won_the_table_format_war_but_not_in_the/jqva6if/

What I want most right now is a native library for Iceberg format, that can
be used to read/write Iceberg table from multiple programming languages,
similar to how delta-rs supports accessing Delta Lake format in
Rust/Python/etc...

https://github.com/apache/iceberg/issues/5122

This issue looks at both C++ and Rust. I’m not sure how much demand there
is for C++, but there’s already a library for Iceberg Rust with 22 stars
and 7 forks.

https://github.com/JanKaul/iceberg-rust

Have we seen more demand for this? Delta Lake supports it but that alone
isn’t a great reason. At what point would it make sense to get this adopted
by the project? Would this involve gathering evidence, presenting it, and
taking a vote?

Thanks

Re: Rust support support

2023-07-13 Thread Brian Olsen

Jan,

Sorry for the delay in reply. I would love to help this process along to
reach more people using Iceberg. I am just a developer advocate trying to
match people together. Let me point this to a few of the committers and see
what they think.

Thanks :)

On Mon, Jul 10, 2023 at 3:56 AM Jan Kaul 
wrote:

> Hi Brian,
>
> thanks for starting the discussion around a native library for iceberg.
> I'm Jan the owner of the iceberg-rust repository. Recently, another
> project started the implementation of an iceberg rust library
> (https://github.com/icelake-io/icelake). I think this is great timing
> to start a rust project in the official apache iceberg repository that
> we can all work on together.
>
> I have also been experimenting with ways to combine a C++ and a rust
> library. There are ways to create a rust library that could be used from
> C++ but these would bring minor inconveniences on the rust side. I don't
> know how many people would want to use a rust SDK from C++ and therefore
> I'm not sure if this is worth the effort.
>
> Another point I was thinking about is whether to include the support for
> the WASM component model in the rust library. This way all wasm
> compatible programming languages would have access to the iceberg SDK.
>
> I'm curious what your opinions are on these topics.
>
> Best wishes,
>
> Jan
>
> On 06.07.23 12:08, Brian Olsen wrote:
> > Hey Iceberg Nation,
> >
> > From Reddit:
> >
> https://www.reddit.com/r/dataengineering/comments/14rcaj9/iceberg_won_the_table_format_war_but_not_in_the/jqva6if/
> >
> > What I want most right now is a native library for Iceberg format,
> > that can be used to read/write Iceberg table from multiple programming
> > languages, similar to how delta-rs supports accessing Delta Lake
> > format in Rust/Python/etc...
> >
> > https://github.com/apache/iceberg/issues/5122
> >
> > This issue looks at both C++ and Rust, I’m not sure how much demand
> > there is for C++ but there’s already a library for Iceberg Rust with
> > 22 stars and 7 forks.
> >
> > https://github.com/JanKaul/iceberg-rust
> >
> > Have we seen more demand for this? Delta Lake supports it but that
> > alone isn’t a great reason. At what point would it make sense to get
> > this adopted by the project? Would this involve gathering evidence,
> > presenting it, and taking a vote?
> >
> > Thanks
> >
>

Re: Iceberg docs pull requests

2023-07-14 Thread Brian Olsen

I’m planning to propose a bunch of changes here. If you’re interested in
discussing how you can help, please join the Iceberg Slack and the
#documentation channel. I would like to distribute work once some
suggestions are accepted. I’m out of office today and early next week, but
I’ll start work on these next week.

On Fri, Jul 14, 2023 at 10:22 AM Russell Spitzer 
wrote:

> One of the issues is we kind of have a dual repo doc process. Most doc
> changes that are versioned are made in the main oss apache repo and they
> are copied over when Iceberg is released. So changes against the doc repo
> are only for fixing past docs or non-versioned pages.
>
> I'll take a quick look at the outstanding PR's but most of them seem to be
> awaiting changes
>
> On Fri, Jul 14, 2023 at 10:05 AM Zsolt Miskolczi <
> zsolt.miskol...@gmail.com> wrote:
>
>> Hey team!
>>
>> First of all, thank you for giving us Iceberg. I really love the concept
>> of how Iceberg structures the files and I'm pretty sure it is the future
>> of storing files in big data.
>>
>> Although the community is pretty active in developing Iceberg, I have
>> a feeling that documentation doesn't get the attention that it deserves.
>>
>> I saw the open pull requests in iceberg-docs and I couldn't help but
>> notice that there are pull requests that are more than a month old and
>> haven't received any review at all.
>>
>> Can I draw the community's attention to the documentation?
>>
>> Thank you,
>> Zsolt Miskolczi
>>
>>

Re: [PROPOSAL] Preparing first Apache Iceberg Summit

2023-07-19 Thread Brian Olsen

Hey JB,

I would love to hop on a call and discuss how I can help as well. I've
planned a couple of these before. :)

On Wed, Jul 19, 2023 at 10:54 AM Russell Spitzer 
wrote:

> I would love to be involved if possible. I'm a bit short on time though
> but can definitely contribute async time to planning.
>
> On Wed, Jul 19, 2023 at 9:35 AM Jean-Baptiste Onofré 
> wrote:
>
>> Hi guys,
>>
>> Following the previous email about Apache Iceberg Summit, please find
>> a document introducing the summit organization:
>>
>>
>> https://docs.google.com/presentation/d/1iy2-WdVQYTwJOrwi7pFYh_x9xHNuGV5lT1yK1g194To/edit?usp=sharing
>>
>> I'm kindly doing a Call For Action: anyone interested to help in the
>> organization and participate to the committees, please let me know.
>> I would like to schedule a meeting with all interested parties.
>>
>> Thanks !
>>
>> Regards
>> JB
>>
>> On Wed, Jul 5, 2023 at 4:37 PM Jean-Baptiste Onofré 
>> wrote:
>> >
>> > Hi everyone,
>> >
>> > I started a discussion on the private mailing list, and, as there are
>> > no objections from the PMC members, I'm moving the thread to the dev
>> > mailing list.
>> >
>> > I propose to organize the first Apache Iceberg Summit \o/
>> >
>> > For the format, I think the best option is a virtual event with a mix
>> of:
>> > 1. Dev community talks: architecture, roadmap, features, use in
>> "products", ...
>> > 2. User community talks: companies could present their use cases, best
>> > practices, ...
>> >
>> > In terms of organization:
>> > 1. no objection so far from the PMC members to use Apache Iceberg
>> > Summit name. If it works for everyone, I will send a message to the
>> > Apache Publicity & Marketing to get their OK for the event.
>> >  2. create two committees:
>> >   2.1. the Sponsoring Committee gathering companies/organizations
>> > wanting to sponsor the event
>> >   2.2. the Program Committee gathers folks from the Iceberg community
>> > (PMC/committers/contributors) to select talks.
>> >
>> > My company (Dremio) will “host” the event - i.e., provide funding, a
>> > conference platform, sponsor logistics, speaker training, slide
>> > design, etc..
>> >
>> > In terms of dates, as CommunityOverCode Con NA will be in October, I
>> > think January 2024 would work: it gives us time to organize smoothly,
>> > promote the event, and not in a rush.
>> >
>> > I propose:
>> > 1. to create the #summit channel on Iceberg Slack.
>> > 2. I will share a preparation document with a plan proposal.
>> >
>> > Thoughts ?
>> >
>> > Regards
>> > JB
>>
>

Re: Broken slack invite

2023-07-24 Thread Brian Olsen

Did you pull it from the invite link on the community page? I wasn’t aware
that one existed and only updated the link at the top of the page. We
definitely need to invest in variables when we update the docs. Until then,
please use the link below or click the Slack icon in the top right of the
screen on the Iceberg site.

On Mon, Jul 24, 2023 at 12:51 PM Russell Spitzer 
wrote:

>
> https://github.com/apache/iceberg-docs/commit/a42abbf9e7cda62ac4d94943599e840d4342d6c5
>
> It was just updated but I don't think the docs have been republished yet
>
>
> On Jul 23, 2023, at 7:34 AM, Bruno Murino  wrote:
>
> Hi,
>
> I’m trying to access the slack workspace for Apache Iceberg but I think
> the link is broken.
>
> Can I be added please?
>
> Cheers,
>
> Bruno Murino
>
>
>

Proposal to fix the docs - this time it'll be different

2023-07-26 Thread Brian Olsen

Hey all,

I have some proposals I'd like to make for fixing the docs. I would want to
do this in two phases.

In the first phase, I'm proposing that we move all the documentation
(reference docs, website, and pyIceberg) back into the apache/iceberg
repository. I explain my reasoning in the attached document. This phase
would also move us from Hugo to MkDocs but keep all the content the same.

The second phase is focused on iteratively building out the content that
we've marked missing in the proposal that Sam R. created along with a
recent community member, Mahfuza. We will also restructure the content to
follow the Diátaxis method (https://diataxis.fr/).

https://docs.google.com/document/d/1WJXzcwC6isfoywcLY2lZ9gZN6i0JU1I2SIZmCkumbZc/edit#heading=h.gli9mc2ghfz1

Let me know what you think and bring on the questions and criticisms
please! :)

Bits

Re: Proposal to fix the docs - this time it'll be different

2023-07-27 Thread Brian Olsen

Thanks Fokko,

Yeah, I think to address that we would need to switch to a tagging scheme
that prefixes each tag with the project name as a namespace within the tag
space (e.g. pyIceberg-0.4.0, rust-0.0.1, etc…). But certainly this would
result in an explosion of tags as we continue to introduce more projects.
I’m not sure this makes it difficult to find things; as long as you start
your search with the prefix in GitHub, it should be easy enough to find.
Has anyone else worked on a project where this type of tagging is applied?
Are there any performance, searching, or other implications we are missing?
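
For anyone weighing the prefix idea, here is a small sketch of how tooling might group such namespaced tags by project. It assumes a hypothetical `<project>-<major>.<minor>.<patch>` tag format based on the examples above; the pattern and helper names are made up for illustration and do not reflect any existing Iceberg release tooling.

```python
import re
from collections import defaultdict

# Hypothetical tag format: "<project>-<major>.<minor>.<patch>",
# e.g. "pyIceberg-0.4.0" or "rust-0.0.1".
TAG_PATTERN = re.compile(
    r"^(?P<project>[A-Za-z][A-Za-z0-9]*)-(?P<version>\d+\.\d+\.\d+)$"
)


def group_tags_by_project(tags):
    """Group namespaced tags by project prefix, versions sorted ascending."""
    grouped = defaultdict(list)
    for tag in tags:
        match = TAG_PATTERN.match(tag)
        if match is None:
            # Tags without a project prefix are skipped in this sketch.
            continue
        version = tuple(int(p) for p in match.group("version").split("."))
        grouped[match.group("project")].append(version)
    return {project: sorted(vs) for project, vs in grouped.items()}


def latest_tag(tags, project):
    """Return the newest tag for a project, or None if it has no tags."""
    versions = group_tags_by_project(tags).get(project)
    if not versions:
        return None
    return f"{project}-" + ".".join(str(p) for p in versions[-1])
```

Versions are compared as integer tuples rather than strings, so that a tag like pyIceberg-0.10.0 would sort after pyIceberg-0.4.0. Any build step that picks "the latest docs version" per project would need the same care.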

Bits

On Thu, Jul 27, 2023 at 4:18 AM Fokko Driesprong  wrote:

> Hey Brian,
>
> Thanks for raising this. As a release manager, I can confirm that the
> current structure is confusing, and I can also see the community
> struggling with this because they are willing to contribute to the docs,
> but cannot always find the place where to do this. I think the complexity
> of the current website mostly comes from the versioned docs. It would be
> great if we can find a way to make this easier. Instead of using the
> branches, we could also use the release tags and build the docs for those
> versions.
>
> I think switching to mkdocs-material is a great idea. We currently also
> use this for PyIceberg, and it works really well. My main concern is around
> merging everything together. Should we combine Java and Python in the same
> documentation? They have a different versioning scheme, so that would
> create a matrix of versions. Go and Rust
> <https://github.com/apache/iceberg-rust/issues/8> are also in the making,
> so that would explode at some point.
>
> Cheers, Fokko
>
> Ps. Currently, PyIceberg uses the gh-pages branch for publishing the docs
> <https://github.com/apache/iceberg/tree/gh-pages>.
>
>
> Op do 27 jul 2023 om 00:04 schreef Brian Olsen :
>
>> Hey all,
>>
>> I have some proposals I'd like to make for fixing the docs. I would want
>> to do this in two phases.
>>
>> In the first phase, I'm proposing that we move all the documentation
>> (reference docs, website, and pyIceberg) back into the apache/iceberg
>> repository. I explain my reasoning in the attached document. This phase
>> would also move us from Hugo to MkDocs but keep all the content the same.
>>
>> The second phase is focused on iteratively building out the content that
>> we've marked missing in the proposal that Sam R. created along with a
>> recent community member, Mahfuza. We will also restructure the content to
>> follow the Diátaxis method (https://diataxis.fr/).
>>
>>
>> https://docs.google.com/document/d/1WJXzcwC6isfoywcLY2lZ9gZN6i0JU1I2SIZmCkumbZc/edit#heading=h.gli9mc2ghfz1
>>
>> Let me know what you think and bring on the questions and criticisms
>> please! :)
>>
>> Bits
>>
>

Re: Location of rust repo

2023-08-08 Thread Brian Olsen

Hey Rust folks, tomorrow is the Iceberg Community Sync. We will be
discussing the locations of the Iceberg client APIs. While I
think everyone is okay with Rust being outside of the main repo, a few of
us voiced a consistency concern with Go and pyIceberg. I'd like to have a
short discussion about going in one direction or the other for all client
projects. The one clear disadvantage of separate repos is that changes
affecting both the client and the server will be spread across multiple PRs.
This should be rare enough with the clients that it shouldn't matter, but
let's discuss more tomorrow if you all can attend. Thanks!

On Wed, Jul 19, 2023 at 12:49 PM Russell Spitzer 
wrote:

> +1, If the folks working on Rust want it in the main repo I have no issues
> with that but it should be their choice :)
>
> On Wed, Jul 19, 2023 at 12:47 PM Ryan Blue  wrote:
>
>> I don't have a strong opinion here. I'd probably lean toward having it in
>> the main repo to get more eyes on the PRs, but I think it's primarily up to
>> the people contributing to the project.
>>
>> On Wed, Jul 19, 2023 at 2:30 AM Jan Kaul 
>> wrote:
>>
>>> Hey all,
>>>
>>> we just had our first sync for the rust iceberg developers and it was
>>> great to talk to everyone.
>>>
>>> The most important point that came up was the location where the rust
>>> development should take place. The two options are either to have a
>>> separate "iceberg-rust" repository or to create a "rust" folder in the
>>> existing apache/iceberg repository.
>>>
>>> The benefits of a separate repository are separate CI, simpler merging
>>> of PRs and a more scalable solution if more languages are added.
>>>
>>> The benefits of a subfolder in the existing repository are more
>>> visibility, easier coordination with the java project and more feedback
>>> from the community.
>>>
>>> The developers currently working on the rust implementation slightly
>>> favor a separate repository but would be okay with using the existing
>>> repository.
>>>
>>>
>>> It would be great if you could share your opinions on the topic. Maybe
>>> this could also be a point for the community sync later today.
>>>
>>> Hope you're all doing well. Best wishes,
>>>
>>> Jan
>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

[PROPOSAL] Expose Apache Iceberg Slack data for hacking/education for the community

2023-08-08 Thread Brian Olsen

Hey Iceberg Nation,

I wanted to propose exposing the public Apache Iceberg Slack
<https://apache-iceberg.slack.com/> chat and user data for the community to
use as a public data source. I have a couple of specific use cases in mind
for it, which is what brought me to ask.

The main problem I want to address for the community is the lack of
persistence of the answers we're generating in Slack. Slack is on a free
version that only retains the last 60 days of valuable threads happening
there. Questions are repeatedly asked, and this takes up time for everyone
in the community to answer the same questions multiple times. If we publish
the public chat and user data (i.e. no emails or user info outside of
what's displayed in Slack), then we can address this in the following ways:

   1. We can use this in a getting-started tutorial featuring pyIceberg that
   pulls this dataset into a Python or SQL ecosystem, both to learn about
   Iceberg and to discover old conversations that no longer appear on Slack.
   We can also take the raw data and push it into a local chatbot for folks
   to ask questions locally, build analytics projects, etc.
   2. For those who are less interested in building their own chatbot or
   data pipeline, once this data is available, Tabular could use it to build
   and maintain a Discourse Forum <https://discourse.org/> (not to be
   confused with Discord). There are many reasons to add this on top of Slack,
   like persistence, discoverability via Google, curation and organization
   into wiki-style, to-the-point answers, and gamification. The goal is
   that it's not just Tabular moderating this, but that the community takes
   over as they build trust, similar to Stack Overflow. Of course, once we have
   the initial community working together there, we can use Slack for
   faster messaging and migrate specific valuable conversations to Discourse
   once it is done.
   3. Another idea would be to use the Discourse forum as
   one of the inputs to create some sort of chatbot experience, either in
   Slack or nested in the docs. This would likely outperform training
   directly on Slack data, as answers in Slack aren't verified or curated to
   the most concise form possible.
   4. The Slack and Tabular Discourse forum would be public to read, so
   this would allow for other companies in the space to build their own
   solutions.


The idea is that we would run a daily job that would export the Slack logs
to some public dumping ground (GitHub or something) to store this dataset.
Again, only public data that you could see if you signed up and logged into
Slack would be exposed.
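
As a rough illustration of the "only public data" rule, the export job could pass every message through an allowlist before publishing it. This is only a sketch: the field names below are assumptions for the example, not a confirmed Slack export schema, and the real list would need to be agreed on by the community and the PMC.

```python
# Hypothetical allowlist: only fields that anyone logged into the public
# Slack could already see. Everything else (emails, profile details, etc.)
# is dropped before the data is published.
PUBLIC_FIELDS = ("display_name", "channel", "text", "ts", "thread_ts")


def scrub_message(message):
    """Return a copy of one message containing only allowlisted fields."""
    return {key: message[key] for key in PUBLIC_FIELDS if key in message}


def scrub_export(messages):
    """Scrub every message in an export before it is written anywhere public."""
    return [scrub_message(message) for message in messages]
```

An allowlist is the safer default here: a field Slack adds later stays private until someone deliberately adds it to the list.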

How does this sound to everyone? Let me know if you have any questions or
other ideas!

Bits

Re: [PROPOSAL] Expose Apache Iceberg Slack data for hacking/education for the community

2023-08-08 Thread Brian Olsen

Good point. It looks like Slack's TOS
<https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Salesforce_MSA.pdf?_gl=1*1u5n6fj*_ga*MTU2MzM4Mjk5OC4xNjgyNTM4NjIz*_ga_QTJQME5M5D*MTY5MTUzMDE1Mi40Mi4xLjE2OTE1MzA4MzIuMjkuMC4w>,
in section 3.3, points us to Salesforce's External Facing Services Policy
<https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/policies/ExternalFacing_Services_Policy.pdf>.
The main things that policy addresses are consent for businesses under NDAs
on public or shared channels, private conversations or PII being exported
without consent, and a bunch of other clearly illegal stuff we're not doing.

I think since this data is public in the sense that anyone with the
publicly available invite can join and read/see display names, we are fine.
Slack has nothing in there about a PMC admin running an export to get
access to the data that's owned by the ASF. So I believe that as long as we
get consent from the community and the PMC is okay with it, we should be
fine from a legal standpoint, provided we don't export private information
like emails or private chats.

On Tue, Aug 8, 2023 at 4:53 PM Russell Spitzer 
wrote:

> I'm +1 as long as Slack TOS are ok with it. We already have full public
> archives of the mailing list and I see slack as just an extension of the
> mailing list.
>
> On Tue, Aug 8, 2023 at 4:18 PM Brian Olsen 
> wrote:
>
>> Hey Iceberg Nation,
>>
>> I wanted to propose having the public Apache Iceberg Slack
>> <https://apache-iceberg.slack.com/> chat and user data for the community
>> to use as a public data source. I have a couple of specific use cases in
>> mind that I would like to use it for, hence what brought me to ask about it.
>>
>> The main problem I want to address for the community is the lack of
>> persistence of the answers we're generating in Slack. Slack is on a free
>> version that only retains the last 60 days of valuable threads happening
>> there. Questions are repeatedly asked, and this takes up time for everyone
>> in the community to answer the same questions multiple times. If we publish
>> the public chat and user data (i.e. no emails or user info outside of
>> what's displayed in Slack), then we can address this in the following ways:
>>
>>    1. We can use this in a getting-started tutorial featuring pyIceberg
>>    that pulls this dataset into a Python or SQL ecosystem, both to learn
>>    about Iceberg and to discover old conversations that no longer appear
>>    on Slack. We can also take the raw data and push it into a local
>>    chatbot for folks to ask questions locally, build analytics projects,
>>    etc.
>>    2. For those who are less interested in building their own chatbot or
>>    data pipeline, once this data is available, Tabular could use it to
>>    build and maintain a Discourse Forum <https://discourse.org/> (not to
>>    be confused with Discord). There are many reasons to add this on top
>>    of Slack, like persistence, discoverability via Google, curation and
>>    organization into wiki-style, to-the-point answers, and gamification.
>>    The goal is that it's not just Tabular moderating this, but that the
>>    community takes over as they build trust, similar to Stack Overflow.
>>    Of course, once we have the initial community working together there,
>>    we can use Slack for faster messaging and migrate specific valuable
>>    conversations to Discourse once it is done.
>>    3. Another idea would be to use the Discourse forum as one of the
>>    inputs to create some sort of chatbot experience, either in Slack or
>>    nested in the docs. This would likely outperform training directly on
>>    Slack data, as answers in Slack aren't verified or curated to the most
>>    concise form possible.
>>    4. The Slack and Tabular Discourse forum would be public to read, so
>>    this would allow for other companies in the space to build their own
>>    solutions.
>>
>>
>> The idea is that we would run a daily job that would export the Slack
>> logs to some public dumping ground (GitHub or something) to store this
>> dataset. Again, only public data that you could see if you signed up and
>> logged into Slack would be exposed.
>>
>> How does this sound to everyone? Let me know if you have any questions or
>> other ideas!
>>
>> Bits
>>
>

Re: Discussion about the location of language clients

2023-08-10 Thread Brian Olsen

Renjie, you're amazing.

I think you summarized this better than I could, so thank you for that.

I'd like to pull in a user's feedback from Slack:

> FWIW, I’m personally a fan of separate repos for the client libraries.
> It keeps things a bit more isolated (in a good way) and explorable
> (rather than overwhelming). GitHub search is a bit easier to use. And I
> think it generally lowers the bar to contributing. Independent versioning,
> and GitHub releases are a big win too, I think.
>
> Right now, I don’t actually know where to find PyIceberg release notes.
> Would love to see release notes in the GitHub releases for them.


IMO, The most important measurement of success for choosing either of these
options is about making the contributor experience as smooth as possible.

Monorepo has the advantages of one place to look, modeling all changes
across core/clients in a single PR, and sharing resources. At
first, I considered the build to be a problem only for the Iceberg
committers managing it, but ultimately this is setting us up for a
longer build and running unnecessary infrastructure for unrelated tasks.
There are definitely ways that we can verify what parts of the code have
been changed and which code should be run, but it will not always be clear
or simple to know if we tested too much or not enough.
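
To illustrate what I mean by checking which code changed, here is a minimal sketch of path-based test selection. The directory names and suite names are hypothetical and do not describe our actual CI setup:

```python
from pathlib import PurePosixPath

# Hypothetical mapping from a top-level directory to the CI suite it needs.
SUITES_BY_DIR = {"python": "python-ci", "rust": "rust-ci", "go": "go-ci"}
# Anything not matched above is assumed to belong to the Java build.
DEFAULT_SUITE = "java-ci"


def suites_for_changes(changed_files):
    """Return the sorted list of CI suites to run for a set of changed paths."""
    suites = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Files at the repository root have no top-level directory.
        top_level = parts[0] if len(parts) > 1 else ""
        suites.add(SUITES_BY_DIR.get(top_level, DEFAULT_SUITE))
    return sorted(suites)
```

Even in this toy version the problem shows up: a change to a shared file at the repo root only triggers the default suite, even if a client depends on it, so we would be testing too little without knowing it.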

For that reason, I am also in the multi-repo camp (for clients). Despite
having to manage a different repo for each client, I generally consider the
work of each client to be independent of the work happening in the main
repo. In this view, it's possibly better that the work be independent and
seen on its own. The biggest win IMO is the intentional separation of
testing and deployment infrastructure. This will make for a better
experience when folks are contributing, testing, and looking for release
notes.

But I also really don't care as long as we do the same things across
clients. ;)

Bits


On Thu, Aug 10, 2023 at 2:38 AM Renjie Liu  wrote:

> Hi, all:
>
>
>
> In yesterday’s community sync we talked about the location of different
> language clients, and I think we all agree that there should be consistent
> behavior for these clients, but the decision has not been made yet. I want
> to continue the discussion here on the pros and cons of different sides:
> mono repo(all in one big repo) or multi small repos( one for each language
> client)
>
>
>
> To make things clear, currently we have four language libraries under
> development:
>
>
>
>    1. Java: in main repo (https://github.com/apache/iceberg)
>    2. Python: in main repo (https://github.com/apache/iceberg)
>    3. Go: in main repo (https://github.com/apache/iceberg)
>    4. Rust: in standalone repo (https://github.com/apache/iceberg-rust/)
>
>
>
> Currently I mainly contribute rust client and I can share the thoughts on
> why I voted for standalone repo:
>
>
>
>    1. Easier project setup. Iceberg is a complex project with several
>    components, and mainly written in Java. As someone not quite familiar
>    with this project structure, I find it easier to start a new one rather
>    than fitting into an existing one.
>    2. Faster CI workflow. In the early days of the Rust client’s
>    development, we only need to touch Rust-related code. If we all live in
>    one mono repo, it will trigger unnecessary CI runs for other components.
>
>
>
> I admit that these reasons may not hold for long-term maintenance, but
> it’s good for fast-paced development in the early days.
>
>
>
> After reviewing some discussions on the web, I have a summary about the
> pros and cons of two sides:
>
>
>
> Mono Repo
>
>
>
> Pros
>
>    - *Visibility and transparency*. It would be easier to follow the
>    progress of all clients, and PRs can get more reviews and attention.
>    - *Easier sharing of resources*. It would be easier to share resources
>    for integration tests.
>
> Cons
>
>    - *Increases complexity of project structure*. The project structure
>    would be more complex when coupling different languages and toolchain
>    setups.
>    - *Longer build/CI time*. Unnecessary CI checks may be triggered for
>    small PRs in different languages.
>
>
>
> Multi Repo
>
>
>
> Pros
>
>    - *Simplifies project structure*. Different languages may have different
>    toolchains and project setups; one repo per language makes the project
>    structure easier to understand and follow.
>    - *Independent versioning and releases*. Different languages may have
>    different versioning and release processes. It’s also possible in a
>    monorepo, but I guess it would be easier in standalone repos.
>    - *Improved build/CI time*. No unnecessary CI checks will be triggered.
>
> Cons
>
>    - *Difficult to track the overall progress*. Multiple repos make it
>    harder to track what’s happening in different teams.
>    - *Difficult to share common resources*. It may be more difficult to
>    share resources and do integration tests across different languages.
>
>
>
>
>
> Welcome to

Re: [PROPOSAL] Expose Apache Iceberg Slack data for hacking/education for the community

2023-08-14 Thread Brian Olsen

We could use both if the ASF infra team wants to export that as well. I
think the more good conversations we can centralize and curate the better.

That's especially true if older, outdated conversations can point to new
documentation or newer relevant conversations as the software evolves.

The only way we’ll get there is by really building a community of
moderators to maintain this more ad hoc knowledge graph.

On Mon, Aug 14, 2023 at 11:23 AM Ryan Blue  wrote:

> +1 for letting people use this dataset.
>
> Austin, we originally used the ASF Slack, but decided to move for other
> reasons (more channels, easier signup). And having history isn't actually
> helping, since the information that people need is still only available in
> Slack, which doesn't have very effective search.
>
> On Mon, Aug 14, 2023 at 8:23 AM Austin Bennett  wrote:
>
>> Had you considered using the ASF's slack?  That keeps history
>>
>> On Tue, Aug 8, 2023 at 3:05 PM Brian Olsen 
>> wrote:
>>
>>> Good point, it looks like the main thing Slack's TOS
>>> <https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Salesforce_MSA.pdf?_gl=1*1u5n6fj*_ga*MTU2MzM4Mjk5OC4xNjgyNTM4NjIz*_ga_QTJQME5M5D*MTY5MTUzMDE1Mi40Mi4xLjE2OTE1MzA4MzIuMjkuMC4w>
>>>  in
>>> section 3.3 points us to Salesforce's External Facing Services Policy
>>> <https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/policies/ExternalFacing_Services_Policy.pdf>
>>>  which
>>> addresses is the consent for businesses under NDAs on public or shared
>>> channels or private conversations or PII being exported without consent,
>>> and a bunch of other clearly illegal stuff we're not doing.
>>>
>>> I think since this data is public in the sense that anyone with the
>>> publicly available invite can join and read/see display names, we are fine.
>>> Slack has nothing in there about an PMC admin running an export to get
>>> access to the data that's owned by the ASF. So I believe as long as we get
>>> consent from the community and the PMC is okay with it, then we should be
>>> fine from a legal standpoint as long as we don't export private information
>>> like emails or private chats being included in this.
>>>
>>> On Tue, Aug 8, 2023 at 4:53 PM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> I'm +1 as long as Slack TOS are ok with it. We already have full public
>>>> archives of the mailing list and I see slack as just an extension of the
>>>> mailing list.
>>>>
>>>> On Tue, Aug 8, 2023 at 4:18 PM Brian Olsen 
>>>> wrote:
>>>>
>>>>> Hey Iceberg Nation,
>>>>>
>>>>> I wanted to propose having the public Apache Iceberg Slack
>>>>> <https://apache-iceberg.slack.com/> chat and user data for the
>>>>> community to use as a public data source. I have a couple of specific use
>>>>> cases in mind that I would like to use it for, hence what brought me to 
>>>>> ask
>>>>> about it.
>>>>>
>>>>> The main problem I want to address for the community is the lack of
>>>>> persistence of the answers we're generating in Slack. Slack is on a free
>>>>> version that only retains the last 60 days of valuable threads happening
>>>>> there. Questions are repeatedly asked, and this takes up time for everyone
>>>>> in the community to answer the same questions multiple times. If we 
>>>>> publish
>>>>> the public chat and user data (i.e. no emails or user info outside of
>>>>> what's displayed in Slack), then we can address this in the following 
>>>>> ways:
>>>>>
>>>>>1. We can use this as a getting started tutorial
>>>>>featuring pyIceberg is to pull this dataset into a python or SQL 
>>>>> ecosystem
>>>>>to learn about Iceberg, but also to discover old conversations that no
>>>>>longer appear on Slack. We can also take the raw data and push it into 
>>>>> a
>>>>>local chatbot for folks to ask questions locally, build analytics 
>>>>> projects
>>>>>etc...
>>>>>2. For those that are less interested in building your own chatbot
>>>>>or data pipeline, once this data is available, Tabular could use it to
>>>>>build and maintain a Discourse Forum <https://discourse

Re: [PROPOSAL] Preparing first Apache Iceberg Summit

2023-08-23 Thread Brian Olsen

Out of curiosity, is anyone strongly opposed to doing antics like this for
summits?

https://youtube.com/playlist?list=PLFnr63che7wYFsknFAqisURvfm96rW0Dr


On Mon, Aug 21, 2023 at 6:58 PM Matt Topol  wrote:

> I don't think I'll have much time to contribute to help, but I would
> absolutely help if possible.
>
> That said, I'll definitely want to give a talk / speak at this summit when
> it happens :)
>
> On Mon, Aug 21, 2023 at 1:38 AM Jean-Baptiste Onofré 
> wrote:
>
>> Hi guys,
>>
>> I'm back from vacation and I'm resuming the work on the Iceberg Summit
>> proposal doc. I will share the doc asap.
>>
>> Regards
>> JB
>>
>> On Wed, Jul 5, 2023 at 4:37 PM Jean-Baptiste Onofré 
>> wrote:
>> >
>> > Hi everyone,
>> >
>> > I started a discussion on the private mailing list, and, as there are
>> > no objections from the PMC members, I'm moving the thread to the dev
>> > mailing list.
>> >
>> > I propose to organize the first Apache Iceberg Summit \o/
>> >
>> > For the format, I think the best option is a virtual event with a mix
>> of:
>> > 1. Dev community talks: architecture, roadmap, features, use in
>> "products", ...
>> > 2. User community talks: companies could present their use cases, best
>> > practices, ...
>> >
>> > In terms of organization:
>> > 1. no objection so far from the PMC members to use Apache Iceberg
>> > Summit name. If it works for everyone, I will send a message to the
>> > Apache Publicity & Marketing to get their OK for the event.
>> >  2. create two committees:
>> >   2.1. the Sponsoring Committee gathering companies/organizations
>> > wanting to sponsor the event
>> >   2.2. the Program Committee gathers folks from the Iceberg community
>> > (PMC/committers/contributors) to select talks.
>> >
>> > My company (Dremio) will “host” the event - i.e., provide funding, a
>> > conference platform, sponsor logistics, speaker training, slide
>> > design, etc..
>> >
>> > In terms of dates, as CommunityOverCode Con NA will be in October, I
>> > think January 2024 would work: it gives us time to organize smoothly,
>> > promote the event, and not in a rush.
>> >
>> > I propose:
>> > 1. to create the #summit channel on Iceberg Slack.
>> > 2. I will share a preparation document with a plan proposal.
>> >
>> > Thoughts ?
>> >
>> > Regards
>> > JB
>>
>

Re: September board report

2023-09-08 Thread Brian Olsen

Hey JB, let me know if you need any help here.

On Thu, Sep 7, 2023 at 11:01 PM Jean-Baptiste Onofré 
wrote:

> Hi Ryan
>
> It looks good to me. About the conference (summit/meetup), I will add
> as soon as we have concrete plans (still working on the Summit
> proposal doc).
>
> Thanks !
>
> Regards
> JB
>
> On Thu, Sep 7, 2023 at 6:30 PM Ryan Blue  wrote:
> >
> > Hi everyone,
> >
> > Here’s my draft for the September Iceberg board report. Let me know if
> you’d like to add anything!
> >
> > I know that JB wanted to add conference talks last time, but I’m not
> aware of any that have happened this quarter. If you’ve given a talk
> recently, please let me know!
> >
> > Ryan
> >
> > Description:
> >
> > Apache Iceberg is a table format for huge analytic datasets that is
> designed
> > for high performance and ease of use.
> >
> > Project Status:
> >
> > Current project status: Ongoing
> > Issues for the board: none
> >
> > Membership Data:
> >
> > Apache Iceberg was founded 2020-05-19 (3 years ago)
> > There are currently 24 committers and 16 PMC members in this project.
> > The Committer-to-PMC ratio is 3:2.
> >
> > Community changes, past quarter:
> >
> > No new PMC members. Last addition was Szehon Ho on 2023-04-20.
> > No new committers. Last addition was Amogh Jahagirdar on 2023-04-25.
> >
> > Project Activity:
> >
> > Releases:
> >
> > PyIceberg 0.4.0 was released on 2023-07-23
> > 1.3.1 was released on 2023-07-25
> >
> > Java:
> >
> > Preparing for a 1.4.0 release in Sept/Oct
> > Added dependency bundles for AWS, GCP, and Azure
> > Added Azure FileIO implementation
> > Added API for multi-table commits
> > Performance optimizations for delete file scan planning
> > Spark: Implemented adaptive split sizing
> > Spark: Implemented function pushdown in v2 expressions
> > Flink: Added bucketing only key-by strategy
> > Build: Updated to Gradle version catalog
> > Making progress on the reference implementation of common views
> > Continuing work on table encryption
> >
> > Python:
> >
> > 0.5.0 rc1 vote is under way
> > Added support for serverless environments
> > Implemented schema evolution
> > Moved to Pydantic v2
> > Added support for positional deletes
> > Substantially improved Avro read performance
> > Added conversion from Parquet to Iceberg schemas
> > Added support for FSSpec and HDFS data
> > Added SQL filter parsing
> >
> > Rust:
> >
> > Created a repository for the Rust implementation, iceberg-rust
> > 25 PRs merged
> > Implemented base table metadata (e.g., types, transforms)
> > Implemented visitors for working with nested structures
> > Added Avro/Iceberg schema conversion
> > Added build tooling
> >
> > Go:
> >
> > Created a repository for the Go implementation, iceberg-go
> > Added schema and types
> >
> > Community Health:
> >
> > The largest development in the community is the addition of the Rust and
> Go
> > repositories, which is shown in the increase in code contributors this
> quarter.
> > The new implementations will also lead to new committers and PMC
> members. The
> > community has had good discussions about how to manage contributions, to
> build
> > confidence in the implementations as well as to help new contributors
> become
> > familiar with the way the Apache community operates. (Along with ASF
> > requirements like license documentation.)
> >
> > Two community metrics show decreases. Dev list traffic tends to vary
> because of
> > how the community uses the dev list — that is, mostly for large design
> > discussions. The number of issues closed was also lower than normal and
> is not
> > expected to fluctuate. We will take a look and see what the difference
> is.
> >
> > --
> > Ryan Blue
> > Tabular
>

Re: i want to subscribe

2023-09-13 Thread Brian Olsen

Hey Ted,

You need to send this email to dev-subscr...@iceberg.apache.org; you sent
it to the main mailing list. Any future emails should go to the dev@
address, though.

On Wed, Sep 13, 2023 at 9:36 AM ted lin  wrote:

> i want to subscribe
>

Re: Proposal to fix the docs - this time it'll be different

2023-09-28 Thread Brian Olsen

Hey All,

I know it's been a while, but the first phase of the docs refactor has
landed. I think it's at a decent point for everyone to take a look. To be
clear, this is not going to replace the existing website yet. The goal is to
get the first large batch of new docs in as an initial proof of concept for
the build, and then make incremental changes until we are comfortable making
the swap. Once this is in and 1.4.0 goes out, I'll have to retroactively
create tags for each prior version of the documentation. While that's
happening, we can have someone else work on the look and feel of the
website so that it looks closer to our current site.

https://github.com/apache/iceberg/pull/8659

Thanks! Let me know if you have any questions!

- Bits

On Thu, Jul 27, 2023 at 4:10 PM Szehon Ho  wrote:

> Hi
>
> I'm ok with putting things back in Iceberg repo, it gets more visibility
> on prs.  I guess it used to be a bit distracting, but now with more
> projects in Iceberg (pyiceberg, rust) we have to anyway use tags to filter
> through all the mails.
>
> Just wanted to +1 on Fokko/Ryan suggestion to avoid versioned doc
> directories, I had a lot of difficulties in this part doing the last
> release: https://github.com/apache/iceberg/issues/8151 , as did Anton
> when I consulted him offline.
>
> For me, replacing the 'latest' branch with a tag would be the biggest win
> as it caused me the most trouble.  If we can avoid versioned docs and use
> tags across the board, that would be even better, I do think all the
> versions are already tagged in Github on every release, if that is your
> question?
>
> Thanks,
> Szehon
>
> On Thu, Jul 27, 2023 at 2:31 AM Brian Olsen 
> wrote:
>
>> Thanks Fokko,
>>
>> Yeah, I think to address that we would need to switch to a tagging scheme
>> that prefixes the different project name as a namespace within the tags
>> space (e.g. pyIceberg-0.4.0, rust-0.0.1, etc…). But certainly this would
>> result in an explosion of tags as we continue to introduce more projects.
>> I’m not sure this makes things difficult to find, though: as long as you
>> start your search with the prefix in GitHub, it should be easy enough to
>> find. Has anyone else worked on a project where this type of tagging is
>> applied? Are there any performance, searching, or other implications we
>> are missing?
>>
>> Bits
>>
>> On Thu, Jul 27, 2023 at 4:18 AM Fokko Driesprong 
>> wrote:
>>
>>> Hey Brian,
>>>
>>> Thanks for raising this. As a release manager, I can confirm that the
>>> current structure is confusing, and I can also see the community
>>> struggling with this because they are willing to contribute to the docs,
>>> but cannot always find the place where to do this. I think the complexity
>>> of the current website mostly comes from the versioned docs. It would be
>>> great if we can find a way to make this easier. Instead of using the
>>> branches, we could also use the release tags and build the docs for those
>>> versions.
>>>
>>> I think switching to mkdocs-material is a great idea. We currently also
>>> use this for PyIceberg, and it works really well. My main concern is around
>>> merging everything together. Should we combine Java and Python in the same
>>> documentation? They have a different versioning scheme, so that would
>>> create a matrix of versions. Go and Rust
>>> <https://github.com/apache/iceberg-rust/issues/8> are also in the
>>> making, so that would explode at some point.
>>>
>>> Cheers, Fokko
>>>
>>> Ps. Currently, PyIceberg uses the gh-pages branch for publishing the
>>> docs <https://github.com/apache/iceberg/tree/gh-pages>.
>>>
>>>
>>> Op do 27 jul 2023 om 00:04 schreef Brian Olsen >> >:
>>>
>>>> Hey all,
>>>>
>>>> I have some proposals I'd like to make for fixing the docs. I would want
>>>> to do this in two phases.
>>>>
>>>> In the first phase, I'm proposing that we move all the documentation
>>>> (reference docs, website, and pyIceberg) back into the apache/iceberg
>>>> repository. I explain my reasoning in the attached document. This phase
>>>> would also move us from Hugo to MkDocs but keep all the content the same.
>>>>
>>>> The second phase is focused on iteratively building out the content
>>>> that we've marked missing in the proposal that Sam R. created along
>>>> with a recent community member, Mahfuza. We will also restructure the
>>>> content to follow the Diátaxis method (https://diataxis.fr/).
>>>>
>>>>
>>>> https://docs.google.com/document/d/1WJXzcwC6isfoywcLY2lZ9gZN6i0JU1I2SIZmCkumbZc/edit#heading=h.gli9mc2ghfz1
>>>>
>>>> Let me know what you think and bring on the questions and criticisms
>>>> please! :)
>>>>
>>>> Bits
>>>>
>>>

Re: [DISCUSSION] Rename master branch as main for the main repository

2023-09-29 Thread Brian Olsen

+1000

Let me know how I can help!

On Fri, Sep 29, 2023 at 7:35 AM Jean-Baptiste Onofré 
wrote:

> Hi guys,
>
> The Apache CoC (https://www.apache.org/foundation/policies/conduct)
> especially contains section 5 about the wording we use. Several Apache
> projects renamed the master branch to the main branch (Apache Karaf,
> ActiveMQ, Airflow, ...).
> As we already use main for go, rust, and python repositories, I wonder
> (for consistency) if we should not rename master to main on the "main"
> repository.
>
> Apache INFRA can do this "smoothly" but we would have to do some changes:
> - update build.gradle
> - update README.md
> - update to GH Actions (in .github/workflows/*)
>
> Thoughts ?
>
> Regards
> JB
>

Re: Migration of PyIceberg to iceberg-python repository

2023-09-29 Thread Brian Olsen

+1

Great work Fokko!

Pucheng,

We still want to maintain all of the issues in the Python repository. The
one thing we will lose is pull requests, but I assume there are very few.

On Fri, Sep 29, 2023 at 10:34 AM Pucheng Yang 
wrote:

> Thanks for doing this. I wonder how do we deal with all the issues filed
> for python module but still open in iceberg repo?
>
> On Fri, Sep 29, 2023 at 7:55 AM Eduard Tudenhoefner 
> wrote:
>
>> +1 on moving to a separate repo and maintaining git history
>>
>> On Fri, Sep 29, 2023 at 3:30 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Awesome, it looks even better ;)
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On Fri, Sep 29, 2023 at 2:31 PM Fokko Driesprong 
>>> wrote:
>>> >
>>> > Hey Ajantha,
>>> >
>>> > That's a great suggestion. I've followed the steps and created a new
>>> PR here: https://github.com/apache/iceberg-python/pull/3
>>> >
>>> > The subdirectory-filter command moves a subdirectory to the root
>>> directory. This way I still had to add some files afterward (.github/*,
>>> .gitignore, etc.), these are in a separate commit. Please take a look.
>>> >
>>> > Thanks,
>>> >
>>> > Fokko
>>> >
>>> > Op vr 29 sep 2023 om 13:39 schreef Ajantha Bhat >> >:
>>> >>
>>> >> I think we are gonna lose the history of commits if we merge the
>>> above PR.
>>> >>
>>> >> There are ways to move the subfolder into a new repo by retaining
>>> commit history.
>>> >> For example:
>>> >> -
>>> https://medium.com/@ayushya/move-directory-from-one-repository-to-another-preserving-git-history-d210fa049d4b
>>> >> - https://gist.github.com/trongthanh/2779392
>>> >>
>>> >> Please give it a try.
>>> >>
>>> >> Thanks,
>>> >> Ajantha
>>> >>
>>> >> On Fri, Sep 29, 2023 at 4:55 PM Fokko Driesprong 
>>> wrote:
>>> >>>
>>> >>> Hey everyone 👋
>>> >>>
>>> >>> A while ago we discussed that Rust and Go are going into a separate
>>> repository:
>>> https://lists.apache.org/thread/4s02lmwf1kyrxxdpj3q9w2fqnxq2llbn
>>> >>>
>>> >>> Since we just did the PyIceberg 0.5.0 release, I think it is a good
>>> moment to migrate PyIceberg to iceberg-python as well:
>>> https://github.com/apache/iceberg-python/pull/2 I went over the PRs
>>> that are ready to merge and got them in. If there is anything missing,
>>> please let me know.
>>> >>>
>>> >>> I would suggest merging the PR and leaving the source code in the
>>> main repository for another week or so to make sure that we didn't miss
>>> anything.
>>> >>>
>>> >>> Since PyIceberg now also hosts the docs on the Github pages of the
>>> Iceberg repository, moving PyIceberg will also free up the Github pages for
>>> the migration of the docs back into the main repository.
>>> >>>
>>> >>> Let me know if there are any concerns.
>>> >>>
>>> >>> Kind regards,
>>> >>> Fokko Driesprong
>>>
>>
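
For anyone following along, the history-preserving extraction Fokko
describes can be sketched roughly as below. This assumes the approach is
git filter-branch with --subdirectory-filter (as in the guides Ajantha
linked); it is not necessarily the exact set of commands he ran. The helper
name extract_subdir_history is hypothetical, and because it rewrites
history it should only ever be pointed at a throwaway clone.

```shell
# Hypothetical helper: rewrite a throwaway clone so that the contents of
# one subdirectory become the repository root, keeping only the commits
# that touched that subdirectory.
extract_subdir_history() {
    repo_dir="$1"   # path to a disposable clone (its history is rewritten!)
    subdir="$2"     # subdirectory to promote to the root, e.g. python

    # filter-branch prints a long warning and pauses unless this is set.
    FILTER_BRANCH_SQUELCH_WARNING=1 \
        git -C "$repo_dir" filter-branch --subdirectory-filter "$subdir" -- --all
}
```

Files that lived outside the subdirectory (.github/*, .gitignore, etc.) are
gone after the rewrite, which is why they had to be added back in a
separate commit.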

Re: Migration of PyIceberg to iceberg-python repository

2023-09-30 Thread Brian Olsen

This shouldn’t be too hard and can likely be a nightly build that runs in
each client repository.

We’re already planning on doing the documentation using git submodule to
pull all the documentation under a single build in the central repo. We can
likely go the other direction to run client-core integration tests. I
prefer these go on the client end to avoid too much CI running on the core
repo. We also have to consider that whatever we choose to do with the Python
client will also apply to Go, Rust, and any future client. Happy to hear
alternatives though!

WDYT Fokko?



On Sat, Sep 30, 2023 at 7:12 AM Hussein Awala  wrote:

> +1
>
> I checked the discussion thread, and one of the motivations for this
> separation was to avoid triggering unrelated CI jobs after each change.
> However, I wonder if it isn't (and will not be) necessary to check the
> compatibility between the main repository and the client after each change.
> Otherwise, we will need to trigger the CI across the different repositories
> using the GHA API, not necessarily to block the PR, but just to give quick
> feedback and notification that something needs to be changed on the client
> side.
>
> On Fri, Sep 29, 2023 at 9:39 PM Brian Olsen 
> wrote:
>
>> +1
>>
>> Great work Fokko!
>>
>> Pucheng,
>>
>> We still want to maintain all of the issues in the Python repository. The
>> one thing we will lose is pull requests, but I assume there are very few.
>>
>> On Fri, Sep 29, 2023 at 10:34 AM Pucheng Yang 
>> wrote:
>>
>>> Thanks for doing this. I wonder how do we deal with all the issues filed
>>> for python module but still open in iceberg repo?
>>>
>>> On Fri, Sep 29, 2023 at 7:55 AM Eduard Tudenhoefner 
>>> wrote:
>>>
>>>> +1 on moving to a separate repo and maintaining git history
>>>>
>>>> On Fri, Sep 29, 2023 at 3:30 PM Jean-Baptiste Onofré 
>>>> wrote:
>>>>
>>>>> Awesome, it looks even better ;)
>>>>>
>>>>> Thanks !
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Fri, Sep 29, 2023 at 2:31 PM Fokko Driesprong 
>>>>> wrote:
>>>>> >
>>>>> > Hey Ajantha,
>>>>> >
>>>>> > That's a great suggestion. I've followed the steps and created a new
>>>>> PR here: https://github.com/apache/iceberg-python/pull/3
>>>>> >
>>>>> > The subdirectory-filter command moves a subdirectory to the root
>>>>> directory. This way I still had to add some files afterward (.github/*,
>>>>> .gitignore, etc.), these are in a separate commit. Please take a look.
>>>>> >
>>>>> > Thanks,
>>>>> >
>>>>> > Fokko
>>>>> >
>>>>> > Op vr 29 sep 2023 om 13:39 schreef Ajantha Bhat <
>>>>> ajanthab...@gmail.com>:
>>>>> >>
>>>>> >> I think we are gonna lose the history of commits if we merge the
>>>>> above PR.
>>>>> >>
>>>>> >> There are ways to move the subfolder into a new repo by retaining
>>>>> commit history.
>>>>> >> For example:
>>>>> >> -
>>>>> https://medium.com/@ayushya/move-directory-from-one-repository-to-another-preserving-git-history-d210fa049d4b
>>>>> >> - https://gist.github.com/trongthanh/2779392
>>>>> >>
>>>>> >> Please give it a try.
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Ajantha
>>>>> >>
>>>>> >> On Fri, Sep 29, 2023 at 4:55 PM Fokko Driesprong 
>>>>> wrote:
>>>>> >>>
>>>>> >>> Hey everyone 👋
>>>>> >>>
>>>>> >>> A while ago we discussed that Rust and Go are going into a
>>>>> separate repository:
>>>>> https://lists.apache.org/thread/4s02lmwf1kyrxxdpj3q9w2fqnxq2llbn
>>>>> >>>
>>>>> >>> Since we just did the PyIceberg 0.5.0 release, I think it is a good
>>>>> moment to migrate PyIceberg to iceberg-python as well:
>>>>> https://github.com/apache/iceberg-python/pull/2 I went over the PRs
>>>>> that are ready to merge and got them in. If there is anything missing,
>>>>> please let me know.
>>>>> >>>
>>>>> >>> I would suggest merging the PR and leaving the source code in the
>>>>> main repository for another week or so to make sure that we didn't miss
>>>>> anything.
>>>>> >>>
>>>>> >>> Since PyIceberg now also hosts the docs on the Github pages of the
>>>>> Iceberg repository, moving PyIceberg will also free up the Github pages
>>>>> for the migration of the docs back into the main repository.
>>>>> >>>
>>>>> >>> Let me know if there are any concerns.
>>>>> >>>
>>>>> >>> Kind regards,
>>>>> >>> Fokko Driesprong
>>>>>
>>>>

Re: [DISCUSSION] Rename master branch as main for the main repository

2023-10-02 Thread Brian Olsen

As with any of these changes, the one and only inescapable side-effect is
that users' local environments will not be updated automatically. GitHub has
otherwise made it very simple to rename branches to accommodate this use
case. https://github.com/github/renaming Any old references to master on
the GitHub site itself will reroute to main.

It's a small annoyance to make the Iceberg community more inclusive. For
those that aren't aware of the why:
https://en.wikipedia.org/wiki/Master/slave_(technology)#Terminology_concerns
.

On Mon, Oct 2, 2023 at 4:34 PM Hussein Awala  wrote:

> +1
>
> On Mon, Oct 2, 2023 at 11:27 PM Anton Okolnychyi 
> wrote:
>
>> +1
>>
>> On 2023/10/02 20:12:37 Bryan Keller wrote:
>> > Hearty +1 from me
>> >
>> > > On Sep 29, 2023, at 5:37 AM, Brian Olsen wrote:
>> > >
>> > > +1000
>> > >
>> > > Let me know how I can help!
>> > >
>> > > On Fri, Sep 29, 2023 at 7:35 AM Jean-Baptiste Onofré
>> > > <j...@nanthrax.net> wrote:
>> > >
>> > >> Hi guys,
>> > >>
>> > >> The Apache CoC (https://www.apache.org/foundation/policies/conduct)
>> > >> especially contains section 5 about the wording we use. Several Apache
>> > >> projects renamed the master branch to the main branch (Apache Karaf,
>> > >> ActiveMQ, Airflow, ...).
>> > >> As we already use main for go, rust, and python repositories, I wonder
>> > >> (for consistency) if we should not rename master to main on the "main"
>> > >> repository.
>> > >>
>> > >> Apache INFRA can do this "smoothly" but we would have to do some changes:
>> > >> - update build.gradle
>> > >> - update README.md
>> > >> - update to GH Actions (in .github/workflows/*)
>> > >>
>> > >> Thoughts ?
>> > >>
>> > >> Regards
>> > >> JB
>> > >
>> >
>>
>

Re: [DISCUSSION] Rename master branch as main for the main repository

2023-10-12 Thread Brian Olsen

But also, don’t forget to save any commits you have on master locally or on
your fork! :)
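
A minimal sketch of one way to do that, assuming the work lives on a local
branch named master inside an ordinary clone. The helper name
save_master_work and the backup branch name are made up for illustration;
they are not part of JB's steps.

```shell
# Hypothetical helper: copy the local master branch to a backup branch
# so that a later "git branch -D master" cannot drop unpushed commits.
save_master_work() {
    repo_dir="$1"   # path to the local clone
    backup="$2"     # name for the backup branch (placeholder)

    # Nothing to save if there is no local master branch.
    if ! git -C "$repo_dir" show-ref --verify --quiet refs/heads/master; then
        echo "no local master branch in $repo_dir"
        return 0
    fi

    # Create the backup branch pointing at the same commit as master.
    git -C "$repo_dir" branch "$backup" master
}
```

Running something like save_master_work . my-saved-work before deleting
master keeps those commits reachable, so they can be rebased or
cherry-picked onto main afterwards.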

Thanks for coordinating this JB!!

On Thu, Oct 12, 2023 at 8:42 AM Jean-Baptiste Onofré 
wrote:

> By the way, don't forget to update your local git repo:
>
> git fetch --all
> git checkout main
> git branch -D master
>
> And you are good to go :)
>
> Regards
> JB
>
> On Thu, Oct 12, 2023 at 3:37 PM Jean-Baptiste Onofré 
> wrote:
> >
> > Hi guys
> >
> > I'm pleased to announce that the master branch has been renamed to main.
> >
> > https://github.com/apache/iceberg
> >
> > You can see that all PRs are now based on main (it's completely
> transparent).
> >
> > We can now merge the corresponding PR
> > (https://github.com/apache/iceberg/pull/8722). I will work with the
> > team to merge it asap.
> >
> > Thanks !
> > Regards
> > JB
> >
> > On Fri, Sep 29, 2023 at 2:35 PM Jean-Baptiste Onofré 
> wrote:
> > >
> > > Hi guys,
> > >
> > > The Apache CoC (https://www.apache.org/foundation/policies/conduct)
> > > especially contains section 5 about the wording we use. Several Apache
> > > projects renamed the master branch to the main branch (Apache Karaf,
> > > ActiveMQ, Airflow, ...).
> > > As we already use main for go, rust, and python repositories, I wonder
> > > (for consistency) if we should not rename master to main on the "main"
> > > repository.
> > >
> > > Apache INFRA can do this "smoothly" but we would have to do some
> changes:
> > > - update build.gradle
> > > - update README.md
> > > - update to GH Actions (in .github/workflows/*)
> > >
> > > Thoughts ?
> > >
> > > Regards
> > > JB
>

Re: Feedback on Iceberg Materialized View Spec

2023-10-26 Thread Brian Olsen

Hey JB,

I totally agree we need a place to centralize this, but I'm not a huge fan
of all the lists we currently have going on the site. SSGs are just not an
accessible method of storing lists (roadmap, blogs, videos, etc.).

The roadmap is barely touched for this reason. I want to propose we move
the roadmap to GitHub Projects.

Likewise, I feel like somewhere on GitHub might be a better location for
this type of thing.

Maybe posting these in GitHub issues and adding a proposal label?

On Tue, Oct 24, 2023 at 9:28 AM Jean-Baptiste Onofré 
wrote:

> Hi Jan
>
> Thanks for the reminder. I will take a look.
>
> As proposed by Renjie a few days ago, it would be great to
> gather/store all document proposals in a central place.
>
> If there are no objections, I will prepare a PR for the website about
> that (with a space listing/linking all proposals).
>
> Regards
> JB
>
>
>
> On Tue, Oct 24, 2023 at 9:22 AM Jan Kaul 
> wrote:
> >
> > Hi all,
> >
> > I've created an issue to propose a design for a Materialized View Spec a
> while ago. After further discussion we reached a first draft for the spec.
> It would be great if you could have another look at the design and share
> your feedback.
> >
> > Here is the google doc:
> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing
> >
> > Thanks in advance,
> >
> > Jan
>

Re: Feedback on Iceberg Materialized View Spec

2023-10-26 Thread Brian Olsen

GitHub Discussions could be a solution that we should consider. We used it
on the Trino side but still have mixed results with it. On one hand,
there's a lot of overlap between creating Issues and Discussions. In fact,
GitHub allows you to convert Issues that only involve discussing a topic,
or that can't immediately be tied to any upcoming work, into Discussions.
This keeps the Issue backlog focused on actionable requests.

That said, Discussions can become difficult to maintain if no person or
body of people drives it. Of course, the community will drive it to some
degree, especially when it's new and shiny, but GitHub Discussions, much
like Slack, becomes a support channel that encourages the messy human
interactions that help us arrive at a solution. So the question is: do we
want to open Discussions knowing that it may become a second support
channel alongside Slack? Would we want to use Discussions in place of
Slack so that there's still a single triage channel?

I personally lean towards keeping a single real-time "support-like" channel
in the community, otherwise, you will fragment the attention of the
community. Most of what we would need to support the centralization of
proposals can be accomplished with Issues. Slack still seems to be the
dominant interactive system of choice and where we are now so I wouldn't
suggest moving that. I do think this is worth a discussion at the next sync
so I'll add it.

In full transparency, Tabular is building an Iceberg-focused Discourse forum
<https://discourse.org/> (not to be confused with Discord
<https://discord.com/>) instance to centralize community discussions into
wiki-style answers we can link to, with dedicated content curators for
those answers. Think of it as an Iceberg-specific Stack Overflow with
lightened rules to allow more open discussion. Adding GitHub Discussions
wouldn't collide with our goals, as it would become another signal that we
could use to inform the answers on our forum. It still comes back to the
value given the cost for the community to manage it.

I know I have a lot of thoughts around this, and it's because I've been down
this road before, but perhaps there's a nuance I'm not seeing yet.

On Thu, Oct 26, 2023 at 7:15 AM Jean-Baptiste Onofré 
wrote:

> Just to be clear: we can GH Discussions subjects template via
> .asf.yaml but we have to open a ticket to INFRA to enable it.
>
> Regards
> JB
>
> On Thu, Oct 26, 2023 at 1:56 PM Jean-Baptiste Onofré 
> wrote:
> >
> > Hi Brian
> >
> > I like the idea of GitHub. Why not enable (in .asf.yaml) GitHub
> > discussions ? A GitHub Discussion could be a good place to share the
> > doc and exchange both in the doc and in the discussion comments.
> >
> > Regards
> > JB
> >
> > On Thu, Oct 26, 2023 at 1:13 PM Brian Olsen 
> wrote:
> > >
> > > Hey JB,
> > >
> > > I totally agree we need a place to centralize this but I'm not a huge
> fan of all the lists we currently have going on the site. SSGs are just not
> an accessible method of storing lists. ( roadmap, blogs, videos, etc..).
> > >
> > > The roadmap is barely touched for this reason. I want to propose we
> move roadmap to GitHub projects.
> > >
> > > Likewise, I feel like somewhere on GitHub might be a better location
> for this type of thing.
> > >
> > > Maybe posting these in GitHub issues and adding a proposal label?
> > >
> > > On Tue, Oct 24, 2023 at 9:28 AM Jean-Baptiste Onofré 
> wrote:
> > >>
> > >> Hi Jan
> > >>
> > >> Thanks for the reminder. I will take a look.
> > >>
> > >> As proposed by Renjie a few days ago, it would be great to
> > >> gather/store all document proposals in a central place.
> > >>
> > >> If there are no objections, I will prepare a PR for the website about
> > >> that (with a space listing/linking all proposals).
> > >>
> > >> Regards
> > >> JB
> > >>
> > >>
> > >>
> > >> On Tue, Oct 24, 2023 at 9:22 AM Jan Kaul 
> wrote:
> > >> >
> > >> > Hi all,
> > >> >
> > >> > I've created an issue to propose a design for a Materialized View
> Spec a while ago. After further discussion we reached a first draft for the
> spec. It would be great if you could have another look at the design and
> share your feedback.
> > >> >
> > >> > Here is the google doc:
> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing
> > >> >
> > >> > Thanks in advance,
> > >> >
> > >> > Jan
>

Re: Feedback on Iceberg Materialized View Spec

2023-10-26 Thread Brian Olsen

Yeah, unfortunately there's no way to limit the functionality to only
facilitate this. In fact, the product that gets closest to it is GitHub
Issues.

I believe putting the onus on developers deeply involved in the project
makes sense. Expecting users, especially newer users of a newer generation,
to use an email list is unrealistic, especially if they're in discovery
mode and figuring out how to solve an issue. A lot of garnering adoption
from users is lowering every barrier to entry as well as lowering the time
to that first hello world dopamine hit.

I'm a middle millennial, and even I find using email for discussion outside
of my mental model/preference, but I also see the benefits.

On Thu, Oct 26, 2023 at 10:45 AM Jean-Baptiste Onofré 
wrote:

> The idea is really to "square" GH Discussion only to roadmap/design
> proposals.
>
> For "user support", more than Slack, I would love to see
> u...@iceberg.apache.org.
>
> So I would distinguish:
> - the design/spec proposals where we could use GH Discussions. If
> people use GH Discussion for support questions, then we can move to GH
> Issue or direct to the mailing list/slack.
> - the user "support" should be on user mailing list and/or Slack
>
> You have a valid point: GH Discussions could be hard to manage because
> most users will use it as a "support forum".
>
> My point is really:
> - we need a central space for design/spec proposals
> - it has to be on Iceberg community and visible for all
>
> Regards
> JB
>
> On Thu, Oct 26, 2023 at 5:30 PM Brian Olsen 
> wrote:
> >
> > GitHub Discussions could be a solution that we should consider. We used
> it on the Trino side but still have mixed results with it. On one hand,
> there's a lot of overlap between creating Issues and Discussions. In fact,
> GitHub allows you to migrate Issues that only involve discussing a topic,
> or something that can't immediately be tied to any upcoming work to be a
> discussion. This keeps the Issue backlog focused on actionable requests.
> >
> > That said, Discussions can become difficult to maintain if no person or
> body of people drives it. Of course, the community will drive it to some
> degree, especially when it's new and shiny, but GitHub Discussions, much
> like Slack, becomes a support channel that encourages the messy human
> interactions that help us arrive at a solution. So the question is do we
> want to open Discussions knowing that it may become a second support
> channel compared to Slack? Would we want to use Discussions in place of
> Slack so that there's still a single triage channel?
> >
> > I personally lean towards keeping a single real-time "support-like"
> channel in the community, otherwise, you will fragment the attention of the
> community. Most of what we would need to support the centralization of
> proposals can be accomplished with Issues. Slack still seems to be the
> dominant interactive system of choice and where we are now so I wouldn't
> suggest moving that. I do think this is worth a discussion at the next sync
> so I'll add it.
> >
> > In full transparency, Tabular is building an Iceberg-focused Discourse
> forum (not to be confused with Discord) instance to solve the problem of
> centralizing discussions in the community to wiki-style answers we can link
> to and having dedicated content curators to those solutions. Think of it as
> an Iceberg-specific Stack Overflow with lightened rules to allow more open
> discussion. Adding GitHub discussions wouldn't collide with our goals as it
> would become another signal that we could use to inform the answers on our
> forum. It still comes back to the value given the cost for the community to
> manage it.
> >
> > I know I have a lot of thoughts around this and it's because I've been
> down this road before, but perhaps there's a nuance I'm not seeing yet.
> >
> > On Thu, Oct 26, 2023 at 7:15 AM Jean-Baptiste Onofré 
> wrote:
> >>
> >> Just to be clear: we can GH Discussions subjects template via
> >> .asf.yaml but we have to open a ticket to INFRA to enable it.
> >>
> >> Regards
> >> JB
> >>
> >> On Thu, Oct 26, 2023 at 1:56 PM Jean-Baptiste Onofré 
> wrote:
> >> >
> >> > Hi Brian
> >> >
> >> > I like the idea of GitHub. Why not enable (in .asf.yaml) GitHub
> >> > discussions ? A GitHub Discussion could be a good place to share the
> >> > doc and exchange both in the doc and in the discussion comments.
> >> >
> >> > Regards
> >> > JB
> >> >
> >> > On Th

Re: Feedback on Iceberg Materialized View Spec

2023-10-26 Thread Brian Olsen

Agreed, apologies to Jan :). JB, let's discuss this at the sync this Wed,
and after that we can create a new thread if needed.

On Thu, Oct 26, 2023 at 1:38 PM Daniel Weeks  wrote:

> JB and Brian,
>
> I think we should probably move this discussion to a discuss thread
> specifically for the topics you want to address.
>
> We've had a few instances now where the original intent of the thread is
> redirected to talk about other subjects.  I don't feel this is a good
> approach because, while it is on the apache mailing list, the topic of the
> thread doesn't reflect the content, so you don't get the right
> audience/level of engagement or buy-in.
>
> I'm not disagreeing with trying to improve how we communicate and track
> improvements/proposals/etc, but I think we should try to keep the thread on
> topic.
>
> Thanks,
> -Dan
>
> On Thu, Oct 26, 2023 at 9:26 AM Jean-Baptiste Onofré 
> wrote:
>
>> Oh, I don't say we have to provide a user mailing list. Personally, I
>> like mailing list mainly because we have https://lists.apache.org/
>> where we can browse and search on the mailing lists.
>> A lot of Apache projects are using Slack or Zulip, but in parallel of
>> mailing lists. As we say at Apache: "if it doesn't happen on the
>> mailing list, it never happens".
>> That said I would distinguish:
>> - for dev, obviously we can use Slack for discussion, community
>> meetings, etc, but we have to send main topics/discussions on the dev
>> mailing list.
>> - for user, I think Slack is good, but I like the user mailing list,
>> to track/search/async communication as well.
>>
>> That's another discussion anyway, let's focus on the design proposals
>> space: my understanding is that we want to have a space listing all
>> proposals, for review, tagged as "done" or "in progress". Right ?
>> I don't think a forum/stack overflow like would help here (it helps
>> for users, not for dev/technical/design proposals).
>>
>> At Apache Beam, we have a similar page as at Iceberg:
>> https://beam.apache.org/roadmap/ where you can click on roadmap items
>> for details (https://beam.apache.org/roadmap/portability/).
>> So, initially, I proposed to update
>> https://iceberg.apache.org/roadmap/ with proposals (status
>> "discussion").  As most of the proposals (all ?) come as Google Link,
>> we can change a bit the look'n feel of this page including the list of
>> proposals.
>>
>> That could be a first move, we can update later.
>>
>> Regards
>> JB
>>
>> On Thu, Oct 26, 2023 at 5:54 PM Brian Olsen 
>> wrote:
>> >
>> > Yeah, unfortunately there's no way to limit the functionality to only
>> facilitate this. In fact, the product that gets closest to it is GitHub
>> Issues.
>> >
>> > I believe putting the onus on developers deeply involved in the project
>> makes sense. Expecting users, especially newer users of a newer generation
>> will use an email list is unlikely, especially if they're in a discovery
>> mode and figuring out how to solve an issue. A lot of garnering adoption
>> from users is lowering every barrier to entry as well as lowering time to
>> that first hello world dopamine hit.
>> >
>> > I'm middle millennial and even I find using email for discussion
>> outside of my mental model/preference but I also see the benefits.
>> >
>> > On Thu, Oct 26, 2023 at 10:45 AM Jean-Baptiste Onofré 
>> wrote:
>> >>
>> >> The idea is really to "square" GH Discussion only to roadmap/design
>> proposals.
>> >>
>> >> For "user support", more than Slack, I would love to see
>> >> u...@iceberg.apache.org.
>> >>
>> >> So I would distinguish:
>> >> - the design/spec proposals where we could use GH Discussions. If
>> >> people use GH Discussion for support questions, then we can move to GH
>> >> Issue or direct to the mailing list/slack.
>> >> - the user "support" should be on user mailing list and/or Slack
>> >>
>> >> You have a valid point: GH Discussions could be hard to manage because
>> >> most users will use it as a "support forum".
>> >>
>> >> My point is really:
>> >> - we need a central space for design/spec proposals
>> >> - it has to be on Iceberg community and visible for all
>> >>
>> >> Regards
>> >> JB
>> >>
>> &

Re: Community Meeting Minutes ?

2023-10-26 Thread Brian Olsen

Thanks for the reminder here JB. I just created a list to follow for this
process so I don't forget. At some point, I'll add it to the documentation
so that anyone can run this over time. I will share out the last few
meeting minutes in their own threads now.

On Thu, Oct 12, 2023 at 9:03 AM Jean-Baptiste Onofré 
wrote:

> Hi guys,
>
> Thanks for the community meeting yesterday, it was super interesting
> and motivating :)
>
> As we say at Apache: "If it didn't happen on the mailing list, it
> never happened" :)
> In order to give a chance to anyone in the community to see the topics
> and participate, it would be great to share the meeting minutes on the
> mailing list.
>
> I know Brian did that in July. It would be great to do it "systematically".
>
> @Brian do you mind sharing the meeting minutes on the mailing list ?
> Do you need my help to complete/review ?
> Maybe we can add it on the website too ?
>
> Thanks !
> Regards
> JB
>

Meeting Minutes from 2023-07-19 Iceberg Sync

2023-10-26 Thread Brian Olsen

Hey Iceberg Nation,
Everyone is welcome to attend syncs. Subscribe to this calendar to receive
a notification. Note: This meeting note is backdated as I forgot to post it
here earlier.

2023-07-19 (Meeting Recording ⭕ )

   - Highlights
      - PyIceberg 0.4.0 is out
      - Python Avro reads are 18% faster
      - Python concurrency updated for AWS Lambda
      - Added Avro writes to Python
      - Fixed Spark deleteWhere with WAP branch
      - Added registerTable to REST catalog
      - FLIP-27 Flink source switched to JSON parser for FileScanTask
   - Releases
      - Please vote on 1.3.1
      - Java 1.4.0
         - Targeting August for RC
         - Anton volunteered to RM
         - Distributed planning
         - Row-level operation updates: MoR schema pruning, etc.
         - Dynamic pruning stretch goal, mainly targeting MoR
      - Python 0.5.0
   - Discussion
      - View API issues (https://github.com/apache/iceberg/pull/7992)
      - Should Projections take in schema vs spec? Are there issues
        evaluating filters with Time Travel because we use the wrong
        schema? Came up while looking at this issue:
        https://github.com/apache/iceberg/issues/7774
      - Gradle version catalog support
      - Applying spotless for scala code
      - Add Golang Iceberg to Repo?

AI-generated chapter summaries:

0:00 Chapter 1: The team discussed updates and progress on both the Python
and Java sides, including new features, performance improvements, and
upcoming releases. They also talked about the UAPI and the need to
deprecate and move certain interfaces.

10:40 Chapter 2: The team discussed the issue of generated classes
appearing in the API package and decided to break those classes and improve
the generation process in the future. They also discussed the problem of
projections binding expressions to the schema and agreed that passing the
schema to the projections would be a better solution.

21:37 Chapter 3: Eduard raised awareness about updating the dependency
versioning plugin and ensuring compatibility with Dependable. Anton
expressed concerns about applying spotless for Scala code due to
differences with Spark, but agreed to revisit the topic once Spark 3.5 is
released. Matt proposed a Golang implementation of iceberg and discussed
the possibility of integrating it into the main repository, with separate
versioning and considerations for release scripts and CI.

31:52 Chapter 4: Matt and Steven discussed the process of moving the code
into the foundation, including licensing and practical issues. They decided
to start small PRs to get more eyes on the code and build understanding,
with Jacob offering to assist.

42:10 Chapter 5: Matt and Rusty discussed the need for a common
representation of tasks in Arrow and the desire to create a substrate plan
for iceberg scans with pushdown and deletes. They aimed to simplify the
integration of different languages and make querying iceberg tables more
efficient.

51:29 Chapter 6: Matt, Fokko, and others discussed the benefits of
representing plans as substrate plans and the need for correct column
projection in Arrow. They also mentioned the possibility of opening an
issue to coordinate on implementing iceberg column resolution in C++.

Meeting Minutes from 2023-08-30 Iceberg Sync

2023-10-26 Thread Brian Olsen

Hey Iceberg Nation,
Everyone is welcome to attend syncs. Subscribe to this calendar to receive
a notification. Note: This meeting note is backdated as I forgot to post it
here earlier.

2023-08-30 (Meeting Recording ⭕ )

   - Highlights
      - Java: Flink sink adds custom partitioner to better distribute
        traffic for bucket partitioned tables (Thanks, Sergio!)
      - Java: AWS, GCP, and Azure bundles (Thanks, Bryan!)
      - Java: Azure FileIO (Thanks, Bryan!)
      - Java: Delete file in job planning optimizations (Thanks, Anton!)
      - Python: Moved to Pydantic v2 (Thanks, Fokko!)
      - Java: Fixed branches with empty tables (Thanks, ConeyLiu!)
      - Rust: Merged TableMetadata (including (de)serialization) (Thanks,
        Jan!)
      - Go: Schema and types (Thanks, Matt!)
   - Releases
      - PyIceberg 0.5.0
         - Blockers
            - Fixing schema evolution (#8374)
      - Java 1.4.0
         - Blockers
            - Fixing history with lazy snapshot loading
            - V2 tables by default
            - Spark distributed planning
            - Default to zstd
         - Timeline: End of next week
            - Milestone: https://github.com/apache/iceberg/milestone/35
   - Discussion
      - Owned Table Location
        https://docs.google.com/document/d/1pTJPQaHwyO0NFlLcHIrXq4gBazJmAyPnigmOPMbBRR0/edit?usp=sharing
        (Szehon)
      - Multi-arg transform:
        https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit?usp=sharing
        (advancedxy)
      - Iceberg Table Portability - Link
      - Nanosecond timestamp & timestamptz - sufficient consensus, next
        steps?

AI-generated chapter summaries:

0:00 Chapter 1: The team discussed various updates and improvements,
including the addition of a custom partitioner for better data balancing,
the creation of bundles for easier use of cloud services, optimizations in
the delete file job planning path, and progress in the Rust and Go
communities.

10:11 Chapter 2: The team discussed various issues and updates related to
the Java 1.4 release, including the progress on resolving dependencies, the
appointment of Anton as the release manager, and the inclusion of features
like distributed planning in Spark and the use of zstd for new tables. They
also discussed the need for fixing lazy snapshot loading and the challenges
of table location ownership to prevent accidental destruction of tables.

21:02 Chapter 3: The team discussed the issue of orphaned files and the
challenges of table and location ownership. They explored different modes
and approaches to address the problem, including checking for location
existence, assigning table locations based on table names, and implementing
a global orphan file cleanup at the administrator level.

29:58 Chapter 4: The discussion revolved around the ownership and sharing
of table locations in a catalog. They considered different approaches and
debated whether to use table properties or a separate list of owned
locations in the table metadata bundle.

39:12 Chapter 5: Anton, Xianqing, and others discussed the proposal to
include multi-argument transformers in the documentation. They explored the
challenges and benefits of this change, including the need for expression
API modifications and the possibility of adding additional information for
normalization values in transforms.

49:44 Chapter 6: Anton discussed the possibility of supporting additional
files and range partitioning for better performance. The team agreed to
consider implementing custom transforms and table portability in future
discussions.

Meeting Minutes from 2023-09-20 Iceberg Sync

2023-10-26 Thread Brian Olsen

Hey Iceberg Nation,
Everyone is welcome to attend syncs. Subscribe to this calendar to receive
a notification. Note: This meeting note is backdated as I forgot to post it
here earlier.

2023-09-20 (Meeting Recording ⭕)

   - Highlights
      - PyIceberg 0.5.0 has been released 🎉🎉🎉 Thanks everyone for
        contributing!
      - FileIO has been implemented for iceberg-rust, and the catalog is
        almost there
      - Spark 3.5 support was added (Thanks, Anton!)
      - Added support for distributed planning in Spark (Thanks, Anton!)
      - Spark will push down system.iceberg functions to scans (Thanks,
        ConeyLiu!)
      - Added AES GCM encryption and decryption streams (Thanks, Gidon!)
      - Added strict metadata cleanup (Thanks, Amogh!)
      - Vectorized reads for MoR DELETE, UPDATE, MERGE plans
   - Releases
      - Iceberg 1.4.0 – milestone with all pending PRs
         - Spark updates – advisory partition size (PR pending)
         - Spark versions: 3.1 to 3.5?
         - Strict metadata cleanup - yes
         - Use Zstd by default (#8593)
         - Flink credential refresh issue (#8555)
   - Discussion
      - Parquet metrics problem from Trino
      - Defaulting to ResolvingFileIO
      - Discrepancies around null-counts for lists, maps and structs
      - Proposal: Introduce deletion vector file to reduce write
        amplification
      - Nanosecond timestamp & timestamptz - sufficient consensus, next
        steps?
      - Adding an explicit validation API to DeleteFiles which validates
        the files exist when committing the delete.
      - Partition Stats Spec
      - Encryption update


AI-generated chapter summaries:

   - 0:00 Chapter 1: The team discussed the progress and updates in various
     implementations, including support for HDFS, modifications to schemas,
     and the addition of Spark 3.5 support. They also mentioned
     advancements in function pushdown, metadata encryption, and vectorized
     reads for merge commands.
   - 11:36 Chapter 2: Anton and Brajesh discussed the need to change the
     behavior of Spark versions and the default file sizes in Iceberg. They
     also considered reducing the number of supported Spark versions and
     highlighted the work on strict metadata cleanup by Amogh.
   - 17:23 Chapter 3: The team discussed the implementation of strict
     cleanup in Hive to prevent file corruption and agreed to turn it on by
     default. They also discussed using Zstandard (zstd) by default for new
     tables and made changes to the table metadata object and the REST
     catalog to accommodate this.
   - 23:08 Chapter 4: The team discussed an issue with incorrect Iceberg
     metadata coming from Parquet files, where min and max values were not
     being truncated properly. They concluded that there was limited action
     they could take on the Iceberg side and that the underlying issue
     stemmed from Parquet stats not adhering to the Iceberg spec.
   - 29:11 Chapter 5: The team discussed various issues related to metadata
     and FileIO implementation. They considered fixing bugs in Iceberg,
     exploring defaulting to ResolvingFileIO, and addressing discrepancies
     in null counts for lists, maps, and structs.
   - 35:00 Chapter 6: The team discussed the need to differentiate between
     null counts in nested fields and the top-level field, and decided to
     drop null counts for arrays but keep them for structs. They also
     considered introducing a deletion vector to reduce write
     amplification, but found the proposal unclear and in need of further
     clarification.
   - 46:48 Chapter 7: The team discussed the use of deletion vectors to
     improve performance, but there were many unknowns and decisions that
     needed to be made regarding how to maintain and represent them in
     metadata. They also discussed the implementation of nanosecond
     timestamps and the potential challenges it may pose for engines like
     Spark.

Meeting Minutes from 2023-08-09 Iceberg Sync

2023-10-26 Thread Brian Olsen

Hey Iceberg Nation,
Everyone is welcome to attend syncs. Subscribe to this calendar to receive
a notification. Note: This meeting note is backdated as I forgot to post it
here earlier.

2023-08-09 (Meeting Recording ⭕)


   - Highlights
      - Gradle Version catalog support was added (Thanks, Max!)
      - Pushing down system functions by V2 filters (Thanks, Coney!)
      - Parquet: Cache codecs by name and level (Thanks, Bryan!)
      - Adaptive split sizing in Spark (Thanks, Anton!)
      - Optimizations to the DeleteFileIndex (Thanks, Anton!)
      - Python: add pyarrow hdfs support (Thanks, Luigi!)
      - Display Spark read metrics in Spark UI (Thanks, Karuppayya!)
      - Support creating branch on an empty table (Thanks, Coney!)
      - Awesome progress on the Rust implementation (Thanks, Jan, Renjie,
        Xuanwo)
   - Releases
      - Update on the Iceberg 1.4.0 release
         - Let’s further discuss changing to v2 as the default in this
           release (Anton)
   - Discussion
      - Discuss the location of client projects (rust, go, python). (Brian)
      - Documentation proposal. (Thread, Doc) (Brian)
      - https://github.com/apache/iceberg/issues/3547 (Anton)
      - V2 format as default for new tables (Anton)
      - Delete planning and column stats (Anton)
      - https://github.com/apache/iceberg/pull/8158 (Anton)
      - Write Avro map/list block sizes (Rusty)
      - Publishing Python Wheels


AI-generated chapter summaries:

   - 0:00 Chapter 1: Brian, Daniel, Eduard, and Anton discussed various
     highlights and improvements in their work, including the transition to
     a full-featured version catalog, fixing issues with Parquet caching,
     adding support for Spark read metrics in the Spark UI, optimizing the
     delete file index, and implementing adaptive split size in Spark.
   - 6:40 Chapter 2: Anton, Daniel, Fokko, and Brian discussed various
     updates and improvements to the Spark framework, including split
     planning, ACFS support, performance enhancements, and integration with
     Bolar and GCP. They also mentioned ongoing work and collaborations to
     further enhance the ecosystem.
   - 11:43 Chapter 3: The team discussed the progress and blockers for the
     upcoming release (1.4). They mentioned various features and
     improvements that were being reviewed and targeted for inclusion, such
     as distributed planning in Spark, view implementations, and
     multi-table commit.
   - 17:22 Chapter 4: Brian, Anton, Daniel, and Fokko discussed whether to
     keep Rust separate or merge it with the main repository. They
     considered the benefits of consistency in code structure and
     documentation, as well as the potential challenges of versioning and
     CI/CD complexity.
   - 23:46 Chapter 5: Brian, Fokko, Anton, and others discussed the pros
     and cons of using a monorepo for documentation and code. They
     considered the complexity of managing multiple versions, the
     visibility of changes, and the need for a unified approach. Brian
     proposed a solution involving a binary file to hide versioned
     documentation, and they agreed to experiment and gather more input
     before making a final decision.
   - 33:59 Chapter 6: The team discussed the need to unify the Apache org
     site and the docs site, as well as the challenges of keeping them
     consistent. They also explored the possibility of adding a feature to
     specify the Iceberg sort order in the table creation statement in
     Spark.
   - 44:59 Chapter 7: Daniel, Anton, and Bryan discussed inefficiencies in
     the delete planning and execution process, as well as the possibility
     of using Zstandard (zstd) by default for new tables. They shared their
     findings and ideas for improvement, with the in

Meeting Minutes from 2023-10-11 Iceberg Sync

2023-10-26 Thread Brian Olsen

Hey Iceberg Nation,
Everyone is welcome to attend syncs. Subscribe to this calendar to receive
a notification. Note: This meeting note is backdated as I forgot to post it
here earlier.

2023-10-11 (Meeting Recording <https://youtu.be/euWtAKo_bV4> ⭕)

   - Highlights
      - 1.4.0 was released! (Thanks, Anton!)
         - v2 and zstd defaults
         - Advisory partition size in Spark
         - Skip local sort for unordered writes in Spark
         - Distributed planning in Spark
         - AzureFileIO
         - Multi-table commits through REST
         - Removed Spark 3.1
      - Python moved to the iceberg-python
        <https://github.com/apache/iceberg-python> repo (removed from main)
      - Flink alter table column support was added
        <https://github.com/apache/iceberg/pull/7628> (1.17 only), like
        adding a new column, changing column position (Thanks, Yanghao Lin)
      - Metastore catalog support for views was added (Thanks, Eduard!)
      - Close to write support in Python, supports v1 and v2 metadata
        (Thanks, Fokko!)
      - Rust added read support for manifest lists (Thanks, ZENOTME)
      - Spark: clean up FileIO resources on executors (Thanks, Anton!)
   - Discussion
      - PR commit methods – standardize on squash?
      - Iceberg docs refactor <https://github.com/apache/iceberg/pull/8659>
        (try me
        <https://github.com/bitsondatadev/iceberg/tree/new-docs/docs-new>)
      - Spec v3 changes:
         - New types
            - BLOB
            - BSON/JSON
            - Timestamp{tz}_{ns,ms}
              <https://docs.google.com/document/d/1bE1DcEGNzZAMiVJSZ0X1wElKLNkT9kRkk0hDlfkXzvU/edit>
              (not millis)
            - FLOAT16?
         - Default values
         - Type promotion
            - * to string (choose a format)
               - What are the use cases for changing the type?
               - int/long to string
               - float/timestamp - why?
               - Bool to string should be allowed
            - Long to timestamp (must be millis)
         - Multi-column transforms
            - Bucket v2
            - Geo?
         - Location/path requirements (recommendations)
         - Owned locations (discussion
           <https://lists.apache.org/thread/3fx8povnsq0f4g1xzj38snplr6d3ch1r>)
         - Delete vectors (discussion
           <https://lists.apache.org/thread/gr3g5rrr60fhvy0mrdj4s4w9x8c3v58g>)
         - Allowing relative paths
      - Partition stats spec and discussion in PR 7105
        <https://github.com/apache/iceberg/pull/7105>.

Kafka Connect (discussion
<https://lists.apache.org/thread/d9h22z2ydcpvjxp53yl6w96xoy3dp33h>)

AI-generated chapter summaries:

   - 0:00 Chapter 0: Introduction
   - 5:14 Chapter 1: Highlights. Ryan thanks Anton for releasing v1.4 with
     many bug fixes and changes, including defaulting to v2 format and
     Zstandard (zstd) for data compression. AzureFileIO is now available,
     with native support for multi-table commits in Spark. The PyIceberg
     project moved to a new repository and new Python support was added.
   - 12:01 Chapter 2: PR commit methods and repository setup. Anton
     highlights recent improvements in Spark, including file cleanup and
     manifest file read support, and plans to discuss spec v3 changes with
     the community. The group discusses PR commit methods, suggesting
     standardizing across repositories to use squash and merge by default,
     rather than merge commits. There was concern about enforcing linear
     history on the Java side, citing potential issues with rebase and time
     zones. One suggestion was bringing the issue of inconsistent commit
     messages to the community for resolution. A consensus is built around
     squashing commits to make them more meaningful and easier to
     understand.
   - 19:30 Chapter 3: Improving Iceberg Docs with a monorepo. Brian is
     refactoring the Iceberg documentation to move it back into the main
     Iceberg repo, simplifying maintenance and improving collaboration. He
     proposes to create a single documentation site containing the static
     site and all versions of the docs, solving problems with multiple
     sources and making releases easier. The plan is to merge an initial PR
     and build consensus, then replace the current ASF documentation branch
     and repoint it back to the main re

Re: Meeting Minutes from 2023-10-11 Iceberg Sync

2023-10-27 Thread Brian Olsen

The spacing was lost after sending the email. If you click on the YouTube
link, it splits the YouTube video into chapters and spaces them out. They
are more legible there.

I’ll make sure to add bullet points moving forward.

On Thu, Oct 26, 2023 at 11:00 PM Xuanwo  wrote:

> Thanks for the meeting recording!
>
> The "AI-generated chapter summaries" don't seem very readable. Can this be
> improved?
>
> On Fri, Oct 27, 2023, at 05:25, Brian Olsen wrote:
>
> Hey Iceberg Nation,
> Everyone is welcome to attend syncs. Subscribe to this calendar
> <https://calendar.google.com/calendar/embed?src=3905d492f1b450ba0712f2ae6afa76eb757f13d85220cc03aa4527885adc5629%40group.calendar.google.com&ctz=Asia%2FShanghai>
> to receive a notification. Note: This meeting note is backdated as I forgot
> to post it here earlier.
>
> 2023-10-11(Meeting Recording <https://youtu.be/euWtAKo_bV4> ⭕ )
>
>-
>
>Highlights
>-
>
>   1.4.0 was released! (Thanks, Anton!)
>   -
>
>  v2 and zstd defaults
>  -
>
>  Advisory partition size in Spark
>  -
>
>  Skip local sort for unordered writes in Spark
>  -
>
>  Distributed planning in Spark
>  -
>
>  AzureFileIO
>  -
>
>  Multi-table commits through REST
>  -
>
>  Removed Spark 3.1
>  -
>
>   Python moved to the iceberg-python
>   <https://github.com/apache/iceberg-python> repo (removed from main)
>   -
>
>   Flink  alter table column support  was added
>   <https://github.com/apache/iceberg/pull/7628> (1.17 only), like
>   adding a new column, changing column position (Thanks, Yanghao Lin)
>   -
>
>   Metastore catalog support for views was added (Thanks, Eduard!)
>   -
>
>   Close to write support in Python, supports v1 and v2 metadata
>   (Thanks, Fokko!)
>   -
>
>   Rust added read support for manifest lists (Thanks, ZENOTME)
>   -
>
>   Spark: clean up FileIO resources on executors (Thanks, Anton!)
>   -
>
>Discussion
>-
>
>   PR commit methods – standardize on squash?
>   -
>
>   Iceberg docs refactor <https://github.com/apache/iceberg/pull/8659>
>   (try me
>   <https://github.com/bitsondatadev/iceberg/tree/new-docs/docs-new>)
>   -
>
>   Spec v3 changes:
>   -
>
>  New types
>  -
>
> BLOB
> -
>
> BSON/JSON
> -
>
> Timestamp{tz}_{ns,ms}
> 
> <https://docs.google.com/document/d/1bE1DcEGNzZAMiVJSZ0X1wElKLNkT9kRkk0hDlfkXzvU/edit>
> (not millis)
> -
>
> FLOAT16?
> -
>
>  Default values
>  -
>
>  Type promotion
>  -
>
> * to string (choose a format)
> -
>
>What are the use cases for changing the type?
>-
>
>int/long to string
>-
>
>float/timestamp - why?
>-
>
>Bool to string should be allowed
>-
>
> Long to timestamp (must be millis)
> -
>
>  Multi-column transforms
>  -
>
> Bucket v2
> -
>
> Geo?
> -
>
>  Location/path requirements (recommendations)
>  -
>
>  Owned locations (discussion
>  <https://lists.apache.org/thread/3fx8povnsq0f4g1xzj38snplr6d3ch1r>
>  )
>  -
>
>  Delete vectors (discussion
>  <https://lists.apache.org/thread/gr3g5rrr60fhvy0mrdj4s4w9x8c3v58g>
>  )
>  -
>
>  Allowing relative paths
>  -
>
>   Partition stats spec and discussion in PR 7105
>   <https://github.com/apache/iceberg/pull/7105>.
>
> Kafka Connect (discussion
> <https://lists.apache.org/thread/d9h22z2ydcpvjxp53yl6w96xoy3dp33h>)
>
>
> AI-generated chapter summaries: 0:00
> <https://www.youtube.com/watch?v=euWtAKo_bV4&t=0s> Chapter 0 Introduction
> 5:14 <https://www.youtube.com/watch?v=euWtAKo_bV4&t=314s> Chapter 1
> Highlights Ryan thanks Anton for releasing v1.4 with many bug fixes and
> changes, including defaulting to v2 format and Z standard for data
> compression. Azure file IO is now available, with native support for
> multi-table commits in Spark. pyIceberg project moved to a new repository

Iceberg Logo Fix and Iceberg Swag Shop

2023-10-31 Thread Brian Olsen

Hey Iceberg Nation,

I wanted to address an issue with the Iceberg logo used by the ASF.
Somewhere along the way, a hole was added to the Iceberg logo (global
warming? 😬). I first noticed it when uploading the logo to Wikimedia
Commons <https://commons.wikimedia.org/wiki/File:Apache_Iceberg_Logo.svg>,
but thought it was perhaps intentional at the time.

This came up again when I was looking for options to buy an Iceberg shirt
on RedBubble from the ASF Official store
<https://www.redbubble.com/shop/ap/40954182>. When I saw the logo on a
non-white shirt, I remembered the holey Iceberg.

[image: image.png]

I followed up with Ryan, and he said this hole wasn't originally there and
isn't supposed to be there. I want to add the RedBubble shop to the Iceberg
site. I believe having a way for all of us to show our ❤️ for Iceberg is
one of the best ways to build not only awareness but a common identity
around the project.

The Tabular design team has created a fixed SVG file. I just want to
better understand the steps necessary to get this approved by the PMC, and
where we should submit this to update the ASF logo and get them to
ultimately update the RedBubble site. Following that, I will open a PR to
add the RedBubble site with the fixed logo to our site.

Thanks all,
Bits

[PROPOSAL] Use Microsoft Style Guide for documentation

2023-11-01 Thread Brian Olsen

Hey Iceberg Nation,

As I've gone through the Iceberg docs, I've noticed a lot of
inconsistencies with terminology, grammar, and style. As a distributed
community, we have a lot of non-native English speakers reading and writing
our documentation. I propose we adopt the Microsoft Style Guide to improve
the communication and consistency of the docs.

Common rules like defaulting to present tense not only make the
documentation consistent but also more accessible for those who struggle to
understand complex conjugations. Then there are examples like making sure
to capitalize proper nouns (Spark, Flink, Trino, Apache Software
Foundation, etc.).

You may think, "That's great, Brian, but good luck getting everyone on the
project to read and follow that." I also want to propose adding a prose
linter called Vale, which will enable us to apply the existing rules for
the Microsoft Style Guide, along with our own custom rules, to ensure
consistent style with documentation changes.

Let's discuss this in the sync tomorrow!

Bits

[PROPOSAL] Release process to improve communication of the Iceberg project

2023-11-01 Thread Brian Olsen

Hey Iceberg Nation,

Last proposal from me today, I promise! Another issue I've seen as I've
looked over the documentation is that the release process has detracted
from both the contributor and developer experience when using Iceberg. For
example, missing documentation for features causes real issues during
upgrades for Iceberg users. In general, there's not really a process to
understand when changes should or shouldn't include documentation.

As someone trying to improve the Developer Experience, I want to make it
easy to use all of the latest and awesome features you all are providing
for the community. Although I pride myself on repository detective work, I
am not a mind reader and don't have the time to play code safari to
understand every change. It is incredibly helpful to understand when a pull
request is not a user-facing change (test, documentation, internal
refactoring, etc.) and when users or contributors should be aware of
something so we can note that in our documentation.

Let's do better. Let's make a better developer and contributor experience!

What I don't want to do is add red tape to the development process that
inhibits our ability to deliver. That said, I want to iteratively build a
process that supports my team and others in the community who are
interested in lowering the barrier to entry for developers. Without further
ado, here's what I'm proposing:


   1. Roadmap tracked on GitHub Projects as opposed to the website.
      1. Static sites are not the easiest to keep up to date. It's a lot of
         work to put a GitHub PR process in front of updating a word or
         some small change. Rather, why don't we just use GitHub Projects
         and expose that on the website?
      2. We already use GitHub Projects and it’s closer to the source.
   2. Some ideas gleaned from the Trino project
      1. Keep a running list of merged PRs in the description of the pull
         request that updated the release notes and add these PRs to the
         release milestone (ex. 1.4.0)
      2. Add a PR template with a release notes and docs section. Also add
         checkboxes to communicate that the commit(s) are not a user-facing
         change, and no release notes/docs are needed
      3. Responsibility of all of the committers/authors to add their own
         release notes/docs and verify they have communicated effectively
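
One way to picture the PR template idea is a small check that reads a pull
request description and decides whether release notes are still owed. The
Python sketch below is purely illustrative: the "Release notes" section
title, the "Not a user-facing change" checkbox wording, and the
`needs_release_notes` function are all hypothetical, not an existing
Iceberg or Trino convention.

```python
import re

# Hypothetical checkbox a PR template might contain (checked form).
NOT_USER_FACING = re.compile(
    r"^\s*[-*]\s*\[[xX]\]\s*Not a user-facing change", re.MULTILINE
)
# Hypothetical "Release notes" heading and everything up to the next heading.
RELEASE_NOTES = re.compile(
    r"^#+\s*Release notes\s*\n(?P<body>.*?)(?=^#+\s|\Z)",
    re.MULTILINE | re.DOTALL,
)


def needs_release_notes(pr_body):
    """Return True when the PR looks user-facing but has no release notes."""
    if NOT_USER_FACING.search(pr_body):
        return False
    match = RELEASE_NOTES.search(pr_body)
    if not match:
        return True
    # Ignore blank lines and unchecked template checkboxes in the section.
    notes = [
        line for line in match.group("body").splitlines()
        if line.strip() and not re.match(r"\s*[-*]\s*\[ \]", line)
    ]
    return not notes
```

A check like this could run in CI and leave a reminder comment rather than
block the merge, which keeps it from becoming red tape.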

Happy to discuss this at the sync!

Re: [PROPOSAL] Use Microsoft Style Guide for documentation

2023-11-02 Thread Brian Olsen

@Yufei,

Regarding:

> Love the following example. Not sure if Vale can catch this and provide
> suggestions. It may be only possible with LLM.
>
>> Replace this: If you're ready to purchase Office 365 for your
>> organization, contact your Microsoft account representative.
>> With this: Ready to buy? Contact us.
>
>
Think of Vale as a spelling/grammar check that runs at build time, much
like a syntax linter. The general idea is that there is some application of
regex that checks for patterns in the language and flags them. The flags
give some indication of which rule(s) were violated and, depending on the
settings, Vale will either exit with a non-zero status or print a warning
at build time.

Vale uses various abstractions like styles
<https://vale.sh/docs/topics/styles/>, vocabulary
<https://vale.sh/docs/topics/vocab/>, and packaging to pull together the
list of rules/regexes to flag these issues. One common and concrete example
is avoiding passive voice
<https://learn.microsoft.com/en-us/style-guide/grammar/verbs#active-and-passive-voice>.
This rule is encoded in the community-driven Microsoft style for Vale
<https://github.com/errata-ai/Microsoft/blob/master/Microsoft/Passive.yml> and
will flag sentences like "Apache Iceberg 1.4.2 was released on November 2,
2023". It is then up to you to rephrase the sentence in active voice to
clear that message, for example: "The Iceberg community released Apache
Iceberg 1.4.2 on November 2, 2023".
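
To make the "regex that flags patterns" idea concrete, here is a minimal
Python sketch of a passive-voice check. It is not how Vale is implemented
and it is not the actual Microsoft Passive.yml rule; the pattern, the short
list of irregular participles, and the `find_passive` and `lint` names are
assumptions made only for illustration.

```python
import re

# Hypothetical, simplified pattern: a form of "to be" followed by a word
# ending in "ed" or one of a few irregular participles. Real rule sets are
# far more thorough than this.
PASSIVE = re.compile(
    r"\b(?:am|are|is|was|were|be|been|being)\s+"
    r"(\w+ed|built|done|made|written|shown)\b",
    re.IGNORECASE,
)


def find_passive(text):
    """Return (matched phrase, character offset) for each suspected passive."""
    return [(m.group(0), m.start()) for m in PASSIVE.finditer(text)]


def lint(text):
    """Print a warning per hit; return 1 if anything was flagged, else 0.

    The return value stands in for the process exit status a build could
    use to fail on violations.
    """
    hits = find_passive(text)
    for phrase, offset in hits:
        print(f"warning: passive voice at offset {offset}: '{phrase}'")
    return 1 if hits else 0


passive = "Apache Iceberg 1.4.2 was released on November 2, 2023"
active = "The Iceberg community released Apache Iceberg 1.4.2 on November 2, 2023"
```

Calling `lint(passive)` flags "was released", while `lint(active)` finds
nothing to report. Note that the check only flags the pattern; as with
Vale, rewriting the sentence is still left to the author.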

I'm not comfortable with using LLMs to provide knowledge yet when our
existing documentation is lacking a lot of context and has an inconsistent
tone. As we grow a quality corpus around Iceberg and ecosystems, I am
definitely interested in building tools or integrating with GitHub AI tools
to help generate documentation and PR messaging that the engineer will
later tweak. One step at a time though.

On Wed, Nov 1, 2023 at 5:14 PM Yufei Gu  wrote:

> +1 Love the following example. Not sure if Vale can catch this and provide
> suggestions. It may be only possible with LLM.
>
>> Replace this: If you're ready to purchase Office 365 for your
>> organization, contact your Microsoft account representative.
>> With this: Ready to buy? Contact us.
>
>
> Yufei
>
>
> On Wed, Nov 1, 2023 at 12:20 PM Ryan Blue  wrote:
>
>> +1
>>
>> On Wed, Nov 1, 2023 at 6:38 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi Brian
>>>
>>> I like the proposal, it sounds like a good way to "align" our
>>> documentation.
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On Wed, Nov 1, 2023 at 8:20 AM Brian Olsen 
>>> wrote:
>>> >
>>> > Hey Iceberg Nation, As I've gone through the Iceberg docs, I've
>>> noticed a lot of inconsistencies with terminology, grammar, and style. As a
>>> distributed community, we have a lot of non-native English speakers reading
>>> and writing our documentation. I propose we adopt the Microsoft Style Guide
>>> to improve the communication and consistency of the docs. Common rules like
>>> defaulting to use present tense not only make the documentation consistent
>>> but also more accessible for those who struggle to understand complex
>>> conjugations. Then there are examples like making sure to capitalize proper
>>> nouns like (Spark, Flink, Trino, Apache Software Foundation, etc...). You
>>> may think, that's great Brian, but good luck getting everyone reading the
>>> project and following that. I also want to propose adding a prose linter
>>> called Vale, that will enable us to add the existing rules for the
>>> Microsoft Style Guide, and our own custom rules to ensure consistent style
>>> with documentation changes.
>>> > Let's discuss this in the sync tomorrow! Bits
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

Re: Iceberg Logo Fix and Iceberg Swag Shop

2023-12-06 Thread Brian Olsen

Hey all,

I wanted to resurface this and see if any PMC member could take a look.
Thanks!

On Wed, Nov 1, 2023 at 8:37 AM Jean-Baptiste Onofré  wrote:

> Hi Brian,
>
> Good catch.
>
> We need to get approval from the PMC, and notify ASF VP Brand Management
> (Mark Thomas) by sending a message to tradema...@apache.org.
> We can also be in touch with ASF comdev and marketing teams to help to
> update the logo and so.
>
> I can help with this, don't hesitate to ping me !
>
> Regards
> JB
>
> On Wed, Nov 1, 2023 at 7:56 AM Brian Olsen 
> wrote:
>
>> Hey Iceberg Nation,
>>
>> I wanted to address an issue with the Iceberg Logo used by the ASF.
>> Somewhere along the way, a hole was added to the Iceberg logo (global
>> warming? 😬). I first noticed it when uploading the logo to the
>> Wikipedia Commons
>> <https://commons.wikimedia.org/wiki/File:Apache_Iceberg_Logo.svg>, but
>> thought it was perhaps intentional at the time.
>>
>> This came up again when I was looking for options to buy an Iceberg shirt
>> on RedBubble from the ASF Official store
>> <https://www.redbubble.com/shop/ap/40954182>. However, when looking at
>> the shirts I remembered the holey Iceberg seeing the logo on a non-white
>> shirt.
>>
>> [image: image.png]
>>
>> I followed up with Ryan, and he said this hole wasn't originally there
>> and isn't supposed to be there. I want to add the RedBubble shop to the
>> Iceberg site. I believe having a way for all of us to show our ❤️ for
>> Iceberg is one of the best ways to build not only awareness but a common
>> identity around the project.
>>
>> The Tabular design team has created a fixed SVG file, I just wanted to
>> better understand the steps necessary to get this approved by the PMC, and
>> where we should submit this to update the ASF logo and get them to
>> ultimately update the redbubble site. Following that, I will add a PR to
>> add the Redbubble site with the fixed logo to our site.
>>
>> Thanks all,
>> Bits
>>
>>
>>
>

Re: Iceberg Logo Fix and Iceberg Swag Shop

2023-12-06 Thread Brian Olsen

Thanks Weston and Russell,

To see the old file, look at the Wikimedia Commons image:
https://en.wikipedia.org/wiki/Apache_Iceberg#/media/File:Apache_Iceberg_Logo.svg.
You'll notice the transparent background reveals a triangular hole.

You can also see this in the Apache store on RedBubble when looking on
backgrounds that are not white: https://www.redbubble.com/shop/ap/40954182

If you look at the image we use in the Tabular newsletter, our designers
closed up that hole:
https://tabular.io/images/blog/iceberg-announcements/october-2023.webp

Those are the only public images I know of. Let me know if there are any
issues viewing them.

On Wed, Dec 6, 2023 at 9:53 AM Weston Pace  wrote:

> BTW: ASF mailing lists strip attachments and so you will need to use a
> gist or some other sharing.
>
> On Wed, Dec 6, 2023, 7:22 AM Russell Spitzer 
> wrote:
>
>> The original email has a broken png link so I was never able to see the
>> issue, could you attach the before and after so I can see the difference?
>>
>> On Dec 6, 2023, at 9:07 AM, Brian Olsen  wrote:
>>
>> Hey all,
>>
>> I wanted to resurface this and see if any PMC could take a look. Thanks!
>>
>> On Wed, Nov 1, 2023 at 8:37 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi Brian,
>>>
>>> Good catch.
>>>
>>> We need to get approval from the PMC, and notify ASF VP Brand Management
>>> (Mark Thomas) by sending a message to tradema...@apache.org.
>>> We can also be in touch with ASF comdev and marketing teams to help to
>>> update the logo and so.
>>>
>>> I can help with this, don't hesitate to ping me !
>>>
>>> Regards
>>> JB
>>>
>>> On Wed, Nov 1, 2023 at 7:56 AM Brian Olsen 
>>> wrote:
>>>
>>>> Hey Iceberg Nation,
>>>>
>>>> I wanted to address an issue with the Iceberg Logo used by the ASF.
>>>> Somewhere along the way, a hole was added to the Iceberg logo (global
>>>> warming? 😬). I first noticed it when uploading the logo to the
>>>> Wikipedia Commons
>>>> <https://commons.wikimedia.org/wiki/File:Apache_Iceberg_Logo.svg>, but
>>>> thought it was perhaps intentional at the time.
>>>>
>>>> This came up again when I was looking for options to buy an Iceberg
>>>> shirt on RedBubble from the ASF Official store
>>>> <https://www.redbubble.com/shop/ap/40954182>. However, when looking at
>>>> the shirts I remembered the holey Iceberg seeing the logo on a non-white
>>>> shirt.
>>>>
>>>> [image: image.png]
>>>>
>>>> I followed up with Ryan, and he said this hole wasn't originally there
>>>> and isn't supposed to be there. I want to add the RedBubble shop to the
>>>> Iceberg site. I believe having a way for all of us to show our ❤️ for
>>>> Iceberg is one of the best ways to build not only awareness but a common
>>>> identity around the project.
>>>>
>>>> The Tabular design team has created a fixed SVG file, I just wanted to
>>>> better understand the steps necessary to get this approved by the PMC, and
>>>> where we should submit this to update the ASF logo and get them to
>>>> ultimately update the redbubble site. Following that, I will add a PR to
>>>> add the Redbubble site with the fixed logo to our site.
>>>>
>>>> Thanks all,
>>>> Bits
>>>>
>>>>
>>>>
>>>
>>

Re: Community Meeting Minutes ?

2023-12-07 Thread Brian Olsen

Hey Wing Yew,

Sorry about this. I am just about to publish the last two. The other person
responsible for these and I were hit by a series of family and medical
issues, so apologies. I will put some better backups into place in the
unlikely event we are both out of commission.

Thanks for the push and stand by for the meeting minutes.

On Wed, Dec 6, 2023 at 3:06 PM Wing Yew Poon 
wrote:

> The meeting minutes and a link to the recording used to be sent out to
> this list regularly soon after the community sync. I have not been able to
> attend the sync recently and I haven't seen the minutes for the last two
> syncs. Can we please maintain the practice of sending the minutes and
> recording out?
> Thanks,
> Wing Yew
>
>
> On Fri, Oct 27, 2023 at 2:40 AM Jean-Baptiste Onofré 
> wrote:
>
>> Thanks Brian, much appreciated!
>>
>> Regards
>> JB
>>
>> On Thu, Oct 26, 2023 at 10:29 PM Brian Olsen 
>> wrote:
>> >
>> > Thanks for the reminder here JB. I just created a list to follow for
>> this process so I don't forget. At some point, I'll add it to the
>> documentation so that anyone can run this over time. I will share out the
>> last few meeting minutes in their own threads now.
>> >
>> > On Thu, Oct 12, 2023 at 9:03 AM Jean-Baptiste Onofré 
>> wrote:
>> >>
>> >> Hi guys,
>> >>
>> >> Thanks for the community meeting yesterday, it was super interesting
>> >> and motivating :)
>> >>
>> >> As we say at Apache: "If it didn't happen on the mailing list, it
>> >> never happened" :)
>> >> In order to give a chance to anyone in the community to see the topics
>> >> and participate, it would be great to share the meeting minutes on the
>> >> mailing list.
>> >>
>> >> I know Brian did that in July. It would be great to do it
>> "systematically".
>> >>
>> >> @Brian do you mind sharing the meeting minutes on the mailing list ?
>> >> Do you need my help to complete/review ?
>> >> Maybe we can add it on the website too ?
>> >>
>> >> Thanks !
>> >> Regards
>> >> JB
>>
>

Meeting Minutes from 2023-11-22 Iceberg Sync

2023-12-07 Thread Brian Olsen

Key Takeaways

   - 0:00 Introduction
   - 4:33 No urgent need for 1.4.3 release currently, will wait for
     meaningful bug fixes
   - 25:43 Branches should reflect current table schema, not schema at time
     of branch snapshot
   - 34:13 Need to fail time travel queries on branches since history not
     maintained

Topics

Recent Updates

   - 0:17 Good improvements to delete performance from Anton and others
   - 1:16 Added ability to filter stats after planning to avoid
     deserializing unneeded column stats
   - 2:26 Added metrics to Spark planning phase
   - 2:50 Progress on REST API support in Rust and Python
   - 3:44 Table metadata updates progressing in Python

Iceberg 1.4.3 Release

   - 5:09 Avro CVE mitigation not needed for Iceberg, no urgent need for
     1.4.3 currently
   - 4:28 Will wait for merge consistency bug fix before next patch release

Dependency Version Upgrades

   - 14:08 Avoid major version bumps unless shaded and proven reliable
   - 14:45 Prefer major upgrades in major releases, minor in minor, patches
     in patches
   - 15:14 Watch for breaking changes even in patch releases
   - 21:38 Prioritize avoiding downstream disruption over staying on latest
     versions

Branch Schema Behavior

   - 34:26 Branches should reflect current table schema like main branch
   - 24:28 Makes sense for tags to preserve schema at tag time
   - 24:03 Will update behavior so branches use current table schema

Time Travel on Branches

   - 33:00 As of time not supported on branches currently since history not
     maintained
   - 34:09 Should fail time travel queries on branches
   - 34:17 Will add test and update behavior

Next Steps:

   - 4:23 Release 1.4.3 after meaningful bug fixes
   - 24:06 Review PR updating branch schema behavior
   - 34:13 Add test and fix for time travel on branches

Meeting Minutes from 2023-11-01 Iceberg Sync

2023-12-07 Thread Brian Olsen

Key Takeaways

  - 0:00 Introduction
  - 11:07 Logo update proposed to fix artifact issue
  - 10:38 Release process improvements proposed to better track changes
  - 23:50 Roadmap to be tracked in GitHub issues/projects rather than
static site
  - 35:15 Style guide adoption proposed to improve documentation consistency

Topics:

Logo Update

  - 11:47 Brian noticed an artifact issue on some backgrounds where there is
a hole/gap at the top of the iceberg logo
  - 12:00 This was likely introduced during logo file transfers from Netflix
  - 12:07 Brian has mockup to fix - needs PMC approval and then follow up
with Apache Branding
  - 13:36 Goal is better consistency and representation at Apache events

Release Process Improvements

  - 10:51 Brian proposed better tracking of user-facing changes between
releases, to improve release notes and documentation
  - 11:08 Could track changes in a release-specific PR, requiring committers
to log changes there
  - 21:23 Automation could also help generate a list of changes for the
release manager
  - 22:18 General agreement this could reduce gaps, but should avoid being
too burdensome

Roadmap Tracking

  - 24:27 Currently roadmap is tracked statically on the Iceberg site
  - 24:31 Brian proposes moving it to GitHub issues/projects since that's
already part of existing process
  - 25:01 Roadmap should be the source of truth - site could potentially
embed or link to GitHub
  - 26:11 General agreement to use GitHub issues/projects more formally for
roadmap tracking

Documentation Style Guide

  - 35:15 Brian proposed adopting a style guide and using a linter like Vale
to improve documentation consistency
  - 35:15 Main goals are inclusive language, consistent
grammar/conjugations, capitalization
  - 38:59 General guidance to avoid being too strict or blocking contributions
  - 39:53 Linter warnings could be fixed periodically rather than blocking
merges
  - 22:37 Follow up on dev@ thread for additional discussion

Next Steps:

  - Create PR with proposed logo update
  - 22:43 Start thread on dev@ list to discuss process improvements
  - Migrate roadmap tracking to GitHub
  - Follow up on style guide proposal on dev@ thread

Re: Meeting Minutes from 2023-11-01 Iceberg Sync

2023-12-07 Thread Brian Olsen

Apologies, this should fix the formatting issues.

Iceberg Community Sync (Recorded) - November 01
VIEW RECORDING: https://www.youtube.com/watch?v=1yljcXTAOuA
Meeting Purpose:

Weekly Iceberg community sync up meeting to discuss recent
developments, upcoming releases, and proposals.

Key Takeaways

  - Logo update proposed to fix artifact issue
  - Release process improvements proposed to better track changes
  - Roadmap to be tracked in GitHub issues/projects rather than static site
  - Style guide adoption proposed to improve documentation consistency

Topics:

Logo Update

  - Brian noticed an artifact issue on some backgrounds where there is
a hole/gap at the top of the iceberg logo
  - This was likely introduced during logo file transfers from Netflix
  - Brian has mockup to fix - needs PMC approval and then follow up
with Apache Branding
  - Goal is better consistency and representation at Apache events

Release Process Improvements

  - Brian proposed better tracking of user-facing changes between
releases, to improve release notes and documentation
  - Could track changes in a release-specific PR, requiring committers
to log changes there
  - Automation could also help generate a list of changes for the
release manager
  - General agreement this could reduce gaps, but should avoid being
too burdensome

Roadmap Tracking

  - Currently roadmap is tracked statically on the Iceberg site
  - Brian proposes moving it to GitHub issues/projects since that's
already part of existing process
  - Roadmap should be the source of truth - site could potentially
embed or link to GitHub
  - General agreement to use GitHub issues/projects more formally for
roadmap tracking

Documentation Style Guide

  - Brian proposed adopting a style guide and using a linter like Vale
to improve documentation consistency
  - Main goals are inclusive language, consistent
grammar/conjugations, capitalization
  - General guidance to avoid being too strict or blocking contributions
  - Linter warnings could be fixed periodically rather than blocking merges
  - Follow up on dev@ thread for additional discussion

Next Steps:

  - Create PR with proposed logo update
  - Start thread on dev@ list to discuss process improvements
  - Migrate roadmap tracking to GitHub
  - Follow up on style guide proposal on dev@ thread

On Thu, Dec 7, 2023 at 4:48 PM Brian Olsen  wrote:
>
> Key Takeaways 0:00 Introduction 11:07 Logo update proposed to fix artifact 
> issue 10:38 Release process improvements proposed to better track changes 
> 23:50 Roadmap to be tracked in GitHub issues/projects rather than static site 
> 35:15 Style guide adoption proposed to improve documentation consistency 
> Topics Logo update 11:47 Brian noticed an artifact issue on some backgrounds 
> where there is a hole/gap at the top of the iceberg logo 12:00 This was 
> likely introduced during logo file transfers from Netflix 12:07 Brian has 
> mockup to fix - needs PMC approval and then follow up with Apache Branding 
> 13:36 Goal is better consistency and representation at Apache events - 
> Release Process Improvements 10:51 Brian proposed better tracking of 
> user-facing changes between releases, to improve release notes and 
> documentation 11:08 Could track changes in a release-specific PR, requiring 
> committers to log changes there 21:23 Automation could also help generate a 
> list of changes for the release manager 22:18 General agreement this could 
> reduce gaps, but should avoid being too burdensome - Roadmap Tracking 24:27 
> Currently roadmap is tracked statically on the Iceberg site 24:31 Brian 
> proposes moving it to GitHub issues/projects since that's already part of 
> existing process 25:01 Roadmap should be the source of truth - site could 
> potentially embed or link to GitHub 26:11 General agreement to use GitHub 
> issues/projects more formally for roadmap tracking - Documentation Style 
> Guide 35:15 Brian proposed adopting a style guide and using a linter like 
> Vale to improve documentation consistency 35:15 Main goals are inclusive 
> language, consistent grammar/conjugations, capitalization 38:59 General 
> guidance to avoid being too strict or blocking contributions 39:53 Linter 
> warnings could be fixed periodically rather than blocking merges 22:37 Follow 
> up on dev@ thread for additional discussion - Next Steps: - Create PR with 
> proposed logo update - 22:43 Start thread on dev@ list to discuss process 
> improvements - Migrate roadmap tracking to GitHub - Follow up on style guide 
> proposal on dev@ thread

Re: Meeting Minutes from 2023-11-22 Iceberg Sync

2023-12-07 Thread Brian Olsen

Apologies, this should fix the formatting issues.

Iceberg Community Sync (Recorded) - November 22
VIEW RECORDING: https://www.youtube.com/watch?v=iz0Oex1hQA0
Meeting Purpose:

Weekly Iceberg dev sync meeting to discuss recent updates, issues, and
next steps

Key Takeaways

  - No urgent need for 1.4.3 release currently, will wait for
meaningful bug fixes
  - Branches should reflect current table schema, not schema at time
of branch snapshot
  - Need to fail time travel queries on branches since history not maintained

Topics:

Recent Updates

  - Good improvements to delete performance from Anton and others
  - Added ability to filter stats after planning to avoid
deserializing unneeded column stats
  - Added metrics to Spark planning phase
  - Progress on REST API support in Rust and Python
  - Table metadata updates progressing in Python

Iceberg 1.4.3 Release

  - Avro CVE mitigation not needed for Iceberg, no urgent need for
1.4.3 currently
  - Will wait for merge consistency bug fix before next patch release

Dependency Version Upgrades

  - Avoid major version bumps unless shaded and proven reliable
  - Prefer major upgrades in major releases, minor in minor, patches in patches
  - Watch for breaking changes even in patch releases
  - Prioritize avoiding downstream disruption over staying on latest versions
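
The "major in major, minor in minor, patches in patches" guideline above can be illustrated with a small check. This is only a sketch under assumptions: the function names are hypothetical, versions are assumed to be plain MAJOR.MINOR.PATCH strings, and the "unless shaded and proven reliable" exception and the warning about breaking changes in patch releases are not modeled.

```python
from typing import Tuple

# Ordering of bump sizes, smallest to largest.
_RANK = {"none": 0, "patch": 1, "minor": 2, "major": 3}


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse a plain 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_level(old: str, new: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for the change old -> new."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v[0] != old_v[0]:
        return "major"
    if new_v[1] != old_v[1]:
        return "minor"
    if new_v[2] != old_v[2]:
        return "patch"
    return "none"


def upgrade_allowed(release_level: str, dep_old: str, dep_new: str) -> bool:
    """Allow a dependency bump only if it is no larger than the release's own level."""
    return _RANK[bump_level(dep_old, dep_new)] <= _RANK[release_level]
```

For example, a patch release would accept a dependency going from 1.11.1 to 1.11.3 but not to 1.12.0 under this rule.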

Branch Schema Behavior

  - Branches should reflect current table schema like main branch
  - Makes sense for tags to preserve schema at tag time
  - Will update behavior so branches use current table schema

Time Travel on Branches

  - As of time not supported on branches currently since history not maintained
  - Should fail time travel queries on branches
  - Will add test and update behavior

Next Steps:

  - Release 1.4.3 after meaningful bug fixes
  - Review PR updating branch schema behavior
  - Add test and fix for time travel on branches

On Thu, Dec 7, 2023 at 4:48 PM Brian Olsen  wrote:
>
> Key Takeaways 0:00 Introduction 4:33 No urgent need for 1.4.3 release 
> currently, will wait for meaningful bug fixes 25:43 Branches should reflect 
> current table schema, not schema at time of branch snapshot 34:13 Need to 
> fail time travel queries on branches since history not maintained Topics 
> Recent Updates 0:17 Good improvements to delete performance from Anton and 
> others 1:16 Added ability to filter stats after planning to avoid 
> deserializing unneeded column stats 2:26 Added metrics to Spark planning 
> phase 2:50 Progress on REST API support in Rust and Python 3:44 Table 
> metadata updates progressing in Python Iceberg 1.4.3 Release 5:09 Avro CVE 
> mitigation not needed for Iceberg, no urgent need for 1.4.3 currently 4:28 
> Will wait for merge consistency bug fix before next patch release Dependency 
> Version Upgrades 14:08 Avoid major version bumps unless shaded and proven 
> reliable 14:45 Prefer major upgrades in major releases, minor in minor, 
> patches in patches 15:14 Watch for breaking changes even in patch releases 
> 21:38 Prioritize avoiding downstream disruption over staying on latest 
> versions Branch Schema Behavior 34:26 Branches should reflect current table 
> schema like main branch 24:28 Makes sense for tags to preserve schema at tag 
> time 24:03 Will update behavior so branches use current table schema Time 
> Travel on Branches 33:00 As of time not supported on branches currently since 
> history not maintained 34:09 Should fail time travel queries on branches 
> 34:17 Will add test and update behavior Next Steps: 4:23 Release 1.4.3 after 
> meaningful bug fixes 24:06 Review PR updating branch schema behavior 34:13 
> Add test and fix for time travel on branches

Re: [VOTE] Release Apache Iceberg 1.4.3 RC0

2023-12-26 Thread Brian Olsen

Hey JB,

This directly affects the Trino engine and they have temporarily reverted
to an older version of Iceberg. A lot of folks in this community do well to
keep up with the weekly release cycle to stay ahead of regressions.

https://github.com/trinodb/trino/pull/20159

I would recommend us cutting the release as soon as possible since the
Trino release will happen tomorrow.

On Tue, Dec 26, 2023 at 11:46 AM Jean-Baptiste Onofré 
wrote:

> Hi,
>
> As we are in Christmas time, I give an extra day for people to vote.
>
> Regards
> JB
>
> On Thu, Dec 21, 2023 at 4:53 PM Jean-Baptiste Onofré 
> wrote:
> >
> > Hi Everyone,
> >
> > I propose that we release the following RC as the official Apache
> > Iceberg 1.4.3 release.
> >
> > The commit ID is 9a5d24fee239352021a9a73f6a4cad8ecf464f01
> > * This corresponds to the tag: apache-iceberg-1.4.3-rc0
> > * https://github.com/apache/iceberg/commits/apache-iceberg-1.4.3-rc0
> > *
> https://github.com/apache/iceberg/tree/9a5d24fee239352021a9a73f6a4cad8ecf464f01
> >
> > The release tarball, signature, and checksums are here:
> > *
> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.4.3-rc0
> >
> > You can find the KEYS file here:
> > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
> >
> > Convenience binary artifacts are staged on Nexus. The Maven repository
> URL is:
> > *
> https://repository.apache.org/content/repositories/orgapacheiceberg-1149/
> >
> > Please download, verify, and test.
> >
> > Please vote in the next 72 hours. (Weekends excluded)
> >
> > [ ] +1 Release this as Apache Iceberg 1.4.3
> > [ ] +0
> > [ ] -1 Do not release this because...
> >
> > Only PMC members have binding votes, but other community members are
> > encouraged to cast
> > non-binding votes. This vote will pass if there are 3 binding +1 votes
> > and more binding
> > +1 votes than -1 votes.
>

Re: Community Meeting Minutes ?

2024-01-03 Thread Brian Olsen

Hey all,

I am just about to push them this morning along with the announcement that
today’s meeting is cancelled.

I apologize for yet again having this be delayed. I started working on a
script to automate compiling the AI summary and meeting minutes into a
YouTube-friendly format for the links and meant to do them manually. I
pulled them together manually and will push them shortly.

Sorry, this won’t happen again. As in, I’ll do it manually until I have a
faster way to compile the notes.
What’s the general expectation for Apache projects, or specifically for our
project, on when meeting minutes should be posted? The day of or the next
day?
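
For anyone curious what such a script might involve, here is a minimal sketch of turning "mm:ss" chapter timestamps into YouTube links. It is not the actual script mentioned above: the function names are hypothetical, and it assumes the standard `t=<seconds>s` query parameter on a `watch?v=` URL.

```python
import re
from typing import Iterable, Tuple

# Matches "m:ss", "mm:ss", or "h:mm:ss".
_TIMESTAMP = re.compile(r"^(?:(\d+):)?(\d{1,2}):(\d{2})$")


def to_seconds(timestamp: str) -> int:
    """Convert a 'mm:ss' or 'h:mm:ss' timestamp into a number of seconds."""
    match = _TIMESTAMP.match(timestamp)
    if match is None:
        raise ValueError(f"unrecognized timestamp: {timestamp!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)


def chapter_link(video_id: str, timestamp: str) -> str:
    """Build a link that opens the recording at the given timestamp."""
    return f"https://www.youtube.com/watch?v={video_id}&t={to_seconds(timestamp)}s"


def format_minutes(video_id: str, items: Iterable[Tuple[str, str]]) -> str:
    """Render (timestamp, text) pairs as plain-text list items with links."""
    return "\n".join(
        f"  - {ts} {text} ({chapter_link(video_id, ts)})" for ts, text in items
    )
```

Calling `format_minutes` with a video ID and a list of (timestamp, text) pairs yields one list item per chapter, in the same "  - " style used in the minutes.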


On Wed, Jan 3, 2024 at 12:29 AM Jean-Baptiste Onofré 
wrote:

> Hi,
>
> Agree: we should have a list of "community meeting managers" to handle
> the record, update website, etc.
>
> Regards
> JB
>
> On Wed, Jan 3, 2024 at 7:04 AM Ajantha Bhat  wrote:
> >
> > Hi,
> >
> > I think we didn't share the meeting minutes from December.
> > I would like to volunteer to be the backup for sharing this (provide me
> the permissions for the recording folder).
> >
> > Thanks,
> > Ajantha
> >
> > On Sat, Dec 9, 2023 at 12:21 AM Wing Yew Poon
>  wrote:
> >>
> >> Brian,
> >> Thanks for sending out the meeting minutes (the updated version looks
> good!).
> >> - Wing Yew
> >>
> >>
> >> On Thu, Dec 7, 2023 at 2:07 PM Brian Olsen 
> wrote:
> >>>
> >>> Hey Wing Yew,
> >>>
> >>> Sorry about this. I am just about to publish the last two. Me and the
> other person that is responsible for these were hit by a series of family
> and medical issues so apologies. I will put some better backups into place
> in the unlikely event we are both out of commission.
> >>>
> >>>  Thanks for the push and stand by for the meeting minutes.
> >>>
> >>> On Wed, Dec 6, 2023 at 3:06 PM Wing Yew Poon
>  wrote:
> >>>>
> >>>> The meeting minutes and a link to the recording used to be sent out
> to this list regularly soon after the community sync. I have not been able
> to attend the sync recently and I haven't seen the minutes for the last two
> syncs. Can we please maintain the practice of sending the minutes and
> recording out?
> >>>> Thanks,
> >>>> Wing Yew
> >>>>
> >>>>
> >>>> On Fri, Oct 27, 2023 at 2:40 AM Jean-Baptiste Onofré 
> wrote:
> >>>>>
> >>>>> Thanks Brian, much appreciated!
> >>>>>
> >>>>> Regards
> >>>>> JB
> >>>>>
> >>>>> On Thu, Oct 26, 2023 at 10:29 PM Brian Olsen <
> bitsondata...@gmail.com> wrote:
> >>>>> >
> >>>>> > Thanks for the reminder here JB. I just created a list to follow
> for this process so I don't forget. At some point, I'll add it to the
> documentation so that anyone can run this over time. I will share out the
> last few meeting minutes in their own threads now.
> >>>>> >
> >>>>> > On Thu, Oct 12, 2023 at 9:03 AM Jean-Baptiste Onofré <
> j...@nanthrax.net> wrote:
> >>>>> >>
> >>>>> >> Hi guys,
> >>>>> >>
> >>>>> >> Thanks for the community meeting yesterday, it was super
> interesting
> >>>>> >> and motivating :)
> >>>>> >>
> >>>>> >> As we say at Apache: "If it didn't happen on the mailing list, it
> >>>>> >> never happened" :)
> >>>>> >> In order to give a chance to anyone in the community to see the
> topics
> >>>>> >> and participate, it would be great to share the meeting minutes
> on the
> >>>>> >> mailing list.
> >>>>> >>
> >>>>> >> I know Brian did that in July. It would be great to do it
> "systematically".
> >>>>> >>
> >>>>> >> @Brian do you mind sharing the meeting minutes on the mailing
> list ?
> >>>>> >> Do you need my help to complete/review ?
> >>>>> >> Maybe we can add it on the website too ?
> >>>>> >>
> >>>>> >> Thanks !
> >>>>> >> Regards
> >>>>> >> JB
>

Meeting Minutes from 2023-12-13 Iceberg Sync

2024-01-03 Thread Brian Olsen

Hey all,

Here are the meeting minutes from the meeting just before the holiday.

https://www.youtube.com/watch?v=OwyBlUi2CRc

- Highlights
  - Encryption: Added StandardEncryptionManager
    (https://github.com/apache/iceberg/pull/6884) (Thanks, Gidon!)
  - Views: Added support in REST catalog (Thanks, Eduard!)
  - Views: Added support in Nessie catalog (Thanks, Ajantha!)
  - Added support for spec ID in rewrite_manifests
    (https://github.com/apache/iceberg/pull/9242) procedure (Thanks, Pucheng!)
  - Java: Flink - add watermark alignment support to FLIP-27 source #8553
    (https://github.com/apache/iceberg/pull/8553) (Thanks, Peter!)
  - Rust:
    - Working toward a first release! Documentation
      (https://github.com/apache/iceberg-rust/issues/114) (how-to-install,
      examples, how-to-release
      (https://github.com/apache/iceberg-rust/issues/81)) and final tests
      (https://github.com/apache/iceberg-rust/issues/70) are still pending.
    - Merged the manifest-list and manifest reader/writer
      (https://github.com/apache/iceberg-rust/pull/79) today (Thanks, Jan,
      Renjie, Xuanwo)

- Releases
  - PyIceberg 0.6.0 (https://github.com/apache/iceberg-python/milestone/1)
    release with write support
  - Java 1.5.0 - January
    - Delete executor cache (Spark)
    - Flink 1.18 support (merged!) - compatibility broken from Flink
    - FLIP-27 source by default
    - View support in Spark 3.5
    - Remove spark 3.2 folder (It was deprecated in the previous release) ?
  - Java 2.0.0
    - Upgrade to a newer JDK version (JDK 17)?
    - Drop Hive from the source tree?
    - ResolvingFileIO as the default
      (https://github.com/apache/iceberg/pull/8272)
- Discussion
  - REST catalog scan planning
    - Sharding
    - Pagination
    - Require sharding?
  - REST catalog append (and other operations?)
  - REST catalog planning with other formats (Hive, Delta)
  - Name mapping
  - Fixing ID fallback
  - Hive catalog views

AI-generated chapter summaries:
0:00 Introduction
6:09 Good progress being made on Rust implementation, targeting first
release soon
17:49 Targeting Spark view support for Iceberg Java 1.5 release in January
21:30 Discussed plans for Iceberg Java 2.0 to clean up deprecated code
27:57 Discussed REST API proposals for scan planning and other operations
46:37 Discussed best way to represent Iceberg views in Hive catalog

Topics
Rust Implementation Update

  - 6:09 Rust implementation making good progress, manifest list and
reader/writer code was merged recently
  - 6:23 Targeting first Rust release soon that can load tables and do
some scan planning
  - 7:04 Documentation will be on Rust side initially but can migrate later

Iceberg Python 0.6 Release Planning

  - 6:23 Working on getting write support in for 0.6 release
  - 8:08 Snapshot regeneration was a big step
  - 8:33 Call for any other items people want to get into 0.6

Iceberg Java 1.5 Release Planning

  - 8:49 Targeting release in January, about 2 months since last release
  - 17:49 Want to get Spark view support in, though won't block release on it
  - 20:39 Also want to drop Spark 3.2 support
  - 14:27 Planning to make the FLIP-27 source the default for Flink

Iceberg Java 2.0 Planning

  - 21:30 Main goal is to clean up deprecated code
  - 22:54 Also want to remove unsafe ID fallback resolution
  - 24:09 Likely keep spec v3 work separate for Java 3.0 release

REST API Proposals

  - 27:57 Discussed scan planning proposal to break up planning into shards
  - 27:57 Allows parallel planning on server and client side
  - 27:57 Some concerns around pagination and making it fully stateless

01-03-2024 Community Sync is Cancelled for the holidays

2024-01-03 Thread Brian Olsen

Hello everyone,

The community sync that would generally happen today has been
cancelled as many are still on holiday this week. I hope you all had
a fun holiday and New Year's! There are a lot of great things on the
docket for this year and I can't wait to meet them all with you!

Take care and safe travels if you're returning home!

- Bits

Re: [PROPOSAL] Improvement on our PR flows

2024-01-03 Thread Brian Olsen

+1

My team did an initial manual review of the Trino backlog and we found a
lot of value there.

1) We found 3 PRs that were ready for merge but accidentally missed the
boat for deployment.
2) We revived a few older PRs where there was actual interest from the
developers.
3) With the PR count down, maintainers could triage easier.

I think the manual part helped us catch a few things that we may not have
with a stale bot, but the cost/reward ratio was likely not worth it.

I do think it would be valuable to add to our current message a very clear
follow-up action that notifies a human to get involved. In Trino we had
a dedicated group to tag, and one of us would respond and figure out who to
hand it off to.

On Wed, Jan 3, 2024 at 7:18 PM John Zhuge  wrote:

> +1 good idea
>
> On Wed, Jan 3, 2024 at 5:15 PM Renjie Liu  wrote:
>
>> +1 for this enhancement.
>>
>> On Thu, Jan 4, 2024 at 2:19 AM Jack Ye  wrote:
>>
>>> +1, sounds like a good idea to clean up stale PRs.
>>>
>>> -Jack
>>>
>>> On Wed, Jan 3, 2024 at 9:52 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
 I definitely need something to keep emailing me, so I support this.

 On Wed, Jan 3, 2024 at 7:52 AM Jean-Baptiste Onofré 
 wrote:

> Hi guys,
>
> We have several examples where  we have some kind of "stale" PRs,
> either because we are waiting for a review, or we are waiting for
> changes from the contributor.
>
> We are already using two jobs around issues/PRs:
> - labeler to label PRs depending of the Iceberg modules change scope
> - stale to stale/close issues (we don't touch PRs in stale job today)
>
> In order to "improve" the PRs flow, I would like to propose the
> following:
>
> 1. We keep our labeler as it is. I propose to add
> .github/reviewers.yml to automatically add reviewers depending on the
> labels. It would look like (this is just an example, I will do a more
> concrete setup in a PR if there are no objection):
>
> labels:
>   - name: API
>     reviewers:
>       - rdblue
>       - aokolnychyi
>       - Fokko
>     exclusionList: []
>   - name: CORE
>     reviewers:
>       - rdblue
>       - Fokko
>       - nastra
>     exclusionList: []
>   - name: FLINK
>     reviewers:
>       - nastra
>     exclusionList: []
>   ...
> fallbackReviewers:
>   - rdblue
>   - Fokko
>   - nastra
>   - jbonofre
>
> 2. We can update the stale job to add a reminder message to
> reviewer/contributor on PR. For instance, something like:
>
> name: Mark and close stale issues and pull requests
>
> on:
>   schedule:
>     - cron: '0 0 * * *'
>   workflow_dispatch:
>
> permissions: read-all
> jobs:
>   stale:
>     runs-on: ubuntu-latest
>     permissions:
>       issues: write
>       pull-requests: write
>     steps:
>       - uses: actions/stale@v9
>         with:
>           stale-issue-label: 'stale'
>           exempt-issue-labels: 'not-stale'
>           days-before-issue-stale: 180
>           days-before-issue-close: 14
>           stale-issue-message: >
>             This issue has been automatically marked as stale because
>             it has been open for 180 days with no activity. It will be
>             closed in the next 14 days if no further activity occurs. To
>             permanently prevent this issue from being considered stale,
>             add the label 'not-stale', but commenting on the issue is
>             preferred when possible.
>           close-issue-message: >
>             This issue has been closed because it has not received any
>             activity in the last 14 days since being marked as 'stale'
>           stale-pr-message: 'This pull request has been marked as
>             stale due to 15 days of inactivity. It will be closed in 1
>             week if no further activity occurs. If you think that’s
>             incorrect or this pull request requires a review, please
>             simply write any comment. If closed, you can revive the PR
>             at any time and @mention a reviewer or discuss it on the
>             dev@iceberg.apache.org list. Thank you for your
>             contributions.'
>           close-pr-message: 'This pull request has been closed due to
>             lack of activity. If you think that is incorrect, or the
>             pull request requires review, you can revive the PR at any
>             time.'
>           stale-pr-label: 'stale'
>           days-before-pr-stale: 15
>           days-before-pr-close: 7
>           exempt-pr-labels: "pinned,security"
>           operations-per-run: 100
>
> Thoughts ?
>
> PS: I did set up this on Apache Beam for example, and we did speed up
> the review and PR flows.
>
> Regards
> JB
>

>
> --
> John Zhuge
>
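
The label-based reviewer assignment proposed in the quoted message could work roughly as sketched below. This is an illustration only, not the behavior of any particular GitHub action: the function name is hypothetical, the mapping simply mirrors the example config from the thread, and skipping the PR author is an assumption.

```python
from typing import Dict, Iterable, List

# Mirrors the example reviewers.yml from the thread (illustrative only).
REVIEWERS_BY_LABEL: Dict[str, List[str]] = {
    "API": ["rdblue", "aokolnychyi", "Fokko"],
    "CORE": ["rdblue", "Fokko", "nastra"],
    "FLINK": ["nastra"],
}
FALLBACK_REVIEWERS: List[str] = ["rdblue", "Fokko", "nastra", "jbonofre"]


def pick_reviewers(labels: Iterable[str], author: str) -> List[str]:
    """Collect reviewers for a PR's labels, falling back when none match."""
    candidates: List[str] = []
    for label in labels:
        for reviewer in REVIEWERS_BY_LABEL.get(label, []):
            # Assumption: never ask the author to review their own PR.
            if reviewer != author and reviewer not in candidates:
                candidates.append(reviewer)
    if not candidates:
        # No label matched (or only the author matched): use the fallback list.
        candidates = [r for r in FALLBACK_REVIEWERS if r != author]
    return candidates
```

The fallback list is what guarantees that a human is always notified, even for PRs whose labels have no configured reviewers.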

Re: [DISCUSS] Iceberg community summit

2024-01-12 Thread Brian Olsen

Hey Iceberg nation,

I would like to volunteer to be on the selection committee. I have a lot of
experience from my time working on the Trino Community. I helped run the
Trino Summits in 2021 and 2022 (
https://trino.io/blog/2022/11/21/trino-summit-2022-recap). The selection
committee was Martin Traverso, Manfred Moser, and myself. We always
believed that the primary goal of a successful summit was enablement and
bringing new faces into the project, while driving net new awareness with
the audiences that sponsors bring.

I’ve written on Iceberg (
https://trino.io/blog/2021/05/03/a-gentle-introduction-to-iceberg) and more
recently have been focused on refactoring documentation (
https://github.com/apache/iceberg/pull/8919) while simultaneously taking
stock of areas in the docs that need to be filled. I have also started work
on a blog series that revamps the messaging for Iceberg 101 and have done a
fair amount of research into where we need more discussion to lower the
barrier for Iceberg adoption. With the work on docs and research I’ve
looked at, I would love the opportunity to help build a speaker lineup that
discusses fundamental Iceberg architecture concepts and how those relate to
real problems that were solved. I would also aim to balance this with a
healthy focus on state-of-the-art improvements and roadmap discussions.

I hope you’ll consider me for the selection committee this year and either
way, I’m happy to help in any other way I can. Thanks JB and Ryan for your
continued work here.

On Fri, Jan 12, 2024 at 12:00 PM Alex Merced 
wrote:

> I'm glad to volunteer in anyway I can be helpful
>
> On Fri, Jan 12, 2024 at 12:54 PM Jack Ye  wrote:
>
>> Thanks for continuing the effort! I definitely would like to volunteer if
>> possible!
>>
>> Best,
>> Jack Ye
>>
>> On Fri, Jan 12, 2024 at 9:49 AM Ryan Blue  wrote:
>>
>>> Hi everyone,
>>>
>>> We've been having discussions about how to put together an Iceberg
>>> conference or summit for this year and one of the first steps is to put
>>> together a selection committee that will be responsible for choosing talks
>>> and guiding the process. Once we have a selection committee, we can put
>>> together the concrete proposal for the ASF and the Iceberg PMC to request
>>> the ability to use the name Iceberg.
>>>
>>> If you'd like to help and be part of the selection committee, please
>>> volunteer in a reply to this thread.
>>>
>>> Since we likely can't include everyone that volunteers, I propose that
>>> the PMC should choose the final committee from the set of people that
>>> volunteer. We'll leave this open for the next week or so to give people
>>> time.
>>>
>>> Ryan
>>>
>>>
>>> --
>>> Ryan Blue
>>>
>>
>
> --
>
> Alex Merced
>
> Developer Advocate
>
> alex.mer...@dremio.com
>

Re: [ANNOUNCE] New committer: Honah J.

2024-01-12 Thread Brian Olsen

Congratulations Honah! PyIceberg is gonna make Iceberg adoption soar!

On Fri, Jan 12, 2024 at 3:29 PM Yufei Gu  wrote:

> Congrats Honah!
>
> Yufei
>
>
> On Fri, Jan 12, 2024 at 1:25 PM Hussein Awala  wrote:
>
>> Congrats Honah!
>>
>> On Fri 12 Jan 2024 at 22:23, Micah Kornfield 
>> wrote:
>>
>>> Congrats!
>>>
>>> On Friday, January 12, 2024, Jack Ye  wrote:
>>>
 Congratulations! Thanks for all the work in python!

 Best,
 Jack Ye

 On Fri, Jan 12, 2024 at 1:11 PM Fokko Driesprong 
 wrote:

> On behalf of the Iceberg PMC, I'm happy to announce that Honah has
> accepted an invitation to become a committer on Apache (Py)Iceberg.
> Welcome, and thank you for your contributions!
>
> Kind regards,
> Fokko
>

Re: Process for creating new Proposals

2024-01-17 Thread Brian Olsen

+1 to issues and the suggested process

On Mon, Jan 15, 2024 at 3:12 AM Jean-Baptiste Onofré 
wrote:

> Hi Jan
>
> You are right, we quickly discussed about this during community
> meeting and on the mailing list.
>
> First, we discussed about using GitHub Discussions, but we agreed on
> using GitHub Issues.
> I like your proposal: creating a GitHub Issues with "Proposal:" prefix
> on the title sounds good to me.
> The discussions can happen on the GitHub Issues Comment.
>
> Regards
> JB
>
> On Mon, Jan 15, 2024 at 9:14 AM Jan Kaul 
> wrote:
> >
> > Hey all,
> >
> > I was wondering if the community decided on a standard way to create new
> > proposals. In the community meeting it sounds like there is a consensus
> > on using Github issues with a special "proposal" label. I think it would
> > also be great to decide on how the proposal process should look like so
> > that we could publish it on the website.
> >
> > The process could look something like this:
> >
> > 1. The community member that wants to create a proposal creates a Github
> > issues starting with "[Proposal]". The special mark makes it easier to
> > find issues intended as proposals. The proposal text can either be in
> > the issue description or in a Google doc that is being linked to from
> > the issue description.
> >
> > 2. If the initial proposal is accepted, the Github issue is labelled
> > "proposal". All issues with a "proposal" label can be found in a
> > dedicated "Proposals" project. The "Proposals" project is further
> > divided into different stages. Initially a proposal gets assigned the
> > "stage 0".
> >
> > 3. If the proposal fulfills certain requirements like detailed
> > specification, reference implementation, presented at a community
> > meeting, ... it can be decided to promote the proposal to a higher stage.
> >
> > 4. If the proposal reaches the final stage it is considered accepted and
> > a Github issue is created that tracks the actual implementation.
> >
> > I would be interested in your opinions. Let me know what you think.
> >
> > Best wishes,
> >
> > Jan
> >
>

Re: Proposal to fix the docs - this time it'll be different

2024-01-19 Thread Brian Olsen

Hey all,

Thanks all for your patience on the documentation refactor. We're nearing
the completion of the first (and most time-consuming) phase. I'd like to
reiterate why we're doing yet another docs refactor.

Issues addressed by the first documentation refactor (from the original docs):
- We don't want the versioned docs or javadoc files to be tracked in the
main branch to avoid multiple copies of the docs being indexed in GitHub or
IDEs.
- We need a top level (non-versioned) Iceberg website that holds versioned
docs.

Issues addressed by this refactor, in addition to the original issues:
- The current docs release process is cumbersome and the code lives across
multiple repositories making it difficult to know where to contribute for
documentation: https://github.com/apache/iceberg-docs/blob/main/README.md.
- We wanted there to be an easy way to apply retroactive fixes to older doc
versions.
- A simple release process that should just be pushing a button and
reviewing a PR.
- Restyle Mkdocs default theme to look like the existing Iceberg theme.
- Fix broken links (there were a lot).

More design details:
https://docs.google.com/document/d/1WJXzcwC6isfoywcLY2lZ9gZN6i0JU1I2SIZmCkumbZc/edit?usp=sharing

Phase 1 Solution:

We've moved to an mkdocs-material theme, which has a rich ecosystem of
plugins that enable us to do the versioning under a single build process.
Code: https://github.com/apache/iceberg/tree/main/site
Preview: https://apache.github.io/iceberg/
We're currently running the preview of the new docs site on the main repo's
github pages link (no worries, https://iceberg.apache.org/ is currently
hosted on the ASF infra site in the iceberg-docs repo
https://github.com/apache/iceberg-docs/blob/main/.asf.yaml#L38-L39)

Improved Release process:

I have some questions for the community about how the release process
should go. I want to automate it as much as we can, while keeping a
control in place. My suggested approach would be to automate the first
three steps and have a fourth manual step for the release manager:

   1. Invoke docs release workflow after main release (Unless we decide
   the steps below shouldn't be automated):
      1. Create a copy of the versioned documentation and build the
      Javadocs.
      2. Automerge these builds into the headless
      [docs branch](https://github.com/apache/iceberg/tree/docs) and the
      [javadoc branch](https://github.com/apache/iceberg/tree/javadoc)
      which are independent from the main branch. I don't see much benefit
      in reviewing the docs branch that has already been reviewed going
      into main, or reviewing the 1k+ change for the javadocs, hence those
      would be auto-merged and easy to roll back if something rendered
      improperly before the final PR.
      3. Create a pull request with an offline build of the documentation
      to verify everything renders correctly.
   2. The release manager validates the site/docs and merges that PR to
   finalize the docs release.
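
To make the split between automated steps and the manual gate concrete, here
is a rough sketch of the plan above as data. It is only an illustration: the
names `Step`, `docs_release_plan`, and `manual_gates` are made up for this
example and are not part of the Iceberg repository or its workflows.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    description: str
    automated: bool  # False means the release manager has to act


def docs_release_plan(version: str) -> List[Step]:
    """Return the proposed docs release steps in order (illustrative only)."""
    return [
        Step(f"Copy the versioned docs for {version} and build the Javadocs", True),
        Step("Auto-merge the builds into the docs and javadoc branches", True),
        Step("Open a pull request with an offline build of the site", True),
        Step("Release manager validates the site and merges the pull request", False),
    ]


def manual_gates(plan: List[Step]) -> List[Step]:
    """Pick out the steps that still need a human."""
    return [step for step in plan if not step.automated]


if __name__ == "__main__":
    for number, step in enumerate(docs_release_plan("1.5.0"), start=1):
        kind = "auto" if step.automated else "manual"
        print(f"{number}. [{kind}] {step.description}")
```

The point of keeping the last step manual is that nothing reaches the live
site without the release manager looking at a rendered build first.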

For more information see the README:
https://github.com/apache/iceberg/blob/main/site/README.md
The final merge to swap is here: https://github.com/apache/iceberg/pull/9520

I want to leave some time and capture a vote on when we'll be ready to move
forward, and address any concerns before we swap the sites.

Along with your vote, could you also let me know if you're in favor of
automating the release steps except for the final merge into the main
branch that updates the site version to use in the static pages?

Thanks all!

Bits

On Fri, Sep 29, 2023 at 12:20 AM Jean-Baptiste Onofré 
wrote:
>
> Hi Brian
>
> Thanks for the update. I will take a look.
>
> Regards
> JB
>
> On Fri, Sep 29, 2023 at 7:05 AM Brian Olsen wrote:
>>
>> Hey All,
>>
>> I know it's been a while but the first phase of the docs refactor has
landed. I think it's at a decent point for everyone to take a look. To be
clear, this is not going to replace the existing website yet, but get the
first large landing of new docs to provide the initial proof of concept for
the build and make incremental changes until we are comfortable making the
swap. Once this is in and 1.4.0 goes out, I'll have to retroactively create
tags for each prior version of the documentation. While that's happening,
we can have someone else work on the look and feel of the website, to look
closer to our current site.
>>
>> https://github.com/apache/iceberg/pull/8659
>>
>> Thanks! Let me know if you have any questions!
>>
>> - Bits
>>
>> On Thu, Jul 27, 2023 at 4:10 PM Szehon Ho 
wrote:
>>>
>>> Hi
>>>
>>> I'm ok with putting things back in Iceberg repo, it gets more visibility
on PRs.  I guess it used to be a bit distracting, but now with more
projects in Iceberg (pyiceberg, rust) we have to anyway

Meeting minutes 2024-01-24

2024-01-24 Thread Brian Olsen

Hey all,

We have quite a bit going on this first sync post holiday. Looking forward
to seeing the Java/Python releases (especially writes in PyIceberg) coming
soon. The switch to the new docs is about to happen. Also there's a really
great discussion at the end on GeoParquet.

Amazing work Iceberg nation!

Recording/Transcript: https://www.youtube.com/watch?v=GCHaFrPVbCQ
* Highlights
  * Added Parquet and Avro encryption support (Thanks, Gidon!)
  * Added Spark support for reading, dropping, renaming views (Thanks, Eduard!)
  * Added initial setup for Kafka Connect (Thanks, Bryan!)
  * Added Spark support for delete file granularity (Thanks, Anton!)
  * Removed support for Spark 3.2 (Thanks, Ajantha!)
  * Parallelized file footer reads in add_files procedure (Thanks, Manu!)
  * Docs: Updated new docs to look like the existing docs (Thanks, Bits!)
  * PyIceberg
    * Added write support in PyIceberg (Thanks, Fokko!)
    * Added commit support for PyIceberg Hive (Thanks, Kevin!)
    * Added commit support for SQL catalog (Thanks, Sung!)
    * Added Parquet name mapping support in PyIceberg (Thanks, Sung!)
  * Rust: Added expressions and basic scan planning (Thanks, Renjie!)
    * [https://rust.iceberg.apache.org/](https://rust.iceberg.apache.org/)
* Releases
  * Docs refactor (Bits)
    * Relevant email thread: [https://lists.apache.org/thread/dv2x9n6pkykmstkhrw1fx934qjtxnofy](https://lists.apache.org/thread/dv2x9n6pkykmstkhrw1fx934qjtxnofy)
    * Docs: [https://github.com/apache/iceberg/tree/main/site](https://github.com/apache/iceberg/tree/main/site)
    * Need votes, will aim to do cutover prior to 1.5.0 and manual release
  * Java 1.5.0 ([milestone](https://github.com/apache/iceberg/milestone/37))
    * Kafka Connect
    * CREATE VIEW
    * Default values in copy-on-write MERGE (Spark 3.4)
    * View support on JDBC, Hive, REST, …
  * Python 0.6.0
    * [https://github.com/apache/iceberg-python/milestone/1](https://github.com/apache/iceberg-python/milestone/1)
    * Write support
* Discussion
  * Spatial extensions to Iceberg
    * Prior art: [Geoparquet](https://github.com/opengeospatial/geoparquet), [Geoarrow](https://github.com/geoarrow/geoarrow), [Havasu](https://github.com/wherobots/havasu/)
    * [WKB](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary) - Well integrated, not the most performant/space considerate
    * Better performance would be closer to geoarrow, but will be less familiar to developers and anyone trying to integrate with Geo format
    * How do we actually store these we will dive into in a follow up discussion facilitated on the dev list.

[DISCUSS] Release new Iceberg docs site in the main repository

2024-01-26 Thread Brian Olsen

Hey everyone,

As discussed during the community sync, I'd like to get a vote on moving
forward with the documentation. I have created a PR (
https://github.com/apache/iceberg/pull/9520) that references the changes
that have happened up to this point.

   - Simpler contribution by collocating the website and documentation in
   the same repository.
   - We don't want the versioned docs or javadoc files to be tracked in the
   main branch to avoid multiple copies of the docs being indexed in GitHub
   or IDEs.
   - We need a top level (non-versioned) Iceberg website that links
   versioned docs and contains evergreen constructs.
   - The current docs release process is cumbersome and the code lives
   across multiple repositories making it difficult to know where to
   contribute for documentation:
   https://github.com/apache/iceberg/issues/8151.
   - We wanted there to be an easy way to apply retroactive fixes to older
   doc versions.
   - A simple release process can now be automated once we validate things
   work well manually by starting a workflow and reviewing a PR.
   - Restyle Mkdocs default theme to look like the existing Iceberg theme.
   - Fix broken links (there were a lot).


It would be great to get a quick vote on moving forward with this process.
Thanks!

- Bits

Re: [PROPOSAL] Create user mailing list ?

2024-01-30 Thread Brian Olsen

I do like the idea of making the Slack threads available through the
mailing list. Is there a slack bot you have in mind? How would the threads
appear in the mailing list?

On Tue, Jan 30, 2024 at 7:13 AM Jean-Baptiste Onofré 
wrote:

> Hi guys,
>
> If we have a few user questions on the dev mailing list, we have quite
> a number on Slack.
> It's completely fine but not easy to search the questions and find the
> concrete answer.
>
> As most other Apache projects do, I propose to create a user mailing
> list to invite people to ask questions and request help.
> This mailing list can be browsed and searched on
> https://lists.apache.org/ and can be moderated.
> We can use slackbot to create a "bridge" between slack and the user
> mailing list.
>
> Thoughts ?
>
> Regards
> JB
>

Re: [Discussion] Iceberg 1.5.0 release

2024-02-01 Thread Brian Olsen

We’re adding on the url and then the docs release will be ready! Should be
merged today, and then I’ll explain the two steps you’ll need to do and be
standing by during the docs release.

Thanks Ajantha!

On Thu, Feb 1, 2024 at 8:18 AM Jean-Baptiste Onofré  wrote:

> Hi Ajantha
>
> About the JDBC catalog view support, I updated the PR, it's now ready
> for a new review round (I pinged Eduard).
>
> Regards
> JB
>
> On Thu, Feb 1, 2024 at 3:11 PM Ajantha Bhat  wrote:
> >
> > Thanks to everyone who has added the PRs to the 1.5.0 milestone and got
> it merged.
> >
> > https://github.com/apache/iceberg/milestone/37
> > Looks like we have 2 more things needed for release from the above
> milestone tag.
> > Mostly they need review.
> > - JDBC catalog view support (https://github.com/apache/iceberg/pull/9487
> )
> > - Spark 3.5 Support executor cache locally (
> https://github.com/apache/iceberg/pull/9563)
> >
> > Also this is a last call for PRs, let me know if anything else is
> urgently needed for 1.5.0 release.
> >
> > - Ajantha
> >
> > On Wed, Jan 3, 2024 at 11:33 AM Jean-Baptiste Onofré 
> wrote:
> >>
> >> Hi Ajantha
> >>
> >> It sounds good to me. I would like to submit PR for views support on
> JDBC Catalog. But I guess it will take time for review.
> >>
> >> If we can wait at least a week before starting the release process it
> would be great.
> >> You will need help from a PMC member to complete some tasks.
> >>
> >> I saw also the community meeting has been canceled. It would have been
> good to have a message on the mailing list to explain. The community could
> have made it even if a few people are still on vacation. I would be ok to
> take notes and record.
> >>
> >> Regards
> >> JB
> >>
> >> On Wed, Jan 3, 2024 at 6:44 AM Ajantha Bhat wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I would like to volunteer to be the release manager for Iceberg 1.5.0.
> >>>
> >>> We have decided to release 1.5.0 in Jan during the last community sync
> (as per our quarterly release cycle).
> >>> Since today's community sync got cancelled (not sure why), I don't
> want to wait another 3 weeks to start the release process.
> >>>
> >>> Iceberg 1.5.0 milestone is already created
> https://github.com/apache/iceberg/milestone/37.
> >>> If you would like to include your open PR for the 1.5.0 release,
> please add this milestone (please tag me if you don't have permission to
> add it).
> >>>
> >>> If we start the process by now, I hope we can have a release by the
> end of this month.
> >>>
> >>> Note:
> >>> Iceberg views APIs are merged and few catalogs support it now.
> >>> Query Engines (like Dremio, Trino) want to start integrating this.
> Hence, looking forward to the release.
> >>>
> >>> Thanks,
> >>>
> >>> Ajantha
>

Re: [DISCUSS] Release new Iceberg docs site in the main repository

2024-02-01 Thread Brian Olsen

Will do! I also have todos to make iceberg-docs read only

On Thu, Feb 1, 2024 at 7:52 AM Ajantha Bhat  wrote:

> +1,
>
> Please also update the
> https://iceberg.apache.org/how-to-release/#documentation-release after
> that.
> I will try it out for the 1.5.0 release.
>
>
> - Ajantha
>
> On Wed, Jan 31, 2024 at 4:26 AM Jack Ye  wrote:
>
>> Sorry for the late vote, +1 and thanks for the great work!
>>
>> -Jack
>>
>> On Tue, Jan 30, 2024 at 7:22 AM Eduard Tudenhoefner 
>> wrote:
>>
>>> +1, thanks for working on this Brian.
>>>
>>> On Tue, Jan 30, 2024 at 12:02 AM Ryan Blue  wrote:
>>>
>>>> It looks like we have lazy consensus, so we'll go ahead with the
>>>> switch-over so we don't need to go through the old process for the 1.5.0
>>>> release.
>>>>
>>>> Thanks to Brian for pushing this forward, and to everyone that helped
>>>> review and get this change ready! I think it will be a positive step for
>>>> improving our docs!
>>>>
>>>> On Mon, Jan 29, 2024 at 11:13 AM Yufei Gu  wrote:
>>>>
>>>>> +1 Thanks Brian!
>>>>> Yufei
>>>>>
>>>>>
>>>>> On Mon, Jan 29, 2024 at 7:57 AM Fokko Driesprong 
>>>>> wrote:
>>>>>
>>>>>> I did some reviews of the PRs that led up to this, and I think the
>>>>>> new site is much easier to maintain and deploy. +1 from my end :)
>>>>>>
>>>>>> Cheers, Fokko
>>>>>>
>>>>>> On Mon, Jan 29, 2024 at 3:15 PM Jean-Baptiste Onofré <
>>>>>> j...@nanthrax.net> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Fri, Jan 26, 2024 at 11:40 PM Brian Olsen <
>>>>>>> bitsondata...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Hey everyone,
>>>>>>> >
>>>>>>> > As discussed during the community sync, I'd like to get a vote on
>>>>>>> moving forward with the documentation. I have created a PR (
>>>>>>> https://github.com/apache/iceberg/pull/9520) that references the
>>>>>>> changes that have happened up to this point.
>>>>>>> >
>>>>>>> > Simpler contribution by collocating the website and documentation
>>>>>>> in the same repository.
>>>>>>> > We don't want the versioned docs or javadoc files to be tracked in
>>>>>>> the
>>>>>>> > main branch to avoid multiple copies of the docs being indexed in
>>>>>>> GitHub or
>>>>>>> > IDEs.
>>>>>>> > We need a top level (non-versioned) Iceberg website that links
>>>>>>> versioned
>>>>>>> > docs and contains evergreen constructs.
>>>>>>> > The current docs release process is cumbersome and the code lives
>>>>>>> across
>>>>>>> > multiple repositories making it difficult to know where to
>>>>>>> contribute for
>>>>>>> > documentation: https://github.com/apache/iceberg/issues/8151.
>>>>>>> > We wanted there to be an easy way to apply retroactive fixes to
>>>>>>> older doc
>>>>>>> > versions.
>>>>>>> > A simple release process can now be automated once we validate
>>>>>>> things work well manually by starting a workflow and reviewing a PR.
>>>>>>> > Restyle Mkdocs default theme to look like the existing Iceberg
>>>>>>> theme.
>>>>>>> > Fix broken links (there were a lot).
>>>>>>> >
>>>>>>> > It would be great to get a quick vote on moving forward with this
>>>>>>> process. Thanks!
>>>>>>> >
>>>>>>> > - Bits
>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>

Re: [PROPOSAL] Create user mailing list ?

2024-02-02 Thread Brian Olsen

I agree with Fokko. I believe the added value would be if the user mailing
list becomes an extension of the Slack channel that archives our
discussions there and makes them searchable to Google, ideally in a decent
format, though I am nervous about how stack traces and other types of media
might lose their structure given the limitations of the mailing server.

I’m reserving my vote until we have some new info from JB, who is busy with
the JDBC view support, so I would suggest we hold off until then.

Whether or not we have the Slack stuff, I think we should constantly
evaluate the state of support and whether adding or removing channels makes
sense.

On Fri, Feb 2, 2024 at 4:09 AM Fokko Driesprong  wrote:

> ±0 for having a user mailing list.
>
> I don't believe that having more channels will lead to better support. I
> agree that the archiving capabilities of Slack are limited, and the search
> is sub-optimal. But we should also make sure that the questions asked are
> also integrated into the documentation. The new website will also have
> search capabilities, which will also make the content easier to find.
>
> I'm not against it if people feel like there is added value in it, just
> want to make sure that we as a community make sure that all the channels
> are being monitored if we go down that route. And how the Slackbot would
> synchronize the two channels.
>
> Kind regards,
> Fokko
>
> On Tue, Jan 30, 2024 at 11:55 PM Jack Ye wrote:
>
>> +1 for having a user mailing list.
>>
>> Do we envision the slack bot to be used for people in slack to
>> participate in user list conversations, or the other way around, or both?
>>
>> Allowing people in slack to participate in user list conversations seems
>> pretty achievable. Allowing people in the user list to participate in slack
>> conversations means to forward all conversations in slack to the user list
>> and might be quite noisy. But the advantage is that people would be able to
>> search for those Slack questions and answers on the internet. Maybe we can
>> consider a separated "slack" mailing list for that purpose.
>>
>> Best,
>> Jack Ye
>>
>> On Tue, Jan 30, 2024 at 8:16 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> AFAIR, some ASF projects are using slackbot to receive users requests
>>> from the mailing list and can send messages to the mailing list.
>>>
>>> Let me do a quick research and get back to you.
>>>
>>> Regards
>>> JB
>>>
>>> On Tue, Jan 30, 2024 at 3:14 PM Brian Olsen 
>>> wrote:
>>> >
>>> > I do like the idea of making the Slack threads available through the
>>> mailing list. Is there a slack bot you have in mind? How would the threads
>>> appear in the mailing list?
>>> >
>>> > On Tue, Jan 30, 2024 at 7:13 AM Jean-Baptiste Onofré 
>>> wrote:
>>> >>
>>> >> Hi guys,
>>> >>
>>> >> If we have a few user questions on the dev mailing list, we have quite
>>> >> a number on Slack.
>>> >> It's completely fine but not easy to search the questions and find the
>>> >> concrete answer.
>>> >>
>>> >> As most other Apache projects do, I propose to create a user mailing
>>> >> list to invite people to ask questions and request help.
>>> >> This mailing list can be browsed and searched on
>>> >> https://lists.apache.org/ and can be moderated.
>>> >> We can use slackbot to create a "bridge" between slack and the user
>>> >> mailing list.
>>> >>
>>> >> Thoughts ?
>>> >>
>>> >> Regards
>>> >> JB
>>>
>>

Re: nanosecond support - renewed effort

2024-02-02 Thread Brian Olsen

Nothing important to add, just wanted to say thanks for your and Eric’s
efforts here! I think this is going to be critical for greater Iceberg
adoption in the finance sector.

On Fri, Feb 2, 2024 at 1:48 PM Jacob Marble 
wrote:

> Good morning,
>
> Here at InfluxData, we've dusted off the nanosecond timestamp PR.
> Thanks to Eric Gillespie for joining our team to help on this effort!
>
>
> --
> Jacob Marble
> 🇺🇸 🇺🇦
>

New Iceberg docs site is up we need your help

2024-02-04 Thread Brian Olsen

Hey all,

I'm happy to announce that the new Apache Iceberg site is officially
deployed to https://iceberg.apache.org!

Thanks to all who participated in design discussions and reviews along the
way. I will be fixing up the automation and documentation and provide
instructions on how to update, validate, and deploy the documentation.

A few things to note: there are likely a few links that were changed or
updated from before. Please report any of these in the Iceberg issues:
https://github.com/apache/iceberg/issues. Thanks to Kevin Liu for already
reporting the Catalog page issues:
https://github.com/apache/iceberg/pull/9642. Please keep a lookout for
these issues and feel free to report any styling and CSS issues on the new
site here: https://github.com/apache/iceberg/issues/9643

Please note that we will no longer be submitting to the
https://github.com/apache/iceberg-docs repository. I will submit a ticket
to ASF INFRA to archive the repository and mark it as read-only. There are
very few pull requests there, which should be relatively simple to migrate
over, and I will also be following up with those individuals.

Thank you all for your help and patience as we roll this out!

-Bits

Re: New Iceberg docs site is up we need your help

2024-02-04 Thread Brian Olsen

Oh and another thing to mention, we only have versions since 1.3.0. This
will get resolved within a month or two. If anyone is interested in
applying their git wizardry to rebasing the previous docs while
simultaneously applying the new fixes please let me know! ;)

On Sun, Feb 4, 2024 at 1:44 PM Brian Olsen  wrote:

> Hey all,
>
> I'm happy to announce that the new Apache Iceberg site is officially
> deployed to https://iceberg.apache.org!
>
> Thanks to all who participated in design discussions and reviews along the
> way. I will be fixing up the automation and documentation and provide
> instructions on how to update, validate, and deploy the documentation.
>
> Few things to note, there are likely a few links that were changed or
> updated from before. Please report any of these in The Iceberg Issues:
> https://github.com/apache/iceberg/issues. Thanks to Kevin Liu for already
> reporting the Catalog page issues:
> https://github.com/apache/iceberg/pull/9642. Please keep a lookout for
> these issues and feel free to report any styling and CSS issues on the new
> site here: https://github.com/apache/iceberg/issues/9643
>
> Please note that we will no longer be submitting to the
> https://github.com/apache/iceberg-docs repository. I will submit a ticket
> to ASF INFRA to archive the repository and mark as read-only. There are
> very few pull requests that should be relatively simple to migrate over and
> I will also be following up with those individuals.
>
> Thank you all for your help and patience as we roll this out!
>
> -Bits
>
>

Meeting minutes 2024-02-14

2024-02-15 Thread Brian Olsen

Hey Iceberg Nation,

Here are the minutes from yesterday's meeting. Let's deprecate some
catalogs!

-Bits

Transcription/Recording: https://www.youtube.com/watch?v=uAQVGd5zV4I

* Highlights
  * Spark: [Added locality executor cache for delete files](https://github.com/apache/iceberg/pull/9563) (Thanks, Anton!)
  * Docs: Moved to new docs build (Thanks, Bits!)
  * REST: Added more documentation on sharing credentials (Thanks, Dan!)
  * Spark: Added support for delete manifest rewrites (Thanks, Anton!)
  * Spark: Fixed aggregate pushdown for nested fields (Thanks, Amogh!)
* Releases
  * Iceberg 1.5.0
  * PyIceberg 0.6.0
  * iceberg-rust 0.2.0
* Discussion
  * Iceberg Summit
  * 30 second update: [Nanoseconds feature](https://github.com/apache/iceberg/issues/8657) will be active again soon
  * Draft for [Materialized View Spec](https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing)
  * Supported catalogs
  * Catalog confusion
  * JDBC/Sql upgrades
  * Hive table locking in PyIceberg
  * Flink table maintenance - planning to start working on this feature, any input is welcome
  * Quick compaction for small files and manifest files
  * Periodic orphan file removal
  * Periodic snapshot expiration
  * Pagination and table planning API support in REST spec
  * [https://github.com/apache/iceberg/pull/9660](https://github.com/apache/iceberg/pull/9660): need consensus of pageToken empty value
  * [https://github.com/apache/iceberg/pull/9695](https://github.com/apache/iceberg/pull/9695) need consensus around serializing partition tuple values (more details in [https://github.com/apache/iceberg/pull/9717](https://github.com/apache/iceberg/pull/9717))
  * Discuss supporting policy decisions in REST spec
  * [Issue-9597](https://github.com/apache/iceberg/issues/9597): spec change. Add a new `task-type` field to the file scan task JSON serialization.
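
For anyone who hasn't followed the pagination item, here is a rough sketch of
what token-based paging looks like from the client side. Everything in it is
an assumption for illustration: the field names `identifiers` and
`next_page_token`, the fake in-memory catalog, and especially the stopping
rule. How an empty `pageToken` should be treated is exactly what the PR still
needs consensus on, so this sketch sidesteps it by treating a missing token as
the last page.

```python
from typing import Callable, Dict, Iterator, List, Optional

# Hypothetical response shape: {"identifiers": [...], "next_page_token": str or None}
Page = Dict[str, object]


def list_all(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[str]:
    """Yield every identifier by following page tokens until none is returned."""
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        yield from page["identifiers"]
        token = page.get("next_page_token")
        # Assumption: a missing (None) token marks the last page.
        if token is None:
            break


def make_fake_catalog(tables: List[str], page_size: int) -> Callable[[Optional[str]], Page]:
    """In-memory stand-in for a paginated list endpoint (not a real Iceberg API)."""

    def fetch_page(token: Optional[str]) -> Page:
        # The token is just the next offset encoded as a string.
        start = int(token) if token is not None else 0
        end = start + page_size
        next_token = str(end) if end < len(tables) else None
        return {"identifiers": tables[start:end], "next_page_token": next_token}

    return fetch_page


if __name__ == "__main__":
    fetch = make_fake_catalog(["db.a", "db.b", "db.c", "db.d", "db.e"], page_size=2)
    print(list(list_all(fetch)))
```

The client never interprets the token; it only hands back whatever the server
returned, which is why the meaning of an empty value has to be pinned down in
the spec rather than left to each implementation.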

Re: Meeting minutes 2024-02-14

2024-02-15 Thread Brian Olsen

Apologies, I said this in jest, please refer to the notes and sync
discussion for more details.

On Thu, Feb 15, 2024 at 7:55 AM Jean-Baptiste Onofré 
wrote:

> Hi Brian,
>
> Thanks for sharing!
>
> Imho, "deprecating some catalogs" is a bit premature and scary as we
> are talking only about DynamodbCatalog. For our users, I think it
> would be better to wait for formal discussion on the mailing list (and
> eventually vote) before announcing this kind of message.
> Thanks :)
>
> Regards
> JB
>
> On Thu, Feb 15, 2024 at 2:36 PM Brian Olsen 
> wrote:
> >
> > Hey Iceberg Nation,
> >
> > Here are the minutes from yesterday's meeting. Let's deprecate some
> catalogs!
> >
> > -Bits
> >
> > Transcription/Recording: https://www.youtube.com/watch?v=uAQVGd5zV4I
> >
> > * Highlights
> > * Spark: [Added locality executor cache for delete files](
> https://github.com/apache/iceberg/pull/9563) (Thanks, Anton!)
> > * Docs: Moved to new docs build (Thanks, Bits!)
> > * REST: Added more documentation on sharing credentials (Thanks,
> Dan!)
> > * Spark: Added support for delete manifest rewrites (Thanks, Anton!)
> > * Spark: Fixed aggregate pushdown for nested fields (Thanks, Amogh!)
> > * Releases
> > * Iceberg 1.5.0
> > * PyIceberg 0.6.0
> > * iceberg-rust 0.2.0
> > * Discussion
> > * Iceberg Summit
> > * 30 second update: [Nanoseconds feature](
> https://github.com/apache/iceberg/issues/8657) will be active again soon
> > * Draft for [Materialized View Spec](
> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing
> )
> > * Supported catalogs
> > * Catalog confusion
> > * JDBC/Sql upgrades
> > * Hive table locking in PyIceberg
> > * Flink table maintenance - planning to start working on this
> feature, any input is welcome
> > * Quick compaction for small files and manifest files
> > * Periodic orphan file removal
> > * Periodic snapshot expiration
> > * Pagination and table planning API support in REST spec
> > * [
> https://github.com/apache/iceberg/pull/9660](https://github.com/apache/iceberg/pull/9660):
> need consensus of pageToken empty value
> > * [
> https://github.com/apache/iceberg/pull/9695](https://github.com/apache/iceberg/pull/9695)
> need consensus around serializing partition tuple values (more details in [
> https://github.com/apache/iceberg/pull/9717](https://github.com/apache/iceberg/pull/9717)
> )
> > * Discuss supporting policy decisions in REST spec
> > * [Issue-9597](https://github.com/apache/iceberg/issues/9597): spec
> change. Add a new `task-type` field to the file scan task JSON
> serialization.
>

Re: Support permission concepts in REST spec

2024-02-27 Thread Brian Olsen

This may potentially be another thread, but I want to see if we can avoid
excess work/design discussions by utilizing an open policy engine REST API
(https://www.openpolicyagent.org/docs/latest/rest-api/) that’s already
defined and used in Trino now
(https://trino.io/docs/current/security/opa-access-control.html).

I know the API may be overkill for simple permissions concepts, but this
could deflect the need for Iceberg to own and manage any of the security
primitives, as I’ve seen it mentioned that we don’t want to have too much
focus on these concepts.

To me, OPA seems like a modern Ranger, but for all apps. What do you all think?
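
To give a feel for how small the OPA call is, here is a rough sketch of a
query against its Data API, which takes a POST to `/v1/data/<policy path>`
with an `input` document and answers with a `result`. The policy path
`iceberg/authz/allow`, the input fields, and the helper names are invented for
this example; nothing here is an existing Iceberg or Trino integration, and
the request is only built, not sent.

```python
import json
from urllib.request import Request


def build_opa_query(base_url: str, policy_path: str, input_doc: dict) -> Request:
    """Build (but do not send) a POST to OPA's Data API: /v1/data/<policy path>."""
    url = f"{base_url.rstrip('/')}/v1/data/{policy_path.strip('/')}"
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_allowed(opa_response: dict) -> bool:
    """OPA puts the rule value under "result"; an undefined rule omits the key."""
    return opa_response.get("result") is True


if __name__ == "__main__":
    # Hypothetical policy path and input document, for illustration only.
    request = build_opa_query(
        "http://localhost:8181",
        "iceberg/authz/allow",
        {"user": "alice", "action": "select", "table": "db.table1"},
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Treating a missing `result` as a denial keeps the check fail-closed when the
policy doesn't define the rule being asked about.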

On Mon, Feb 26, 2024 at 1:34 PM Jack Ye  wrote:

> Thank you Ryan for the detailed suggestions!
>
> So far, it sounds like there are in general 2 types of policy decisions:
> 1. ones that would fail an execution if not satisfied, e.g. check
> constraints, protected column, read/write access to storage, etc.
> 2. ones that would amend an execution plan, e.g. column and row filters,
> dynamic column masking, etc.
>
> For the second type, there is another potential alternative direction I
> found some systems are using. Let me also put it here, curious what people
> think.
>
> Many of these decisions can be translated together to some sort of view on
> top of a table. Consider user A has permission on table1, column c1 c2,
> sha1 hash mask on email column, row filter age > 21. This can be translated
> into a decision that user A can access a view *SELECT c1, c2, sha1(email)
> FROM table1 WHERE age > 21*.
>
> Given that we already have an Iceberg view spec, the catalog can
> potentially dynamically render such a multi-dialect view, so that table1
> becomes a view "*SELECT c1, c2, sha1(email) FROM temp_table_12345 WHERE
> age > 21*" where *temp_table_12345* becomes the actual underlying table
> for enforcing type 1 decisions. (temp table is just one way to implement it
> as an example here, more design consideration is needed)
>
> This approach seems to be more flexible in the sense that catalogs can
> develop many different styles of policy without the need for Iceberg to
> standardize on something like expression semantics, since the Iceberg view
> is now a standard for expressing the decision.
>
> Any thoughts?
>
> -Jack
>
>
>
>
> On Sun, Feb 25, 2024 at 11:58 AM Ryan Blue  wrote:
>
>> I think this is a good idea, but is definitely an area where we need to
>> be clear about how it would work for people to build with it successfully.
>>
>> All it takes is one engine to ignore these as the security provided is no
>> longer applicable.
>>
>> You’re right that security depends on knowing that the client is going to
>> enforce the requirements sent by the catalog. That just means that the
>> catalog either needs to deny access (401/403 response) or have some
>> pre-established trust in the identity that is loading a table (or view).
>>
>> The current authentication mechanisms that we’ve documented have ways to
>> do this. For example, if you’re using a token scheme you can put additional
>> claims in the auth token when the client is trusted to enforce fine-grained
>> access. To establish trust, you can either manually create a token for a
>> compute service with the trust selected or we could add another OAuth2
>> scope to request it when connecting compute engines to catalogs. Either
>> way, we already have mechanisms to establish trust relationships between
>> engines and catalogs so this would just be an additional capability.
>>
>> I worry a little bit about putting security features into the REST API
>> that require the execution engine and catalog to agree on semantics and
>> execution.
>>
>> I agree in the general case, but I think there are narrow cases where we
>> are already handling this problem and solving those is incredibly useful. I
>> think a critical design constraint is that this extension should be used to
>> pass requirements — the result of policy decisions — and NOT be used to
>> pass policy itself. (And, I would change the proposed policy field in
>> REST responses to requirements or similar to make this clear.)
>>
>> Policy is complicated and it is modelled and enforced differently across
>> products. Databases all have their own rules. For instance, in some schemes
>> database SELECT cascades to table SELECT, while others check only the table
>> resource for SELECT permission. I think we clearly don’t want to try to
>> normalize or force a standard on this space. Instead, we want catalogs and
>> access control systems to have the model that they choose. The REST
>> protocol should communicate the decisions made by those schemes.
>>
>> That significantly narrows the scope of this feature. Starting with
>> fields that can or can’t be read and filters that must be applied is a
>> great start that covers a large number of use cases. And we already have
>> clear semantics for Iceberg filters and for column projection. We would
>> still need to specify additional

Re: Gravitino an Iceberg REST catalog service

2024-03-01 Thread Brian Olsen

My attempt to consolidate a list of goals, anti-patterns, and implementation
details mentioned since this discussion was brought up at the last Iceberg
sync. I tried to roughly capture who mentioned these things so we can follow
up if needed. Hopefully this can serve as a basis for the design discussion.

Goals:

- Remove the initial burden of choosing a REST implementation from new
users getting started with Iceberg (Russell S)
- Cut down on the supported catalogs that are no longer in use (e.g.
DynamoDB) or were never intended for production (e.g. Hadoop) to minimize
maintenance, lower variability, and lower the burden of choice on Iceberg
users. (Blue)
- Simplify plugging in your own catalog so the Iceberg project isn’t
responsible for maintaining and testing a bunch of dialects. (Blue)
- Aim for a REST catalog centric future and continue to remove Iceberg
support where it makes sense. (Russell/Jack Ye/Blue)
- Use this as a test dependency for the Iceberg project (Jack/Russell)
- Make this an MVP production grade catalog, assuming that whatever we do
put out there will end up being used in production anyway. (Blue/Dan)
- Keep the responsibilities of the REST implementation as light as possible.
(Blue)
- Support the HTTP(S) protocol; the service will act as a load balancer +
proxy to the JDBC backend. (Blue)
- Container image + k8s installation (Blue)
- Use for Iceberg education and evaluation (Bits)
- Use as a blueprint for designing your own implementation (JB)

Anti-patterns:

- Avoid becoming the Hive Metastore project, where we support every use
case.
- Don’t support data governance cases like lineage. (Dan)
- Don’t support metrics reporting. (Blue/Dan)
- Don’t support security. (Blue)
- Don’t support a wide range of protocols outside of HTTP(S) (Dan)
- In general, avoid spending time integrating with whatever runtime a given
company uses that removes focus from the core project goals and spec. (Dan)
- Don’t be overly opinionated with tool choices. (Dan)

Implementation ideas:

- apache/iceberg-catalog repository, with all of the catalog impls moved
and maintained there as well. (Blue/Dan/Jack/JB/Russell)
- A catalog implementation per JDBC backend. (Blue)
- Servlet like Tomcat or Spring to run / package the service. (Blue)

On Fri, Mar 1, 2024 at 2:54 AM Jean-Baptiste Onofré  wrote:

> Hi Renjie,
>
> maybe I wasn't clear, sorry about that: the target is really both ref
> impl (where we can test different Iceberg parts like we do with the
> InMemoryCatalog, JdbcCatalog, etc) and ready to go service for users
> (simple but to start with).
>
> But we can't prevent the community from working on a production grade
> catalog. The point is: if it's not in Iceberg, then it gonna be
> elsewhere (another ASF project, vendor project, whatever). This is OK
> as soon as we have a reference implementation in Iceberg. That's the
> min we should guarantee imho.
> For instance, for the JAXRS spec, the ref implementation is CXF-RS,
> but there are other implementation. The same for OSGi Blueprint, the
> ref implementation is in Apache Aries (aries-blueprint).
>
> My proposal is really a simple ref imp in Iceberg (submodule or
> separate repo, both are OK for me even if I have a preference for
> separate repo to keep things clean and different lifecycle as we do
> for iceberg-rust or iceberg-python),
>
> That said, I don't see why we could not have iceberg-catalog repo with
> a ref impl that evolves to something production ready. Observability,
> scaling, pluggable backend, etc can be implemented there and it would
> be a great addition for Iceberg with new contributors from the
> community I'm sure. Separated repo would make this doable imho,
> Iceberg still focus on spec.
>
> Regards
> JB
>
> On Fri, Mar 1, 2024 at 9:24 AM Renjie Liu  wrote:
> >
> > Hi:
> >
> > I think one thing missing in the discussion is that, if the iceberg
> community wants to maintain a rest catalog service, what's the target use
> case? Different target use cases may lead to different directions.
> >
> > If it's mainly designed for first time users to play or experience with
> rest catalog, then maybe we just need a submodule in java repo or a
> test-jar would be enough.
> >
> > If it's targeted toward production usage, things get complicated. There
> are too many things to think about, such as using different storage
> backend, monitoring, ha, scalability etc. What's more, in an enterprise
> iceberg rest catalog usually is only part of a data platform, there are
> many other things involved. In this case, I'm skeptical about the actual
> value of a rest catalog server, and a spec or a library would be more
> valuable.
> >
> > On Fri, Mar 1, 2024 at 3:49 PM Jean-Baptiste Onofré 
> wrote:
> >>
> >> Hi Fokko
> >>
> >> If service means the actual runtime service, I partially agree.
> >>
> >> I would love to see REST Catalog API the "central cornerstone" used in
> >> iceberg-java, pyiceberg, etc. So I think we should provide the
> >> resources for an user to bootst

Re: New committer: Bryan Keller

2024-03-05 Thread Brian Olsen

Hip hip 🥳 hooray 🎉!!

Congrats Bryan! Thanks for all your contributions past and future!

On Tue, Mar 5, 2024 at 6:51 AM Zheng Hu  wrote:

> Congrats, Bryan!
>
> On Tue, Mar 5, 2024 at 6:38 PM Renjie Liu  wrote:
>
>> Congratulations Bryan.
>>
>> On Tue, Mar 5, 2024 at 17:59 Honah J.  wrote:
>>
>>> Congratulations Bryan!
>>>
>>> On Tue, Mar 5, 2024 at 1:40 AM Ajantha Bhat 
>>> wrote:
>>>
>>>> Congratulations Bryan.
>>>>
>>>> On Tue, Mar 5, 2024 at 2:50 PM Eduard Tudenhoefner
>>>> wrote:
>>>>
>>>>> Congrats Bryan, very well deserved!
>>>>>
>>>>> On Tue, Mar 5, 2024 at 9:37 AM Hussein Awala  wrote:
>>>>>
>>>>>> Congrats Bryan!
>>>>>>
>>>>>> On Tue, Mar 5, 2024 at 9:20 AM Fokko Driesprong
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> The Project Management Committee (PMC) for Apache Iceberg has
>>>>>>> invited Bryan Keller to become a committer and we are pleased to
>>>>>>> announce
>>>>>>> that he has accepted.
>>>>>>>
>>>>>>> Bryan was contributing to Iceberg before it was even open-source,
>>>>>>> did a lot of work on the topic of metadata generation, and is now
>>>>>>> leading
>>>>>>> the effort of migrating the Kafka Connect integration into OSS Iceberg.
>>>>>>>
>>>>>>> Being a committer enables easier contribution to the project since
>>>>>>> there is no need to go via the patch submission process. This should
>>>>>>> enable
>>>>>>> better productivity. A PMC member helps manage and guide the direction
>>>>>>> of
>>>>>>> the project.
>>>>>>>
>>>>>>> Please join me in congratulating Bryan.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Fokko
>>>>>>>
>>>>>>

Meeting Minutes 2024-03-06

2024-03-06 Thread Brian Olsen

Hey Iceberg Nation,

Here are the meeting minutes from today's meeting.

In today's sync, we welcome two committers, Bryan Keller and Renjie Liu!
Bryan has contributed a great deal of work to the project (
https://github.com/apache/iceberg/commits/main/?author=bryanck) over the
years, including his most recent contribution, the official Kafka Connect
integration (https://github.com/apache/iceberg/commits/main/kafka-connect).
Renjie has been the leading contributor to the iceberg-rust implementation (
https://github.com/apache/iceberg-rust/commits/main/?author=liurenjie1024).
Congratulations to both of you!

The latest 1.5.0 release candidate is being voted on right now, and we dive
into some of the changes to expect and call out the contributions from
various individuals.

Iceberg Summit is on the horizon and the CFP is now open! Please consider
submitting a talk (https://sessionize.com/iceberg-summit-2024/).

The second half of the sync covers a thorough discussion of the heavily
debated implementation details of materialized views. It aligned a lot of
the discussion points raised on the mailing list, but didn't come to a final
consensus. Jack Ye will own providing a summary and facilitating further
discussions moving forward to ensure we're discussing the same concepts and
converging the different viewpoints.

Transcription/Recording: https://youtu.be/d4dEgAa1vKk

* Highlights
* New committer, Bryan Keller! ([Congrats!](
https://lists.apache.org/thread/361wozk0rpos8tmgfp2t17ygskm83m87))
* New committer, Renjie Liu!
* Virtual Iceberg Summit 2024. ([Announcement](
https://lists.apache.org/thread/9w47vqzfz6byzjpx90nhvrg366c58y1m), [CFP](
https://sessionize.com/iceberg-summit-2024/)) (Thanks, JB!)
* Added DataFile/DeleteFiles to REST spec (Thanks, Drew!)
* Added pagination to the REST spec (Thanks, Rahil!)
* Added view support to JDBC catalog (Thanks, JB!)
* Added EncryptingFileIO (Thanks, Gidon!)
* Fixed snapshot log with REPLACE TABLE (Thanks, Eduard!)
* Releases
* Iceberg 1.5.0
* Voting on the next release candidate for 1.5.0 ([Vote thread](
https://lists.apache.org/thread/syp2hwp53rhromt4711w709dfq4cmvcb))
* SHOW TABLES behavior
* InMemoryCatalog list behavior
* Capabilities in REST spec
* Discussion
* Materialized views discussion

Re: New committer: Renjie Liu

2024-03-10 Thread Brian Olsen

Renjie,

I’ve already enjoyed all of our interactions. All I’ve heard in my first
year of heavily interacting with the data community is people asking about
Rust support. I’m looking forward to seeing Iceberg Rust take Iceberg
adoption to the top! Well deserved!

On Sun, Mar 10, 2024 at 7:40 PM Renjie Liu  wrote:

> Thanks, everyone!
>
> On Mon, Mar 11, 2024 at 12:45 AM Jan Kaul 
> wrote:
>
>> Congrats!
>>
>> On 09.03.2024 at 22:38, Micah Kornfield wrote:
>>
>> Congrats
>>
>> On Saturday, March 9, 2024, Hussein Awala  wrote:
>>
>> Congrats Renjie!
>>
>> On Sat, Mar 9, 2024 at 8:55 PM Yufei Gu  wrote:
>>
>> Congratulations and thanks for the great work in rust iceberg, Renjie!
>>
>> Yufei
>>
>>
>> On Sat, Mar 9, 2024 at 11:39 AM Steven Wu  wrote:
>>
>> Congrats, Renjie!
>>
>> On Sat, Mar 9, 2024 at 7:18 AM himadri pal  wrote:
>>
>> Congratulations Renjie.
>>
>> Regards,
>> Himadri Pal
>>
>>
>> On Fri, Mar 8, 2024 at 11:56 PM Fokko Driesprong 
>> wrote:
>>
>> Hi everyone,
>>
>> The Project Management Committee (PMC) for Apache Iceberg has invited
>> Renjie Liu to become a committer and we are pleased to announce that he has
>> accepted. We're very excited to have Renjie as a committer as he's leading
>> the effort of bringing Iceberg to the Rust world.
>>
>> Being a committer enables easier contribution to the project since there
>> is no need to go via the patch submission process. This should enable
>> better productivity. A PMC member helps manage and guide the direction of
>> the project.
>>
>> Please join me in congratulating Renjie.
>>
>> Cheers,
>> Fokko
>>
>>
>>

Re: [PROPOSAL] Improvement on our PR flows

2024-03-26 Thread Brian Olsen

The drawback to the laissez-faire approach is that it doesn’t necessarily
incentivize people to take action either, and you end up with the same
people generally scrambling to manage the PRs.

What about a system that does a basic round-robin or random bot assignment
of someone from a list to the PR? That person’s job is not to review it
(though they can if they want to, of course); it’s their responsibility to
coordinate with the community to get it across the line or get consensus on
what next action needs to happen.

This is a big part of what developer advocates do on the Trino side and it
works well, but it does centralize a lot of the orchestration of these PRs
in a few people. I had always wanted to set up a system like this eventually
to distribute the load and responsibility and offer another low-barrier
entry point to participate in the community. That list could be an opt-in or
opt-out list for anyone (maybe a “reviewer” mailing list).
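
To make the round-robin idea a bit more concrete, here is a minimal sketch
in Python. It is only an illustration: the function name, the PR numbers,
and the handles are all hypothetical, and it only computes who would be
assigned. Hooking it up to GitHub (or to a mailing list of volunteers) is
left out entirely.

```python
from itertools import cycle
from typing import Dict, Iterable, List


def assign_coordinators(pr_numbers: Iterable[int], coordinators: List[str]) -> Dict[int, str]:
    """Assign each PR a coordinator by walking the opt-in list in a loop."""
    if not coordinators:
        raise ValueError("the opt-in list is empty")
    rotation = cycle(coordinators)  # restarts from the first name when exhausted
    return {pr: next(rotation) for pr in pr_numbers}


if __name__ == "__main__":
    # Hypothetical PR numbers and handles, for illustration only.
    assignments = assign_coordinators([101, 102, 103, 104], ["alice", "bob", "carol"])
    for pr, who in assignments.items():
        print(f"PR #{pr} -> {who}")
```

Swapping the rotation for a random pick would give the "random bot
assignment" variant; round robin just keeps the load evenly spread.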

I have other ideas on how the list gets pruned as well, but I'm just going
to put this out there and see if there’s any interest in it.

On Tue, Mar 26, 2024 at 11:03 AM Ryan Blue  wrote:

> Sorry, I'm a strong -1 for having owners or standard reviewers.
>
> In this community, we've always taken the stance that anyone should be
> able to jump in and help. Having assigned owners may seem like a good idea,
> but it actually prevents other people from volunteering and getting
> involved. This is also why we don't assign issues to individuals -- they
> often don't end up submitting a PR and it prevents other people from
> contributing. Having an assigned owner gives the impression that the
> responsibility is on a particular individual, making other people that are
> capable of reviewing not pay attention. I think this will slow down the
> community and I don't think it is a good idea.
>
> Ryan
>
> On Tue, Mar 26, 2024 at 6:36 AM Ajantha Bhat 
> wrote:
>
>> +1 for having multiple PR review owners per module/label.
>>
>> Having module owners can accelerate PR processing. For instance, I'm
>> awaiting feedback on a Spark action for computing partition stats (
>> https://github.com/apache/iceberg/pull/9437). Currently, only Anton is
>> reviewing, which may cause delays if he's occupied. In my opinion, having
>> multiple module owners would enable developers to seek feedback more
>> efficiently.
>>
>> - Ajantha
>>
>> On Thu, Mar 21, 2024 at 11:11 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi folks
>>>
>>> Now that we have the proposal process "merged", I will create the PR
>>> about reviewers and update stale job.
>>>
>>> I should have the PR tomorrow for review.
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On Thu, Mar 21, 2024 at 9:55 AM Jean-Baptiste Onofré 
>>> wrote:
>>> >
>>> > Hi Dan
>>> >
>>> > Yes, I saw you merged it, that's great.
>>> >
>>> > I will move forward on the "stale bot" stuff.
>>> >
>>> > Thanks !
>>> > Regards
>>> > JB
>>> >
>>> > On Wed, Mar 20, 2024 at 8:48 PM Daniel Weeks 
>>> wrote:
>>> > >
>>> > > Hey JB, apologies for combining these two things in the same thread,
>>> but we got enough eyes on the first PR and I went ahead and merged it.
>>> > >
>>> > > If you want to put together the PR for your proposed changes, we can
>>> get looking at that.
>>> > >
>>> > > We'll also need to backfill the existing proposals and update the
>>> website to have a link to the label. (Will work with you and Bits on that)
>>> > >
>>> > > Thanks,
>>> > > -Dan
>>> > >
>>> > >
>>> > >
>>> > > On Wed, Mar 20, 2024 at 10:01 AM Jean-Baptiste Onofré <
>>> j...@nanthrax.net> wrote:
>>> > >>
>>> > >> Hi Fokko
>>> > >>
>>> > >> I think combining Dan's proposal about "proposal process" and this
>>> > >> proposal about "PR flows" would be helpful for the project (to track
>>> > >> the proposals and avoid "stale" PRs/proposals).
>>> > >>
>>> > >> If PMC members are OK, I'm ready to help to set this up :)
>>> > >>
>>> > >> Thanks
>>> > >> Regards
>>> > >> JB
>>> > >>
>>> > >> On Wed, Mar 20, 2024 at 12:27 PM Fokko Driesprong 
>>> wrote:
>>> > >> >
>>> > >> > Hey everyone,
>>> > >> >
>>> > >> > This is a gentle bump from my end on this thread since I like the
>>> idea. Several people have already approved Dan's PR about formalizing the
>>> proposal process. Are there any questions or concerns from the PMC before
>>> adopting this?
>>> > >> >
>>> > >> > Kind regards,
>>> > >> > Fokko Driesprong
>>> > >> >
>>> > >> > On Wed, Mar 13, 2024 at 1:17 PM Renjie Liu <
>>> liurenjie2...@gmail.com> wrote:
>>> > >> >>
>>> > >> >> Hi, JB:
>>> > >> >>
>>> > >> >> Your proposal looks great to me. We should definitely have a
>>> vote for a proposal impacting the spec, and the model is great.
>>> > >> >>
>>> > >> >> On Tue, Mar 12, 2024 at 10:55 PM Jean-Baptiste Onofré <
>>> j...@nanthrax.net> wrote:
>>> > >> >>>
>>> > >> >>> Hi
>>> > >> >>>
>>> > >> >>> I think a vote would be necessary only if we don't have
>>> consensus on a
>>> > >> >>> proposal. If anyone is OK with the proposal (no clear "concern"
>>> in

Meeting Minutes 2024-03-27

2024-03-27 Thread Brian Olsen

Hey Iceberg Nation,

Here are the meeting minutes from today's meeting.

Transcription/Recording: https://youtu.be/9TRhgRq5bFk
Meeting Notes:
https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.686a3a6fup8m

- New Issue Template for Spec and Design Proposals: Dan created a new issue
template and documentation to streamline design docs and spec updates,
improving visibility into discussions and future planning in the Iceberg
community.
- Manifest Encryption Support: Manifest encryption support has been added,
further advancing Iceberg's encryption capabilities, including the ability
to handle encrypted manifests.
- Function Pushdown in Spark: Anton enhanced function pushdown for row-level
commands in Spark, refining the handling of system function pushdown for
commands like merge and update.
- Output Spec Selection in Rewrite Data Files: Himadri introduced a feature
allowing the selection of different output specs in rewrite data files,
enhancing the capability for data management and optimization.

Releases:

- 1.5.0: This version has been successfully released after numerous release
candidates.
- 1.5.1: Potential upcoming release with discussions around schema behaviors
with time travel reads, Trino integration improvements, and an extra HEAD
request issue in S3 file IO.
- 1.6.0: Planning to include an Arrow upgrade for CVE fixes.
- PyIceberg 0.6.1: Focus on compatibility and bug fixes, such as addressing
the creation of version one tables with non-empty partition specs and sort
orders.

Discussion:

- Spec and Design Proposals: The implementation of a new issue template and
tagging system to track and manage proposals and discussions.
- REST Protocol Evolution: Discussion about adding capabilities to the REST
protocol, ensuring client and server compatibility for new features.
- Schema Behavior with Time Travel Reads: Addressing the issue with schema
behavior changes during time travel reads, aiming for consistency across
branches.
- Flink Table Maintenance: A deep dive into Flink table maintenance including
strategies for compaction, orphan file removal, and snapshot expiration.
Discussion on handling streaming jobs, rolling manifest writers, and the
RewriteFileGroupPlanner.
- Catalog Discussion: Deliberation on deprecating the DynamoDB catalog.
- How to address challenges related to the Spark 3.3 and 3.5 merge into
Iceberg, specifically around dynamic partition pruning (DPP) issues.

Re: [PROPOSAL] Improvement on our PR flows

2024-04-04 Thread Brian Olsen

I think you both (JB and Ryan) have valid points.

JB, there absolutely is a need to address the scalability issue and we need
to come up with a solution. I doubt anyone thinks the rising number of
stale issues in the project should be ignored.

Ryan’s concern also has merit from a different angle: the currently proposed
solution could lead to similar outcomes (stale PRs), as ownership, while
establishing and growing responsibility, could lead to fiefdoms.

There’s clearly some support for a solution and so submitting the proposal
as a more tangible PR is a good idea. At that point we can further revise
this solution to account for both concerns. This also would be a good one
to discuss in realtime at a community sync.

Thanks for the edited here!

Bits

On Thu, Apr 4, 2024 at 4:12 AM Jean-Baptiste Onofré  wrote:

> Anyway, I'm preparing a PR to illustrate the proposal.
>
> Regards
> JB
>
> On Thu, Apr 4, 2024 at 10:59 AM Ajantha Bhat 
> wrote:
> >
> > Additionally, I propose allocating a brief 5-10 minute segment during
> each Iceberg community sync.
> > During this time, attendees can highlight any pull requests needing
> attention.
> > In cases where a pull request has become stagnant due to a lack of
> reviews, committers can step forward to offer assistance by conducting
> reviews and aiding in its resolution.
> >
> > - Ajantha
> >
> > On Wed, Mar 27, 2024 at 12:06 PM Jean-Baptiste Onofré 
> wrote:
> >>
> >> By the way, I worked on a Python program that generate a report
> containing:
> >> - GitHub Issues
> >>   - Created since more than 6 months
> >>   - Without assignee
> >>   - Without activity (comment) since more than 7 days
> >> - GitHub PRs
> >>   - Created since more than 6 months
> >>   - Without reviewer
> >>   - With a single reviewer
> >>   - Without activity (comment, etc) since more than 7 days
> >>
> >> The report is a HTML page. I will send it on this thread today or
> >> tomorrow for review.
> >>
> >> For now, I only generate the HTML (locally on the machine), but it
> >> would be possible to publish on website or automatically (cron) send
> >> on the dev mailing list.
> >>
> >> Regards
> >> JB
> >>
> >> On Wed, Mar 27, 2024 at 6:57 AM Jean-Baptiste Onofré 
> wrote:
> >> >
> >> > Adding a group of people as reviewers doesn't block others from help
> >> > and review (and it doesn't change what we do now). I don't see how
> >> > it's different to today, just having default people reviewing, adding
> >> > new people.
> >> > Actually, we clearly have today a bunch of PRs stale just due to lack
> >> > of reviewers. From a community standard, I'm also concerned that a lot
> >> > of PR is waiting for review from the same people: that is a concern
> >> > for community engagement. If we have 3 persons that should
> >> > review/approve 90% of the PRs, it doesn't scale, it doesn't engage the
> >> > community, other committers/PMC members might be feeling "untrusted".
> >> >
> >> > So the idea is actually to grow the community: the group of reviewers
> >> > can invite other people to review (having default reviewers on some
> >> > modules doesn't block adding others). We have several examples of
> >> > Apache projects where it works fine (Apache Beam is an example, we
> >> > increased the community engagement thanks to feedback from reviewer
> >> > pretty quickly instead of stale for a while and contributors give up
> >> > due to no response).
> >> >
> >> > Anyway, I propose to update my proposal this way:
> >> > 1. I update the stale PR periodical reminder (every week)
> >> > 2. I don't add reviewers yml, but if a PR doesn't have reviewer after
> >> > a week, I send a report on the dev mailing list listing all stale and
> >> > no review started PRs)
> >> >
> >> > Regards
> >> > JB
> >> >
> >> > On Tue, Mar 26, 2024 at 5:03 PM Ryan Blue  wrote:
> >> > >
> >> > > Sorry, I'm a strong -1 for having owners or standard reviewers.
> >> > >
> >> > > In this community, we've always taken the stance that anyone should
> be able to jump in and help. Having assigned owners may seem like a good
> idea, but it actually prevents other people from volunteering and getting
> involved. This is also why we don't assign issues to individuals -- they
> often don't end up submitting a PR and it prevents other people from
> contributing. Having an assigned owner gives the impression that the
> responsibility is on a particular individual, making other people that are
> capable of reviewing not pay attention. I think this will slow down the
> community and I don't think it is a good idea.
> >> > >
> >> > > Ryan
> >> > >
> >> > > On Tue, Mar 26, 2024 at 6:36 AM Ajantha Bhat 
> wrote:
> >> > >>
> >> > >> +1 for having multiple PR review owners per module/label.
> >> > >>
> >> > >> Having module owners can accelerate PR processing. For instance,
> I'm awaiting feedback on a Spark action for computing partition stats (
> https://github.com/apache/iceberg/pull/9437). Currently, only Anton is
> reviewing, which may ca

Re: [PROPOSAL] Improvement on our PR flows

2024-04-04 Thread Brian Olsen

That seems like a good start. I do agree there needs to be a better way to
promote engagement among other members.

Perhaps I can do my next LinkedIn show describing the review process, how
Apache works, how to get started, and what NOT to do when submitting a PR.

This will likely translate into a good list and set of pages for the docs,
or an update to some existing ones.

Would you like to join me for an episode? Perhaps we can bring on a PMC
member or two if anyone is interested.

On Thu, Apr 4, 2024 at 8:14 AM Jean-Baptiste Onofré  wrote:

> Hi Brian,
>
> Yeah, I agree with your points. That's why I would like to create a PR
> as a discussion base (that we can update thanks to everyone's
> comments).
>
> 1. I think we already have a consensus about "stale issue/PR" reminder.
> 2. The concern is more about "assign/reviewer list". Rethinking this
> point, actually, we can address this with the 1: if we have reminder
> in stale PR, then someone can engage.
>
> So, I propose to start with 1, and experiment/see how it works.
>
> Regards
> JB
>
> On Thu, Apr 4, 2024 at 1:26 PM Brian Olsen 
> wrote:
> >
> > I think you both (JB and Ryan) have valid points.
> >
> > JB there absolutely is a need to address the scalability issue and we
> need to come up with a solution. I doubt there’s any disagreement that
> rising stale issues in the project should be ignored.
> >
> > Ryan’s concern also has merit from a different angle that could lead to
> similar outcomes (stale PRs) with the current proposed solution, as
> ownership while establishing and growing responsibility, could lead to
> fiefdoms.
> >
> > There’s clearly some support for a solution and so submitting the
> proposal as a more tangible PR is a good idea. At that point we can further
> revise this solution to account for both concerns. This also would be a
> good one to discuss in realtime at a community sync.
> >
> > Thanks for the edited here!
> >
> > Bits
> >
> > On Thu, Apr 4, 2024 at 4:12 AM Jean-Baptiste Onofré 
> wrote:
> >>
> >> Anyway, I'm preparing a PR to illustrate the proposal.
> >>
> >> Regards
> >> JB
> >>
> >> On Thu, Apr 4, 2024 at 10:59 AM Ajantha Bhat 
> wrote:
> >> >
> >> > Additionally, I propose allocating a brief 5-10 minute segment during
> each Iceberg community sync.
> >> > During this time, attendees can highlight any pull requests needing
> attention.
> >> > In cases where a pull request has become stagnant due to a lack of
> reviews, committers can step forward to offer assistance by conducting
> reviews and aiding in its resolution.
> >> >
> >> > - Ajantha
> >> >
> >> > On Wed, Mar 27, 2024 at 12:06 PM Jean-Baptiste Onofré <
> j...@nanthrax.net> wrote:
> >> >>
> >> >> By the way, I worked on a Python program that generate a report
> containing:
> >> >> - GitHub Issues
> >> >>   - Created since more than 6 months
> >> >>   - Without assignee
> >> >>   - Without activity (comment) since more than 7 days
> >> >> - GitHub PRs
> >> >>   - Created since more than 6 months
> >> >>   - Without reviewer
> >> >>   - With a single reviewer
> >> >>   - Without activity (comment, etc) since more than 7 days
> >> >>
> >> >> The report is a HTML page. I will send it on this thread today or
> >> >> tomorrow for review.
> >> >>
> >> >> For now, I only generate the HTML (locally on the machine), but it
> >> >> would be possible to publish on website or automatically (cron) send
> >> >> on the dev mailing list.
> >> >>
> >> >> Regards
> >> >> JB
> >> >>
> >> >> On Wed, Mar 27, 2024 at 6:57 AM Jean-Baptiste Onofré <
> j...@nanthrax.net> wrote:
> >> >> >
> >> >> > Adding a group of people as reviewers doesn't block others from
> help
> >> >> > and review (and it doesn't change what we do now). I don't see how
> >> >> > it's different to today, just having default people reviewing,
> adding
> >> >> > new people.
> >> >> > Actually, we clearly have today a bunch of PRs stale just due to
> lack
> >> >> > of reviewers. From a community standard, I'm also concerned that a
> lot
> >> >> > of PR is waiting for review from the same people: th

Re: Flink table maintenance

2024-04-08 Thread Brian Olsen

Hey Iceberg nation,

I would like to share the details of the meeting this Wednesday to further
discuss Péter's proposal on Flink Maintenance Tasks.
Calendar Link: https://calendar.app.google/83HGYWXoQJ8zXuVCA

List discussion:
https://lists.apache.org/thread/10mdf9zo6pn0dfq791nf4w1m7jh9k3sl


Design Doc: Flink table maintenance




On Mon, Apr 1, 2024 at 8:52 PM Manu Zhang  wrote:

> Hi Peter,
>
> Are you proposing to create a user facing locking feature in Iceberg, or
>> just something something for internal use?
>>
>
> Since it's a general issue, I'm proposing to create a general user
> interface first, while the implementation can be left to users. For
> example, we use Airflow to schedule maintenance jobs and we can check
> in-progress jobs with the Airflow API. Hive metastore lock might be another
> option we can implement for users.
>
> Thanks,
> Manu
>
> On Tue, Apr 2, 2024 at 5:26 AM Péter Váry 
> wrote:
>
>> Hi Ajantha,
>>
>> I thought about enabling post commit topology based compaction for sinks
>> using options, like we use for the parametrization of streaming reads [1].
>> I think it will be hard to do it in a user friendly way - because of the
>> high number of parameters -, but I think it is a possible solution with
>> sensible defaults.
>>
>> There is a batch-like solution for data file compaction already available
>> [2], but I do not see how we could extend Flink SQL to be able to call it.
>>
>> Writing to a branch using Flink SQL should be another thread, but by my
>> first guess, it shouldn't be hard to implement using options, like:
>> /*+ OPTIONS('branch'='b1') */
>> Since writing to branch i already working through the Java API [3].
>>
>> Thanks, Peter
>>
>> 1 -
>> https://iceberg.apache.org/docs/latest/flink-queries/#flink-streaming-read
>> 2 -
>> https://github.com/apache/iceberg/blob/820fc3ceda386149f42db8b54e6db9171d1a3a6d/flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/actions/RewriteDataFilesAction.java
>> 3 -
>> https://iceberg.apache.org/docs/latest/flink-writes/#branch-writes
>>
>> On Mon, Apr 1, 2024, 16:30 Ajantha Bhat  wrote:
>>
>>> Thanks for the proposal Peter.
>>>
>>> I just wanted to know do we have any plans for supporting SQL syntax for
>>> table maintenance (like CALL procedure) for pure Flink SQL users?
>>> I didn't see any custom SQL parser plugin support in Flink. I also saw
>>> that Branch write doesn't have SQL support (only Branch reads use Option),
>>> So I am not sure about the roadmap of Iceberg SQL support in Flink.
>>> Was there any discussion before?
>>>
>>> - Ajantha
>>>
>>> On Mon, Apr 1, 2024 at 7:51 PM Péter Váry 
>>> wrote:
>>>
>>>> Hi Manu,
>>>>
>>>> Just to clarify:
>>>> - Are you proposing to create a user facing locking feature in Iceberg,
>>>> or just something something for internal use?
>>>>
>>>> I think we shouldn't add locking to Iceberg's user facing scope in this
>>>> stage. A fully featured locking system has many more features that we need
>>>> (priorities, fairness, timeouts etc). I could be tempted when we are
>>>> talking about the REST catalog, but I think that should be further down the
>>>> road, if ever...
>>>>
>>>> About using the tags:
>>>> - I whole-heartedly agree that using tags is not intuitive, and I see
>>>> your points in most of your arguments. OTOH, introducing new requirement
>>>> (locking mechanism) seems like a wrong direction to me.
>>>> - We already defined a requirement (atomic changes on the table) for
>>>> the Catalog implementations which could be used to archive our goal here.
>>>> - We also already store technical data in snapshot properties in Flink
>>>> jobs (JobId, OperatorId, CheckpointId). Maybe technical tags/table
>>>> properties is not a big stretch.
>>>>
>>>> Or we can look at these tags or metadata as 'technical data' which is
>>>> internal to Iceberg, and shouldn't expressed on the external API. My
>>>> concern is:
>>>> - Would it be used often enough to worth the additional complexity?
>>>>
>>>> Knowing that Spark compaction is struggling with the same issue is a
>>>> good indicator, but probably we would need more use cases for introducing a
>>>> new feature with this complexity, or simpler solution.
>>>>
>>>> Thanks, Peter
>>>>
>>>>
>>>> On Mon, Apr 1, 2024, 10:18 Manu Zhang  wrote:
>>>>
>>>>> What would the community think of exploiting tags for preventing
>>>>>> concurrent maintenance loop executions.
>>>>>
>>>>>
>>>>> This issue is not specific to Flink maintenance jobs. We have a
>>>>> service scheduling Spark maintenance jobs by watching table commits. When
>>>>> we don't check in-progress maintenance jobs for the same table, multiple

Re: Iceberg table maintenance

2024-04-11 Thread Brian Olsen

Hey everyone,

Following up from this meeting, we decided that, given how distributed
everyone involved in the Flink work is, it would be best to reach consensus
on some critical points that Péter outlined for us.


   - Do you have concerns with the Maintenance Tasks execution
   infrastructure? The current plan is:
      - Streaming tasks to continuously execute the Maintenance Tasks
      - SinkV2 PostCommitTopology or Monitor service to collect the changes
      - Scheduler to decide which Maintenance Task(s) to run
      - Serialized Maintenance Task execution

   - Do you have use-cases for Maintenance Task(s) other than those
   mentioned in the doc? Do you have different priorities?
      - Data file rewrite
         - Global for rewriting existing and new files - equality/positional
         delete removal, repartitioning etc
         - Incremental for rewriting only new files - no metadata read needed
      - Expire snapshots
      - Manifest files rewrite
      - Delete orphan files
      - Positional delete files - probably not needed for Flink

   - Do you think more scheduling information is needed than mentioned in
   the doc?
      - Commit number
      - New data file number
      - New data file size
      - New delete file number
      - Elapsed time since the last run

   - How to prevent concurrent Maintenance Tasks?
      - Just run, and retry - will clutter the logs with failures, and waste
      resources
      - External locking
      - Using Iceberg table tags to prevent concurrent runs - mixing user data
      with technical data
      - We need a better solution :)
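
To make the scheduling question more concrete, here is a rough sketch of how a scheduler could pick Maintenance Tasks from the scheduling information listed above. It is only an illustration under assumed names: `TableChange`, `Trigger`, `due_tasks`, the task names, and every threshold value are hypothetical and do not come from the design doc.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableChange:
    """Counters accumulated since a task last ran (field names are illustrative)."""
    commits: int = 0
    new_data_files: int = 0
    new_data_file_bytes: int = 0
    new_delete_files: int = 0
    seconds_since_last_run: float = 0.0


@dataclass
class Trigger:
    """Fires when any configured threshold is reached; None disables a check."""
    task: str
    commits: Optional[int] = None
    new_data_files: Optional[int] = None
    new_data_file_bytes: Optional[int] = None
    new_delete_files: Optional[int] = None
    seconds_since_last_run: Optional[float] = None

    def should_run(self, change):
        checks = [
            (self.commits, change.commits),
            (self.new_data_files, change.new_data_files),
            (self.new_data_file_bytes, change.new_data_file_bytes),
            (self.new_delete_files, change.new_delete_files),
            (self.seconds_since_last_run, change.seconds_since_last_run),
        ]
        return any(limit is not None and seen >= limit for limit, seen in checks)


def due_tasks(triggers, change):
    """Return the names of tasks whose trigger fired, in configured order.

    The caller would then run them one at a time (serialized execution).
    """
    return [t.task for t in triggers if t.should_run(change)]


MB = 1024 * 1024
triggers = [
    Trigger("rewrite_data_files", new_data_files=100, new_data_file_bytes=512 * MB),
    Trigger("expire_snapshots", commits=50, seconds_since_last_run=3600.0),
    Trigger("rewrite_manifests", commits=200),
]
change = TableChange(commits=60, new_data_files=20, new_data_file_bytes=10 * MB,
                     new_delete_files=0, seconds_since_last_run=120.0)
due = due_tasks(triggers, change)
print(due)
```

With these made-up thresholds only the snapshot expiration trigger fires, because 60 commits exceed its limit of 50 while the other counters stay below theirs.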



I will be following up with these members if I don't see them reply in the
following week. If you have no comment and you are on this list, please
reply 'looks good' to acknowledge you have seen it. Anyone is free to chime
in, we just wanted a list of required eyes before assuming consensus.

Péter
Ryan Blue
Daniel C. Weeks
Russell Spitzer
Steven Wu
Bryan Keller
Manu Zhang
Ajantha Bhat
Yanghao Lin

Thanks everyone!!

On Thu, Apr 11, 2024 at 7:13 AM wenjin  wrote:

> Hi Peter,
>
> I am interested in your proposal and, with your help in the Flink
> community, have cleared up some of my confusion; thanks for your answers.
>
> I participated in yesterday’s discussion, but since I am not a native
> English speaker, I am concerned that I may have missed some details
> regarding the follow-up actions for this proposal. So, if possible, please
> involve me when there are further discussions or updates.
>
> Thanks,
> wenjin
>
> On 2024/04/09 04:48:48 Péter Váry wrote:
> > Forwarding the invite for the discussion we plan to do with the Iceberg
> > folks, as some of you might be interested in this.
> >
> > -- Forwarded message -
> > From: Brian Olsen 
> > Date: Mon, Apr 8, 2024, 18:29
> > Subject: Re: Flink table maintenance
> > To: 
> >
> >
> > Hey Iceberg nation,
> >
> > I would like to share the details of this Wednesday's meeting to further
> > discuss Péter's proposal on Flink Maintenance Tasks.
> > Calendar Link: https://calendar.app.google/83HGYWXoQJ8zXuVCA
> >
> > List discussion:
> > https://lists.apache.org/thread/10mdf9zo6pn0dfq791nf4w1m7jh9k3sl
> >
> > Design Doc: Flink table maintenance
> > <https://docs.google.com/document/d/16g3vR18mVBy8jbFaLjf2JwAANuYOmIwr15yDDxovdnA/edit?usp=sharing>
> >
> >
> >
> > On Mon, Apr 1, 2024 at 8:52 PM Manu Zhang  wrote:
> >
> > > Hi Peter,
> > >
> > >> Are you proposing to create a user facing locking feature in Iceberg,
> > >> or just something for internal use?
> > >>
> > >
> > > Since it's a general issue, I'm proposing to create a general user
> > > interface first, while the implementation can be left to users. For
> > > example, we use Airflow to schedule maintenance jobs and we can check
> > > in-progress jobs with the Airflow API. Hive metastore lock might be
> > > another option we can implement for users.
> > >
> > > Thanks,
> > > Manu
> > >
> > > On Tue, Apr 2, 2024 at 5:26 AM Péter Váry 
> > > wrote:
> > >
> > >> Hi Ajantha,
> > >>
> > >> I thought about enabling post commit topology based compaction for
> > >> sinks using options, like we use for the parametrization of streaming
> > >> reads [1]. I think it will be h

Meeting Minutes 2024-04-17

2024-04-23 Thread Brian Olsen

Hey Iceberg Nation,

Here are the meeting minutes from last week's meeting.

Summary: We discussed fixes and improvements made to FileIO, the REST catalog,
and PyIceberg, as well as the ID overlap between columns and partitions, which
we agreed to handle separately. We considered improvements for path encoding
issues with special characters for the V3 spec. Anton proposed updating the
Spark integration to use the native vectorized readers being worked on in the
Comet project. Encryption support is making good progress; we just need to
finalize the key management details. We're now preparing for the Iceberg
Summit, with the agenda and talks to be announced soon!

Transcription/Recording: https://youtu.be/Bk8mXQ6UAPs

Meeting Notes:

* Highlights
  * REST catalog’s HTTP client supports proxy and timeout config (Thanks,
Harish!)
  * Fixed new FileIO method defaults (Thanks, Amogh!)
  * PyIceberg: Support for writing to partitioned tables that use
identity partitioning (Thanks Adrian!)
  * Rust: Implement projection to perform partition based pruning (Thanks
Scott!)
  * PyIceberg: Adding metadata tables (Thanks (in advance) Gowthami,
Andre, Drew, Kevin, Sung)
  * PyIceberg [Pyodide](https://pyodide.org/en/stable/) integration: This
enables us to run pyIceberg in the browser via WASM without requiring an
install for folks to learn about Iceberg and table formats
https://github.com/pyodide/pyodide/issues/4644
https://github.com/pyodide/pyodide/pull/4648, we’re still waiting on the
pyArrow-Pyodide integration.
* Releases
  * Java 1.5.1 Release
    * JDBC Catalog: Fix Escape character in GetNamespace SQL
https://github.com/apache/iceberg/pull/9407
    * JDBC Catalog: Fix JDBC Catalog table commit when migrating from
schema V0 to V1 https://github.com/apache/iceberg/pull/10111
  * PyIceberg 0.6.1 Release –
https://lists.apache.org/thread/pry0n9zm2h27wbbbyslm86hh1o23q2tf
    * Milestone Link:
https://github.com/apache/iceberg-python/pulls?q=is%3Apr+milestone%3A%22PyIceberg+0.6.1%22+is%3Aclosed
* Discussion
  * Field and partition ID overlap in metadata tables and columns
    * Originally partition IDs started at 1000 to avoid overlap with
column IDs from 1, but collisions happening now
    * Rather than change the IDs which would break compatibility,
best to handle columns and partitions separately
    * Zehan proposed utility methods to reassign partition IDs as
needed
  * Migrating the community to the REST catalog protocol
    * Goal is not to make REST API the only option, but best for
cross-language usage long-term
    * Some changes needed to improve vendor-agnostic integrations
and TCK validation
    * Still room for other catalog options like JDBC and Hive
metastore wrappers
  * Quotes in S3 locations (https://github.com/apache/iceberg/issues/10168
).
    * Issue with quotes and other special characters in partition
field values causing invalid S3 paths
    * S3 allows more special characters than URI specification, so
parsing issues arise
    * May restrict special chars in Iceberg itself for portability
across storage systems
    * Consider standardized path encoding for V3 spec
  * [Comet](https://github.com/apache/arrow-datafusion-comet) in Iceberg.
    * Comet project from Arrow will provide native vectorized readers
for Iceberg
    * Alternative to built-in reader with more features like
vectorized reads
    * Designed for Iceberg, handles projections and metadata
    * Could enable fully native Spark execution
  * Special character in column names
https://github.com/apache/iceberg/issues/10120
  * Spec V3 and 1.6/2.0
    * File and manifest encryption implemented
    * Just need to finalize key management integration
    * May use key metadata in snapshots and REST API, or custom key
providers
  * Iceberg Summit
    * Finalizing talk selection now, will announce agenda next week
    * Very high quality submissions received
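
As a rough illustration of what "standardized path encoding" for partition values could look like, the sketch below percent-encodes each value before it is placed in a path segment. This is an assumption for discussion, not how Iceberg builds paths today; `partition_path` is a hypothetical helper, and only the Python standard library is used.

```python
from urllib.parse import quote, unquote


def partition_path(field, value):
    """Build a `field=value` path segment with both parts percent-encoded."""
    # safe="" also encodes "/" so a value can never introduce an extra path level.
    return f"{quote(field, safe='')}={quote(str(value), safe='')}"


raw = 'say "hi"/there'
segment = partition_path("msg", raw)
decoded = unquote(segment.split("=", 1)[1])
print(segment)
print(decoded == raw)
```

Because quotes, spaces, and slashes are all escaped, the resulting segment stays within the characters a URI parser accepts, and the original value can still be recovered by decoding.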

Re: Call for Ryan Blue to Step Down as PMC Chair

2024-06-04 Thread Brian Olsen

+1 to JB and Jingsong,

For context, I’m currently a Tabular employee who won’t be moving on to
Databricks. I agree with the sentiments stated here about Ryan. I think the
concern around his potential employment at Databricks comes from a good
place but as JB mentioned this should be discussed by the PMC on the
private channel. I encourage us to follow up with a discussion there first.

My hope is that we consider this kind of action if there’s any evidence
that any PMC members are acting in the interest of their companies over
the Iceberg community.

Bits

On Wed, Jun 5, 2024 at 12:52 AM Jingsong Li  wrote:

> Hi,
>
> +1 to Jean-Baptiste.
>
> I am not a PMC member, but what I see in the iceberg community is
> Ryan's dedication to making this community better. He has invested a
> lot of energy in the community. I have also learned a lot from him.
>
> I personally trust Ryan, and I don't think he would do anything that
> would harm the community. He is one of the people who love the iceberg
> the most.
>
> Best,
> Jingsong
>
> On Wed, Jun 5, 2024 at 11:56 AM Jean-Baptiste Onofré 
> wrote:
> >
> > Hi Natsukawa,
> >
> > I would like to remind some points:
> > 1. At Apache, you contribute as an individual.
> > 2. The PMC members manage the project, and the PMC Chair is actually a
> > VP of the foundation reporting to the board
> > (https://www.apache.org/foundation/governance/pmcs.html)
> >
> > Ryan has our trust and respect as PMC Chair. Everything he's doing is
> > for the best of the Iceberg project and committee, be sure about that.
> >
> > You are probably reacting with emotion to announcements because you
> > like the project (and it's great to see such enthusiasm for Apache
> > Iceberg). You can totally express questions/concerns, but I think it
> > would have been more productive to send a message to the PMC members
> > via the Iceberg private mailing list.
> >
> > Regarding your last sentence, I would suggest avoiding pointing to
> > some people from the Iceberg committee. We have a very healthy and
> > vibrant community, everyone should be proud and happy about that, and
> > should preserve it.
> >
> > Again, I have complete trust in Ryan, and in our PMC members. If they
> > consider changes are needed, they will act.
> >
> > Thanks
> > Regards
> > JB
> >
> > On Wed, Jun 5, 2024 at 4:13 AM Kanou Natsukawa
> >  wrote:
> > >
> > > Hi community,
> > >
> > > I'm calling for Ryan Blue to step down as Iceberg PMC chair. With the
> recent acquisition of Tabular by Databricks [1], I believe there is a
> natural conflict of interest for him to continue to be the chair of the
> Iceberg project.
> > >
> > > Tabular's official messages will likely come and say something in the
> line of they will remain neutral, but in fact everyone knows that it is not
> possible when they have signed a contract with the company owning the
> competing project, and the contract has so much money involved.
> > >
> > > I have only contributed to Iceberg once, but I still see myself as a
> part of the community. I really like how Iceberg used to be, just a very
> well-designed table format. It started to change when Tabular was formed
> and started to do their REST catalog, but Tabular has been a small player
> in the industry that their control is in general not hurting the project.
> The startup also did many great things like py-iceberg after all, and I
> guess large companies also love the REST idea since they have the resource
> to build one, it's just not every company is Netflix or Apple. With
> Databricks, I am deeply worried about the direction of the project.
> > >
> > > I propose having someone from Apple (Russell, Anton, Yufei, Steven,
> Szehon), or Jack Ye from AWS to take the PMC chair position instead, as
> they are very active PMC members in the community, and have a much more
> neutral position to safely lead the project in the right direction.
> > >
> > > And also to other Iceberg PMC members and committers from Tabular, you
> have gained a lot of wealth from this, at this moment the best thing I hope
> you can do is please keep this project alone and out of your hands.
> > >
> > > [1] https://www.databricks.com/blog/databricks-tabular
> > >
> > > Thanks
> > > Natsukawa
>

Re: Agenda Community Sync 19th June

2024-06-19 Thread Brian Olsen

Hey all!

So I just spoke with Fokko. I’ll be happy to hop on to continue the
recordings, and I still owe you all some sync notes from the last few
meetings (those are still coming).

I’m not sure if Ryan will be joining given the holiday, but if not,
Fokko will be the backup.

On Wed, Jun 19, 2024 at 8:27 AM Renjie Liu  wrote:

> Hi, all:
>
> I want to share progress about iceberg-rust and discuss about 0.3.0
> release.
>
> On Wed, Jun 19, 2024 at 9:07 PM Jean-Baptiste Onofré 
> wrote:
>
>> Hi Jan,
>>
>> Thanks for your message.
>>
>> The document has been updated.
>>
>> I will provide two major updates from my side:
>> 1. Gradle update and revapi (both alternative and plugin fix)
>> 2. Iceberg Java 1.6.0 release preparation (including some dependency
>> updates)
>>
>> As it's Juneteenth today, if US part of the community is not there, I
>> propose Fokko moderates/drives the meeting.
>>
>> Regards
>> JB
>>
>> On Tue, Jun 18, 2024 at 8:28 PM Jan Kaul 
>> wrote:
>> >
>> > Hi all,
>> >
>> > I was wondering whether there was an agenda for the community sync
>> > tomorrow. There currently is no entry in the google doc.
>> >
>> > Best wishes,
>> >
>> > Jan
>> >
>>
>

Meeting Minutes 2024-05-08

2024-07-10 Thread Brian Olsen

Hey Iceberg Nation,

Here are the minutes from the last few meetings. I've had some
adjustments to make after moving on from Tabular; thanks for bearing with me.

Transcription/Recording

https://youtu.be/ekR0HOvjvI4

Summary
0:18 Geo support proposal has been added, community feedback is requested
3:55 View support for Hive catalog is in progress, reviewers needed
18:00 Rewrite data files API needs improvements to handle large numbers of
small files and reduce memory pressure
26:38 Discussion around having an open source reference implementation of
the Iceberg REST catalog spec
36:14 Reminder about the upcoming Iceberg Summit virtual conference

Geo Support Proposal

3:15 Geo support proposal has been added to the codebase, community is
requested to review and provide comments
3:43 Exciting new avenue to explore geometric types and geographic
partitioning functions natively in Iceberg

View Support for Hive Catalog

4:06 Work is ongoing to add view support to the Hive catalog
4:33 High-level review done, but reviewers familiar with Hive codebase
needed

Rewrite Data Files API

19:16 Current implementation can lead to high memory pressure when
rewriting large numbers of small files
19:12 Suggestions:
18:10 Add limit on number of files/bytes to rewrite in one operation
18:34 Flush data files after certain threshold instead of writing all
at commit time
19:12 Compact delete files first before rewriting data files
21:51 Need to explore Spark's data source V2 API for potential improvements
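
The first suggestion above (capping the number of files and bytes rewritten in one operation) can be sketched as a simple greedy grouping. This is only an illustration, not the Spark rewrite action; `plan_rewrite_groups` and its parameters are hypothetical, and sizes are in arbitrary units.

```python
def plan_rewrite_groups(file_sizes, max_group_size, max_files_per_group):
    """Greedily split files into groups capped by total size and file count."""
    groups, current, current_size = [], [], 0
    for size in file_sizes:
        over_size = bool(current) and current_size + size > max_group_size
        over_count = len(current) >= max_files_per_group
        if over_size or over_count:
            # Close the current group so it can be rewritten and committed alone.
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


sizes_mb = [10, 20, 30, 40, 50, 5, 5]
groups = plan_rewrite_groups(sizes_mb, max_group_size=64, max_files_per_group=3)
print(groups)
```

Each group can then be rewritten and flushed independently, which bounds how much state is held in memory at once. A file larger than the cap simply ends up in a group of its own.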

Reference REST Implementation

26:38 Discussion on having an open source reference implementation of
Iceberg's REST catalog spec
31:42 Concerns around making it an official Apache project and maintaining
it
19:12 Suggestions:
32:27 Start as a community project outside Apache, potentially under an
Apple open source repo
32:48 Make it thin, without opinionated features like metrics/security
34:03 Once mature, consider contributing to Apache Iceberg

Iceberg Summit

36:14 Reminder about the upcoming Iceberg Summit virtual conference on May
14-15
36:20 Great lineup of speakers from the community and Iceberg users

Notes

* Highlights
  * [Iceberg sink has been added](https://github.com/apache/beam/pull/30797)
to the Beam project
  * [Add pagination when listing namespaces/tables/views](
https://github.com/apache/iceberg/pull/9782) (Thanks Rahil)
  * [Stale PR management](https://github.com/apache/iceberg/pull/10134)
(Thanks JB)
  * [Use 'delete' / 'append' if OverwriteFiles only deletes/appends data
files](https://github.com/apache/iceberg/pull/10150) (Thanks Eduard)
  * [Use ‘delete’ if RowDelta only has deletes](
https://github.com/apache/iceberg/pull/10123) (Thanks Eduard)
  * [JDBC: Fix escape character used in Namespace SQL](
https://github.com/apache/iceberg/pull/10167) (Thanks Chauncy, JB)
  * [JDBC: Fix issue when migrating from V0 schema to V1](
https://github.com/apache/iceberg/pull/10152) (Thanks JB)
  * [Add support for Flink 1.19](
https://github.com/apache/iceberg/pull/10112) (Thanks Rodrigo)
  * [Flink: Apply delete granularity for writes](
https://github.com/apache/iceberg/pull/10200) (Thanks Peter)
  * Iceberg Summit is Next Week!
* Releases
  * Java 1.5.2 Release
    * Please vote!
    * 1.5.2 has the same source code as 1.5.1.
    * This release is being performed due to an issue with the released
Spark artifacts in 1.5.1. Some artifacts were compiled with the incorrect
Scala version due to an unclear reason.
  * Wrapping up the PyIceberg 0.7.0 release. If there is anything that you
want to get in, please add it to the [0.7.0 Milestone](
https://github.com/apache/iceberg-python/milestone/2).
* Discussion
  * Geo support proposal is up: [
github.com/apache/iceberg/issues/10260](http://github.com/apache/iceberg/issues/10260)
please take a look. Can discuss more in next sync
  * View support for Hive-Catalog:
https://github.com/apache/iceberg/pull/9852

Meeting Minutes 2024-05-29

2024-07-10 Thread Brian Olsen

Hey Iceberg Nation,

Here are the minutes from the last few meetings. I've had some
adjustments to make after moving on from Tabular; thanks for bearing with me.

Transcription/Recording

https://youtu.be/5xkhGDfFvGU

Summary

0:12 Iceberg Summit was a big success with great community participation
8:44 Planning for 1.6 release, JB volunteered as release manager
16:47 Need to clarify and document materialized views proposal
30:34 Discussed concerns around OAuth2 endpoint in REST catalog spec and
need for flexibility in authentication mechanisms

Iceberg Summit Recap

0:35 Twice as many talks as planned due to high interest
1:26 Videos will be posted to YouTube soon
1:17 Great sponsorship and plan to make it a yearly event

Release Updates

8:49 1.5.3 patch release for Jackson issue deferred to 1.6 release
11:54 Milestone for 1.6 release to be created to track issues

Java Highlights

2:30 Rev API fixes moved out of Gradle
4:12 Connection retries added to JDBC catalog
4:39 Bloom filter sensitivity configuration added
4:47 Support for partitioning by UUID
5:12 Coercing underlying Parquet types

Python Highlights

5:43 Retry logic fixes
5:55 Reading tables from Glue catalog
6:00 Initial manifest table implementation
6:09 Hierarchical SQL catalog namespaces
6:21 Categorical type support

Rust Highlights

6:42 Inclusive metrics evaluator progress
7:05 File predicate pushdown to Arrow reader
6:36 Expression support added

Go Highlights

8:01 Expression and literal support added

Materialized Views Proposal

16:47 Need to document current state and next decision points
16:47 Separate meeting to discuss storage approach details

REST Catalog Security

30:34 Discussed concerns around OAuth2 endpoint specification
30:34 Need to clarify intent and optionality
30:34 Potentially remove endpoint and rely on client config
30:34 Ensure flexibility for different authentication flows

Notes

* Highlights
  * Iceberg Summit 🎉 (Videos [@ApacheIceberg](
https://www.youtube.com/@ApacheIceberg) YouTube later today)
  * Java
    * Moved RevAPI outside of Gradle, to unblock Gradle upgrades ([#10386](
https://github.com/apache/iceberg/pull/10386)) (Thanks JB!)
    * Core: Retry connections in JDBC catalog with user configured error
code list ([#10140](https://github.com/apache/iceberg/pull/10140)) (Thanks
Amogh Jahagirdar!)
    * Parquet: Add Bloom filter FPP config ([#10149](
https://github.com/apache/iceberg/pull/10149)) (Thanks Huaxin Gao!)
    * Spark: Fix issue when partitioning by UUID ([#8250](
https://github.com/apache/iceberg/pull/8250)) (Thanks Eduard Tudenhoefner!)
    * Spark: Coerce shorts and bytes into ints in Parquet Writer ([#10349](
https://github.com/apache/iceberg/pull/10349)) (Thanks Shardul Mahadik)
    * Add REST Catalog Iceberg Views to Trino ([trinodb/trino#19818](
https://github.com/trinodb/trino/pull/19818))
  * Python
    * Hive catalog: Add retry logic for hive locking ([#701](
https://github.com/apache/iceberg-python/pull/701)) (Thanks frankliee!)
    * Glue: register table using iceberg metadata file via pyiceberg
([#711](https://github.com/apache/iceberg-python/pull/711)) (Thanks Mehul
Batra!)
    * Initial implementation of the manifest table ([#717](
https://github.com/apache/iceberg-python/pull/717)) (Thanks Drew Gallardo!)
    * Introduce hierarchical namespaces into SqlCatalog ([#591](
https://github.com/apache/iceberg-python/pull/591)) (Thanks Eric L!)
    * Add support for Arrow categorical type ([#693](
https://github.com/apache/iceberg-python/pull/693)) (Thanks Sung!)
  * Rust
    * Implement InclusiveMetricsEvaluator ([#347](
https://github.com/apache/iceberg-rust/pull/347)) (Thanks Scott!)
    * Push down predicate into Parquet reader ([#295](
https://github.com/apache/iceberg-rust/pull/295)) (Thanks Liang-Chi)
  * Go
    * Add Literals ([#76](https://github.com/apache/iceberg-go/pull/76/))
(Thanks, Matt!)
* Releases
  * 1.5.3 Release?
    * Prevent deadlock in Jackson [#10379](
https://github.com/apache/iceberg/pull/10379)
  * 1.6.0 Release target date?
    * June 2024
* Discussion ([all proposals](
https://github.com/apache/iceberg/labels/proposal))
  * Materialized Views ([#10043](
https://github.com/apache/iceberg/issues/10043))
  * REST Topics / Status Updates
    * Security questions raised on [the mailing list](
https://lists.apache.org/thread/twk84xx7v0xy5q5tfd9x5torgr82vv50)
    * (Pre)Plan Endpoints ([#9695](
https://github.com/apache/iceberg/pull/9695))
    * Append Endpoints ([#10202](
https://github.com/apache/iceberg/pull/10202))
    * Server Capabilities ([#9940](
https://github.com/apache/iceberg/pull/9940))
    * Client-side purge in REST catalog ([#10089](
https://github.com/apache/iceberg/issues/10089))
    * SQL UDFs [mailing list discussion](
https://lists.apache.org/thread/gzofnqqpq5m0cjpts8z5k9rkz4y8gv10)
    * Access decision exchange ([#10395](
https://github.com/apache/iceberg/issues/10395))
  * Go Over Geo Proposal [#10260](
http://github.com/apache/iceberg

Meeting Minutes 2024-06-19

2024-07-10 Thread Brian Olsen

Hey Iceberg Nation,

Here are the minutes from the last few meetings. I've had some
adjustments to make after moving on from Tabular; thanks for bearing with me.

Transcription/Recording

https://youtu.be/j1GncDMj8HY

Summary

0:11 Several contributors have switched employers, but remain committed to
the Iceberg community
6:44 Releases are planned for Iceberg Java 1.6, Rust, and Python
13:13 Discussions around OAuth 2.0 integration, REST catalog, and
materialized views
25:52 Proposal to have separate sync meetings focused on REST catalog
development

Employer Changes

Some contributors (e.g. from Tabular, which was acquired by Databricks) have
changed employers; they remain committed to the Iceberg community and open
collaboration

Release Updates

Iceberg Java 1.6

7:00 Include Kafka commit code and Gradle/Rev API updates
7:22 JB to start the release process next week

Rust

7:48 Focus on read support - file pruning, filter pushdown, Arrow
integration
9:06 Waiting on fix for field ID bug before releasing

Python

5:25 Making progress on snapshot management capabilities

OAuth 2.0 Integration

13:13 Discussions around keeping/removing the tokens endpoint
13:25 Need to build shared understanding before deciding
17:46 Proposal to have a separate vote if consensus can't be reached

REST Catalog

25:52 Proposal to have separate sync meetings focused just on REST catalog
development
29:56 Move REST catalog spec to a separate repo? Concerns around breaking
links
31:42 Have regular note-taking for sub-sync meetings

Materialized Views

32:49 Good progress made, most functional questions answered
35:16 Some remaining naming/terminology discussions
34:05 Plan to start PR to adopt spec definitions soon

Server Capabilities

37:28 Important for indicating support for endpoints like plan/pre-plan
42:30 Discussions around authentication, backward compatibility
45:04 Likely to bring up for a community vote

Other Topics

32:35 Quick updates on SQL UDFs, rebranding, supporting more storage
services, geospatial data types

Notes

* Highlights
  * Iceberg Summit [videos are online](
https://www.youtube.com/@ApacheIceberg/videos), please check them out!
  * Java
    * Thanks Eduard for [forking the rev-api plugin](
https://plugins.gradle.org/plugin/io.github.nastra.revapi) to support
Gradle greater than 8.1, thanks [JB for doing the actual work](
https://github.com/nastra/gradle-revapi/pull/11)! Alternative PR has been
created to avoid use of unmaintained revapi gradle plugin ([#10386](
https://github.com/apache/iceberg/pull/10386)).
    * Thanks Manu Zhang for the [TLC](
https://github.com/apache/iceberg/pull/10374) [for](
https://github.com/apache/iceberg/pull/10394) [the](
https://github.com/apache/iceberg/pull/10397) [docs](
https://github.com/apache/iceberg/pull/10463)
    * Thanks Piotr for [working towards JDK21](
https://github.com/apache/iceberg/pull/10474) support and [fixing](
https://github.com/apache/iceberg/pull/10521) [a lot](
https://github.com/apache/iceberg/pull/10530) [along](
https://github.com/apache/iceberg/pull/10485) [the](
https://github.com/apache/iceberg/pull/10517) [way](
https://github.com/apache/iceberg/pull/10475).
    * Thanks Szehon for fixing [rewrite positional deletes files](
https://github.com/apache/iceberg/pull/10020) on tables with 1k+ columns.
  * Python
    * Thanks Chinmay Bhat for adding snapshot management: [creation of tags
and branches](https://github.com/apache/iceberg-python/pull/728), and
looking up [snapshots by datetime](
https://github.com/apache/iceberg-python/pull/748).
    * Thanks Sung for adding support for [temporal
partitioned writes](https://github.com/apache/iceberg-python/pull/784).
  * Rust
    * Thanks Shabana for adding the PredicateVisitor for [filtering out
manifests](https://github.com/apache/iceberg-rust/pull/367).
  * Go
    * Thanks Matt for adding [predicates and expressions](
https://github.com/apache/iceberg-go/pull/91)!
* Upcoming releases
  * Iceberg 1.6.0 Release ([devlist](
https://lists.apache.org/thread/ymx4kbbfmndmhlrzfrpgzj3hmo6294pv),
[milestone](https://github.com/apache/iceberg/milestone/44)) scheduled for
June. Preparation started, plan to submit to vote next week (JB is release
manager for this one).
  * Iceberg-rust 0.3.0 release ([devlist](
https://lists.apache.org/thread/x1kn3oq5lv6hllf1d50pbyrwcwthy4t1),
[tracking issue](https://github.com/apache/iceberg-rust/issues/348),
[project](https://github.com/orgs/apache/projects/339/views/1))
  * PyIceberg heading to 0.7.0 ([tracking issue](
https://github.com/apache/iceberg-python/issues/736))
* Discussion ([all proposals](
https://github.com/apache/iceberg/labels/proposal))
  * [Path forward](https://github.com/apache/iceberg/issues/10537) for the
OAuth2 [concerns](
https://lists.apache.org/thread/twk84xx7v0xy5q5tfd9x5torgr82vv50) in the
REST Catalog (Dmitri/Robert)
  * Iceberg Materialized Views (Jan) ([Spec](
https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMe

Meeting Minutes 2024-07-10

2024-07-10 Thread Brian Olsen

Hey Iceberg Nation,

Here are the minutes from the last few meetings. I've had some
adjustments to make after moving on from Tabular; thanks for bearing with me.

Transcription/Recording

https://youtu.be/jAWka8g0o7c

Summary

0:11 Significant progress on geospatial support proposal, addressing key
aspects like geometry types, encoding, partition transforms, predicate
pushdown etc.
0:16 Variant data type proposal from Snowflake team to add support for JSON
superset "variant" type in Iceberg
8:44 Discussion on 1.6 release blockers and next steps for other releases
(Python, Rust)

Recent Updates

0:23 RevAPI plugin forked to new repo to continue using with Gradle
1:00 Parallel file listings for Snapshot/Migrate commands
1:29 Data file content filters pushdown for metadata table scans
3:00 Aggregate pushdown for incremental scans
3:27 Flink performance improvements
4:15 Removing credential override in table sessions

Geospatial Support Overview

14:21 Background on open-source geospatial support
14:56 Geometry as a complex type following OGC standards
16:40 WKB encoding mapped to Parquet byte arrays
18:36 Parquet logical type for geometry stats (bounding boxes)
20:25 XZ2 partition transform to map geometries deterministically
21:22 Geospatial sort orders like Hilbert curve
22:12 Integration with Sedona for Spark expressions
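
For readers unfamiliar with the pieces above, here is a small sketch of WKB encoding for a single point and of the bounding-box check that geometry stats make possible. It assumes the standard little-endian WKB layout for a 2D point (byte order flag, geometry type 1, then x and y as doubles). The helper names are hypothetical, and this is not the proposal's implementation.

```python
import struct


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB: byte order, geometry type 1, x, y."""
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(data):
    """Decode the bytes produced by point_to_wkb back into (x, y)."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", data)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("only little-endian WKB points are handled in this sketch")
    return x, y


def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_intersect(a, b):
    """True when two (min_x, min_y, max_x, max_y) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


wkb = point_to_wkb(1.5, -2.0)
file_box = bounding_box([(0.0, 0.0), (2.0, 3.0), (1.0, 1.0)])
query_hit = boxes_intersect(file_box, (1.5, 1.5, 4.0, 4.0))
query_miss = boxes_intersect(file_box, (5.0, 5.0, 6.0, 6.0))
print(len(wkb), wkb_to_point(wkb), file_box, query_hit, query_miss)
```

A reader that keeps one bounding box per file (or per manifest) can skip any file whose box does not intersect the query window, which is the pruning benefit the stats are meant to provide.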

Variant Data Type Proposal

33:46 Variant as JSON superset with richer types (timestamps etc.)
40:06 Encoding discussion - leaning towards Spark variant binary
44:12 Support both JSON and variant, with variant as superset
52:21 Potential performance improvements with sub-column extraction
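
Two of the points above can be illustrated with plain Python. The first part shows why a JSON superset is useful: standard JSON has no timestamp type. The second part sketches sub-column extraction, where one nested field is pulled out into its own column. This does not use the Spark variant binary encoding; `extract_path` and the sample records are hypothetical.

```python
import json
from datetime import datetime, timezone

record = {
    "id": 7,
    "event": {"ts": datetime(2024, 7, 10, tzinfo=timezone.utc), "kind": "click"},
}

# Plain JSON has no timestamp type, so the standard encoder rejects the value.
try:
    json.dumps(record)
    json_ok = True
except TypeError:
    json_ok = False


def extract_path(rows, path):
    """Pull one nested field out of every row into its own column (None if absent)."""
    column = []
    for row in rows:
        value = row
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
        column.append(value)
    return column


rows = [record, {"id": 8, "event": {"kind": "view"}}, {"id": 9}]
kinds = extract_path(rows, ["event", "kind"])
print(json_ok, kinds)
```

Once a frequently queried path is stored as its own column, a filter on it no longer needs to parse the whole nested value for every row, which is where the potential performance gain comes from.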

Releases

7:25 1.6 release blockers: commit-isCommitted PR, Avro release
9:25 Rust release pending Avro 1.11/1.12
10:47 Python 0.7 release wrapping up arrow integration

Next Steps

* Review and approve pull request for Iceberg 1.6 release
* Consider implications of supporting both well-known binary and Parquet
logical type for geospatial, and push for Parquet logical type if possible
* Evaluate potential use cases for bounding box intersections at manifest
list level, beyond just geospatial
* Get more details from data types team on variant proposal, including how
to map variant to JSON and any performance implications of separating
subcolumns
* Continue discussion on whether to support variant type, JSON type, or
both in Iceberg, considering benefits and downsides of each option

Notes

* Highlights
  * Java
    * Thanks Ajantha, JB and Eduard for [working on](
https://github.com/apache/iceberg/pull/8486/) the [RevAPI upgrade](
https://github.com/apache/iceberg/pull/10631)
      * RevAPI moved now from https://github.com/palantir/gradle-revapi to
https://github.com/revapi/gradle-revapi
    * Spark snapshot and migrate procedures can parallelize file listing
(Thanks, Manu!)
    * Added pushdown for data file content filters in entries metadata
table (Thanks, Steve Zhang!)
    * Added aggregate pushdown for incremental scans (Thanks, Huaxin!)
    * Flink performance improved by pre-creating getters (Thanks,
@fengjiajie!)
    * Fixed credential override in table sessions (Thanks, Alexandre!)
  * Python
    * Thanks Sung for [streaming data](
https://github.com/apache/iceberg-python/pull/786) through an arrow batch
reader.
    * Thanks Honah for adding [merge-appends](
https://github.com/apache/iceberg-python/pull/569).
  * Rust
    * Thanks Zenotme for writing [Field-IDs](
https://github.com/apache/iceberg-rust/pull/411) to the Avro files.
  * Spark
    * Can pass read properties via [Spark SQL](
https://github.com/apache/spark/pull/46707) (Szehon)
* Releases
  * Iceberg 1.6.0 Release ([devlist](
https://lists.apache.org/thread/ymx4kbbfmndmhlrzfrpgzj3hmo6294pv),
[milestone](https://github.com/apache/iceberg/milestone/44)).
    * [Kafka Connect: Commit coordination](
https://github.com/apache/iceberg/pull/10351)
  * Iceberg-rust 0.3.0 release ([devlist](
https://lists.apache.org/thread/x1kn3oq5lv6hllf1d50pbyrwcwthy4t1),
[tracking issue](https://github.com/apache/iceberg-rust/issues/348),
[project](https://github.com/orgs/apache/projects/339/views/1)). Waiting
for the Avro release.
  * PyIceberg heading to 0.7.0 ([tracking issue](
https://github.com/apache/iceberg-python/issues/736)), wrapping up the
final PRs
  * Progress for project guidelines
* Discussion ([all proposals](
https://github.com/apache/iceberg/labels/proposal))
  * Geo Support Overview (if helpful?)
    * Column ranges in ManifestFile metadata (stored in manifest list)
  * Spec changes
    * Relative paths in Iceberg metadata (revisit this [proposal](
https://docs.google.com/document/u/0/d/1RDEjJAVEXg1csRzyzTuM634L88vvI0iDHNQQK3kOVR0/edit
))
    * Variant [proposal](
https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit#heading=h.rt0cvesdzsj7
)
    * Additional types
      * timestamp_ns
      * Variant
      * Null type?
      * Geo
      * TimeUUID (in binary comparable representation)
      * Blob?

Re: Administration of Apache Iceberg Social/Marketing Channels

2024-07-18 Thread Brian Olsen

Hey Ryan,

I will take that down; apologies for the confusion. I remember some earlier
discussion around this, and my recollection was that the concern was mainly
about vendor content. After reading your reply, I remembered that this is a
more general ASF sentiment. That said, I still think distributing these
through various channels would be valuable, while exposing them through
playlists. For example, we added the Data Council talk you gave a while back:
the video is hosted on Data Council's channel, but we've added it to the
"Iceberg Talks Playlist".

Perhaps the PMC could create some guidance for certain playlists, like the
meetups playlist I just created. I could then add videos from other channels
(say, ones that Kevin and others own for uploading Iceberg meetups), and we
could give visibility to videos from their channels that meet the criteria
(for example, avoiding talks that push a particular vendor or product). Would
this align more with what you have in mind to avoid endorsements?

On Thu, Jul 18, 2024 at 3:59 PM Ryan Blue 
wrote:

> I strongly prefer not posting meetup videos on an official Apache Iceberg
> channel.
>
> That's why we have up until now used the channel only for videos from
> sanctioned Iceberg events <https://www.youtube.com/@ApacheIceberg/videos>,
> like community syncs or the Apache Iceberg Summit. I wasn't aware that Bits
> posted the last Seattle meetup video.
>
> An "official" distribution channel like this (one controlled by the
> Iceberg PMC) needs to be careful about posting content because it can be
> seen as an endorsement and there are reasonable objections around posting
> some videos (for instance, they are strongly biased to a particular
> solution or vendor). I don't want the PMC to be in the business of deciding
> what gets to be on the official channel or not -- that's a whole set of
> arguments that I think we want to steer clear of. This is also the reason
> why the Iceberg site doesn't host blog posts, by the way. We originally had
> requests for people to post articles, but I don't think it is wise for the
> project to decide what opinions can be posted there.
>
> I think that we should remove the last meetup video and only post content
> like the community sync videos. For everything else, I think it is a good
> idea to have a separate channel. We can link to those channels to make them
> easy to find, but we don't want to be in the middle when someone has an
> opinion about what should or should not be posted and should not be seen as
> endorsing any particular solutions.
>
> Ryan
>
> On Thu, Jul 18, 2024 at 1:44 PM Jack Ye  wrote:
>
>> > posting/content management privileges to the Iceberg YouTube channel
>> for these meetup recordings
>>
>> This sounds reasonable to me, when the meetup is recurring. So I am good
>> with doing this for the Seattle meetup series. We have technically already
>> done so for Bits for the community sync meeting series.
>>
>> Thoughts?
>>
>> -Jack
>>
>>
>>
>>
>> On Thu, Jul 18, 2024 at 1:23 PM Jonathan Leang 
>> wrote:
>>
>>> Hey all,
>>>
>>> I have a question related to some questions about shared administration
>>> of Apache Iceberg social/marketing channels such as YouTube in this
>>> specific instance. Kevin Liu and I have organized the first Seattle Iceberg
>>> meetups and it looks like there will be a steady stream of technical talks
>>> about things like proposals and more generally recent developments in the
>>> project. Bits (Brian Olsen) helped us get the most recent meetup posted on
>>> the Apache Iceberg YouTube channel but we're hoping not to bother him every
>>> month. We're wondering if Kevin and/or I can get posting/content management
>>> privileges to the Iceberg YouTube channel for these meetup recordings.
>>>
>>> We asked Bits and he replied:
>>> It would be good to have some trusted people + PMC coordinate the social
>>> + marketing channels. In many projects I either see nobody really owning it
>>> and it becomes abandoned over time and other times I see it primarily run
>>> through one or two companies or individuals which doesn't scale. It would
>>> be great if a small group of trusted volunteers comanage these marketing
>>> channels with similar guidelines on how to coordinate when, channel
>>> "voice", and following standard ASF COC guidelines, etc...
>>>
>>> Perhaps Bits and/or a PMC member could take on the responsibility of
>>> building out the guidance on how to run this without too much overlap or
>>> interference as it scales out.
>>>
>>> Thanks!
>>> Jonathan Leang
>>>
>>
>
> --
> Ryan Blue
> Databricks
>

Re: Administration of Apache Iceberg Social/Marketing Channels

2024-07-18 Thread Brian Olsen

Update: I set the video and playlist to Private until this is discussed. In
the meantime, we'll wait for guidance from the PMC and the community.
Otherwise, I suggest that Jonathan and Kevin create an unofficial community
Apache Iceberg Meetups channel that follows the trademark guidelines
(https://apache.org/foundation/marks/), but I have asked them to allow some
time for others to respond.

Thanks all!


On Thu, Jul 18, 2024 at 4:41 PM Brian Olsen  wrote:

> Hey Ryan,
>
> I will take that down; apologies for the confusion. I remember some earlier
> discussion around this, and my recollection was that the concern was mainly
> about vendor content. After reading your reply, I remembered that this is a
> more general ASF sentiment. That said, I still think distributing these
> through various channels would be valuable, while exposing them through
> playlists. For example, we added the Data Council talk you gave a while
> back: the video is hosted on Data Council's channel, but we've added it to
> the "Iceberg Talks Playlist".
>
> Perhaps the PMC could create some guidance for certain playlists, like the
> meetups playlist I just created. I could then add videos from other channels
> (say, ones that Kevin and others own for uploading Iceberg meetups), and we
> could give visibility to videos from their channels that meet the criteria
> (for example, avoiding talks that push a particular vendor or product).
> Would this align more with what you have in mind to avoid endorsements?
>
> On Thu, Jul 18, 2024 at 3:59 PM Ryan Blue 
> wrote:
>
>> I strongly prefer not posting meetup videos on an official Apache Iceberg
>> channel.
>>
>> That's why we have up until now used the channel only for videos from
>> sanctioned Iceberg events <https://www.youtube.com/@ApacheIceberg/videos>,
>> like community syncs or the Apache Iceberg Summit. I wasn't aware that Bits
>> posted the last Seattle meetup video.
>>
>> An "official" distribution channel like this (one controlled by the
>> Iceberg PMC) needs to be careful about posting content because it can be
>> seen as an endorsement and there are reasonable objections around posting
>> some videos (for instance, they are strongly biased to a particular
>> solution or vendor). I don't want the PMC to be in the business of deciding
>> what gets to be on the official channel or not -- that's a whole set of
>> arguments that I think we want to steer clear of. This is also the reason
>> why the Iceberg site doesn't host blog posts, by the way. We originally had
>> requests for people to post articles, but I don't think it is wise for the
>> project to decide what opinions can be posted there.
>>
>> I think that we should remove the last meetup video and only post content
>> like the community sync videos. For everything else, I think it is a good
>> idea to have a separate channel. We can link to those channels to make them
>> easy to find, but we don't want to be in the middle when someone has an
>> opinion about what should or should not be posted and should not be seen as
>> endorsing any particular solutions.
>>
>> Ryan
>>
>> On Thu, Jul 18, 2024 at 1:44 PM Jack Ye  wrote:
>>
>>> > posting/content management privileges to the Iceberg YouTube channel
>>> for these meetup recordings
>>>
>>> This sounds reasonable to me, when the meetup is recurring. So I am good
>>> with doing this for the Seattle meetup series. We have technically already
>>> done so for Bits for the community sync meeting series.
>>>
>>> Thoughts?
>>>
>>> -Jack
>>>
>>>
>>>
>>>
>>> On Thu, Jul 18, 2024 at 1:23 PM Jonathan Leang 
>>> wrote:
>>>
>>>> Hey all,
>>>>
>>>> I have a question related to some questions about shared administration
>>>> of Apache Iceberg social/marketing channels such as YouTube in this
>>>> specific instance. Kevin Liu and I have organized the first Seattle Iceberg
>>>> meetups and it looks like there will be a steady stream of technical talks
>>>> about things like proposals and more generally recent developments in the
>>>> project. Bits (Brian Olsen) helped us get the most recent meetup posted on
>>>> the Apache Iceberg YouTube channel but we're hoping not to bother him every
>>>> month. We're wondering if Kevin and/or I can get posting/content management
>>>> privileges to the Iceberg YouTube channel for these me

Meeting Minutes 2024-07-31

2024-10-02 Thread Brian Olsen

Hey Iceberg Nation,

Here are the meeting minutes from the July 31st meeting. The minutes for the
three more recent meetings will follow shortly! JB, they are official now ;)

Transcription/Recording

https://youtu.be/bN8OSHPApSk

Summary

0:00 New Committers and PMC Members

Welcomed several new Committers and PMC members, including Honah, Sung,
Kevin, Sean, Renjie, and Piotr.

1:53 Iceberg 1.6 Release

This release adds Avro CVE fixes and a memory fix contributed by new
committer Piotr. There was also discussion around the timing of the 1.7
release and the possibility of an earlier release to get some key updates
out sooner.

22:07 Other Updates

- Improvements to Flink support, including limit pushdown and speculative
execution
- Progress on Python, Rust, and Go implementations
- Upcoming changes to the Iceberg catalog sync meeting schedule

25:48 Iceberg Format Version 3 (V3)

The discussion covered these key questions for the Iceberg Format Version 3
(V3) roadmap:

- What features should be included in V3 (e.g. timestamp nanos, variants,
null type, type promotion, default values)
- Whether to tie the V3 release to a specific Iceberg version or keep them
decoupled
- Ensuring the V3 spec is implementable across languages, not just the Java
reference implementation
- Formalizing more table properties and configuration values in the spec

55:15 Next Steps

The group agreed to continue the discussion around Iceberg Format V3 on the
dev mailing list, with the goal of defining a clear roadmap and getting
commitments from contributors to drive key features to completion. The team
will also work on improving the tracking and management of the V3 spec
changes.

Notes

* Highlights
  * Welcome new committers and PMC members!
  * Java
    * Iceberg 1.6.0 has been released! Thanks JB!! 🎉🎉🎉
    * [Core: Support appending files with different partition specs](
      https://github.com/apache/iceberg/pull/9860) (Farooq Qaiser)
    * Flink added limit pushdown for FLIP-27 source (Thanks, Steven!)
    * Added Flink speculative execution support (Thanks, Venkata!)
    * Kafka Connect now has a runtime distribution (Thanks, Bryan!)
    * Updated ParallelIterable to limit memory consumption (Thanks, Piotr!)
  * Python
    * [PyIceberg 0.7.0 has been released](
      https://lists.apache.org/thread/93p74trcg8wh0qwhdcc120v10wfnwpb1)! Thanks
      Sung!! 🎉🎉🎉
  * Rust
    * [Avro-rs release, waiting on a final binding vote](
      https://lists.apache.org/thread/tqj0zk7qqsgr6tw10247grz3dv3svhtn)
    * [Add in-memory catalog implementation](
      https://github.com/apache/iceberg-rust/pull/475) (Farooq Qaiser)
  * Go
    * [Boolean expression visitors](
      https://github.com/apache/iceberg-go/pull/108), thanks Matt
* Releases
  * 1.6.1 release
    * Bugs that should be fixed?
    * Avro CVE
  * 1.7.0 timeline and release manager
    * Russell volunteering to be RM
    * LICENSE and NOTICE for Kafka Connect runtime
    * KC runtime Jar!
* Discussion
  * Short items:
    * LICENSE and NOTICE - opportunity for new contributors
    * Kafka Connect status
    * REST catalog sync
  * Should [format-version=3](
    https://github.com/apache/iceberg/blob/apache-iceberg-1.6.0/format/spec.md?plain=1#L1290-L1315)
    become formally adopted in Iceberg 1.7.0?
    * Interest in releasing [new column type](
      https://github.com/apache/iceberg/issues/10775).
    * [Default values](https://github.com/apache/iceberg/pull/2496) and
      [timestamp_ns](https://github.com/apache/iceberg/pull/9008) appear to be
      ready.
    * [Multi-arg transforms](https://github.com/apache/iceberg/issues/8258)
      would be deferred to format-version=4
  * [Namespace separator 0x1f](
    https://github.com/apache/iceberg/issues/10338)

Meeting Minutes 2024-10-02

2024-10-02 Thread Brian Olsen

Hey Iceberg Nation,

Here are the meeting minutes from the October 2nd meeting.

Transcription/Recording

https://youtu.be/4rQL8IMsajc

### 1.7 Release Planning

6:29 Target release date set for October 31st, 2024
7:20 Branch cut planned for mid-October
7:39 Some V3 spec features may be included (e.g. default values, type
promotion)
9:02 Connect licensing PR expected to be included
15:47 Community members encouraged to review 1.7 milestone and add PRs

### C++ Puffin Reader/Writer

17:22 Proposal to create new Iceberg C++ library approved
16:49 Initially focused on Puffin implementation for Impala
17:22 Follows existing pattern of language-specific libraries
17:36 Community can propose additional functionality in the future

### Standardizing Credentials in REST API

22:00 Ongoing discussion about structure of credentials in API responses
23:00 Proposal to have well-defined credential structure for easier
reasoning
24:27 Concerns raised about potential future changes to credential fields
24:41 Agreement to review refresh endpoint proposal before finalizing
decision

### Materialized Views Specification

30:30 Challenges with catalog naming inconsistencies across query engines
34:49 Proposal to use UUIDs for table identification in metadata. How to
map UUIDs back to catalog-specific identifiers. Consider SQL parsing as
fallback solution to avoid immediate spec changes

### File IO

39:13 Use File IO or table operations mechanism to refresh expired
credentials in File IO instances
40:00 Proposal for new File IO API to allow credential refreshing without
rebuilding object

42:51 Planning underway for Iceberg Summit 2025

Meeting Minutes 2024-08-21

2024-10-02 Thread Brian Olsen

Hey Iceberg Nation,

Here are the meeting minutes from the Aug 21st meeting.

Transcription/Recording

https://youtu.be/bN8OSHPApSk

Summary

Project Updates
3:19 Flink support added for 2020 copy and range distribution
4:30 Java 8 support dropped; now targeting Java 11
5:10 V3 format progress: metadata classes copied, upgrade testing added
6:00 S3 recovery operations implemented for repair table action
6:56 Deprecated APIs removed for 1.7 release
7:02 Column stats now exposed to Spark for better query planning
7:44 Rust: 0.3 release out, SQL catalog contribution, manifest list caching
added
8:19 Python: 0.7.1 release out

Upcoming Releases
9:11 1.6.1 vote thread ongoing - focuses on reducing Trino memory
consumption
9:35 1.7 preparing proposals for V3 spec changes
10:25 Default value support needed for Parquet and ORC formats

Row Lineage Proposal
11:13 Aims to identify and track changes to individual rows over time
12:30 Leaning towards global identifier approach
12:41 Two key fields: row identifier and row version (likely using sequence
number)
13:31 Open questions on additional versioning information

Row-level Deletes Improvements
22:17 Proposal addresses shortcomings in current implementation
22:33 Suggests synchronous maintenance of delete files
22:41 Splits metadata tracking at file level, but rolls up to larger files
22:53 Would require synchronous delete maintenance from V3 forward
28:24 Helps with CDC and change log capabilities

Type Promotion
32:16 Current stats lack original type information for promoted types
32:37 Proposal to limit scope for V3 to promotions determinable by byte
count
32:48 Int/long to string promotion tricky; may use lower/upper bound byte
count heuristic
33:21 Community feedback requested on GitHub PR with spec changes
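
To illustrate the byte-count idea in the type promotion notes above: the
Iceberg spec serializes int bounds as 4-byte and long bounds as 8-byte
little-endian values, so the length of a stored bound is enough to tell which
of the two types wrote it. Below is a minimal sketch under that assumption;
`decode_integer_bound` is a hypothetical helper for illustration, not an
Iceberg API.

```python
import struct


def decode_integer_bound(raw: bytes) -> int:
    """Decode a lower/upper bound written as int (4 bytes) or long (8 bytes)."""
    if len(raw) == 4:
        # Written while the column was still an int.
        return struct.unpack("<i", raw)[0]
    if len(raw) == 8:
        # Written after the column was promoted to long.
        return struct.unpack("<q", raw)[0]
    raise ValueError(f"cannot infer original type from {len(raw)} bytes")


# Bounds from a file written before and after an int -> long promotion.
old_bound = struct.pack("<i", -7)
new_bound = struct.pack("<q", 2**40)
print(decode_integer_bound(old_bound), decode_integer_bound(new_bound))
```

String bounds are UTF-8 bytes of arbitrary length, so a 4-byte bound could be
either an int or a four-character string. That is presumably why the
int/long to string case is described as tricky.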

UI for Iceberg
34:11 Prototype UI developed to visualize namespaces, tables, properties,
etc.
35:18 Discussion on whether to include in core project or keep separate
35:35 Consensus leans towards maintaining as separate community project
41:47 Suggestion to create "awesome list" for Iceberg-related tools/projects

REST Catalog Testing
46:31 PR adds lightweight server to run existing catalog tests against REST
implementations
46:40 Provides standardized behavior testing across implementations
46:51 Can be pointed at any REST server exposing Iceberg protocol
47:25 ~100 tests available out-of-the-box

Geo Type Specification
53:58 Email sent with geo spec details for review
51:25 Questions raised about deriving bounding box values from well-known
binary format
52:11 May need clarification in Iceberg spec on extracting values from
Parquet stats

Meeting Minutes 2024-09-11

2024-10-02 Thread Brian Olsen

Hey Iceberg Nation,

Here are the meeting minutes from the September 11th meeting.

Transcription/Recording

https://youtu.be/4oDdra5cYl8

Project Updates
0:00 Meeting Start
1:00 Docs improvements ongoing, thanks to Manu
1:24 PR merging guidelines published
1:53 Druid added Iceberg support
2:15 Spec updates:
  2:23 Remove partition spec API added
  3:10 Endpoint config and server-side planning endpoints added
4:56 Java updates:
  5:00 Incremental snapshot cleanup fix
  5:36 Parallel table migration backported
  5:51 REST compatibility test kit added
  6:21 Flink V2 sink API added
  7:00 Timestamp nanoseconds implementation

### 1.7.0 Release Planning
7:47 Variant spec moved to Parquet project
9:09 New delete file layout proposal targeting inclusion
9:22 Row lineage proposal nearing consensus
11:37 Implementation work ongoing for various V3 features
11:50 Considering splitting some V3 features to 1.8 if not ready

### V3 Deletes Proposal
25:57 Proposal received positive feedback
26:07 PRs in progress to demonstrate implementation details
27:05 Seeking community review and consensus

### Row IDs and Tracking
28:51 Settled on global metadata high watermark for row identifiers
29:00 Version tracking last sequence number that updated a row
29:18 Spec PR in progress
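
The two decisions above can be sketched roughly as follows. This is only an
illustration of the idea in these notes, not the spec text: `TableMetadata`,
`next_row_id`, and `last_updated_seq` are hypothetical names, and a real
implementation would track this in table metadata and data files rather than
in an in-memory dict.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    row_id: int            # stable identifier, assigned once and never reused
    last_updated_seq: int  # sequence number of the last commit that wrote the row
    value: str


@dataclass
class TableMetadata:
    next_row_id: int = 0           # global high watermark for row identifiers
    last_sequence_number: int = 0  # bumped on every commit
    rows: dict = field(default_factory=dict)

    def append(self, values):
        """Commit new rows, assigning IDs from the high watermark."""
        self.last_sequence_number += 1
        for value in values:
            self.rows[self.next_row_id] = Row(
                self.next_row_id, self.last_sequence_number, value
            )
            self.next_row_id += 1

    def update(self, row_id, value):
        """Commit an update: the row keeps its ID but gets a new version."""
        self.last_sequence_number += 1
        row = self.rows[row_id]
        row.value = value
        row.last_updated_seq = self.last_sequence_number


table = TableMetadata()
table.append(["a", "b"])  # commit 1: rows 0 and 1
table.update(0, "a2")     # commit 2: row 0 keeps its ID, version moves to 2
table.append(["c"])       # commit 3: row 2 is taken from the watermark
```

Because the watermark only moves forward, an identifier is never handed out
twice, and comparing `last_updated_seq` against a snapshot's sequence number
shows which rows changed since that snapshot.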
