Thanks, everyone, for the nice discussion.

+1 for multi-repo while keeping the core spec and the Java implementation in 
apache/iceberg. Java is currently still the most widely adopted and most 
sophisticated implementation. We only need to help people find the other 
implementations by providing links on the website.


From: Ryan Blue <b...@tabular.io>
Date: Friday, August 11, 2023 at 05:23
To: dev@iceberg.apache.org <dev@iceberg.apache.org>
Subject: Re: Discussion about the location of language clients
I wasn't at the discussion on Wednesday, but it sounds like there is support 
for moving to separate repos. Does anyone strongly object?

I also agree with Steven on not renaming to iceberg-java. That's the repo where 
we keep the spec and Java is the reference implementation. Plus we don't want 
to break a ton of links.

Ryan

On Thu, Aug 10, 2023 at 1:05 PM Steven Wu <stevenz...@gmail.com> wrote:
I am also on the side of separate repos for different languages. Otherwise, the 
main repo can grow too big. The iceberg.apache.org website can provide proper 
links to the repos for the different languages.

I would be -1 on renaming apache/iceberg to apache/iceberg-java, as it can 
break external links to the main/original GitHub repo. The tradeoff may not be 
worth it.

On Thu, Aug 10, 2023 at 8:16 AM Fokko Driesprong <fo...@apache.org> wrote:
Hi everyone,

Today I took a stab at generating wheels in Python (here's the PR, 
https://github.com/apache/iceberg/pull/8287, if anyone is interested), and 
testing it also kicked off many unrelated CI jobs. This is just for two 
languages, and I'm not convinced that it will scale to many languages. Also, 
having a different release cycle for each language will clutter up the tags, 
releases, etc. I'm convinced that separate repositories will scale better in 
the future; we just have to make sure that they can be found easily (rename 
apache/iceberg to apache/iceberg-java?).

Cheers, Fokko



On Thu, Aug 10, 2023 at 2:18 PM Jan Kaul <jank...@mailbox.org.invalid> wrote:

Hi all,

First off, thanks, Brian, for starting the conversation, and thanks, Renjie, 
for the write-up.

I'm also in the multi-repo camp because of the benefits already mentioned.

One point I would like to add is that the potential drawback of having less 
visibility with multi-repos can be mitigated to some extent. I think that if 
the different repos are clearly and visibly presented on the iceberg website 
people should be able to find the desired implementation.

Best wishes,

Jan
On 10.08.23 13:43, Brian Olsen wrote:
Renjie, you're amazing.

I think you summarized this better than I could, so thank you for that.

I'd like to pull in a user's feedback from Slack:
FWIW, I’m personally a fan of separate repos for the client libraries.
It keeps things a bit more isolated (in a good way) and explorable (rather 
than overwhelming). GitHub search is a bit easier to use. And I think it 
generally lowers the bar to contributing. Independent versioning and GitHub 
releases are a big win too, I think.

Right now, I don’t actually know where to find PyIceberg release notes. Would 
love to see release notes in the GitHub releases for them.


IMO, the most important measure of success for either of these options is 
making the contributor experience as smooth as possible.

A monorepo has the advantages of one place to look, the ability to model all 
changes across core/clients in a single PR, and shared resources. At first, I 
considered managing the build to be a problem only for Iceberg committers, but 
ultimately this sets us up for longer builds and for running unnecessary 
infrastructure for unrelated tasks. There are definitely ways that we can 
verify which parts of the code have changed and which code should be run, but 
it will not always be clear or simple to know whether we tested too much or not 
enough.

For that reason, I am also in the multi-repo camp (for clients). Despite having 
to manage a separate repo for each client, I generally consider the work of 
each client to be independent of the work happening in the main repo. In this 
view, it's possibly better that the work be independent and seen on its own. 
The biggest win IMO is the intentional separation of testing and deployment 
infrastructure. This will make for a better experience when folks are 
contributing, testing, and looking for release notes.

But I also really don't care as long as we do the same things across clients. ;)

Bits


On Thu, Aug 10, 2023 at 2:38 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
Hi, all:

In yesterday’s community sync we talked about the location of the different 
language clients. I think we all agree that these clients should be handled 
consistently, but no decision has been made yet. I want to continue the 
discussion here on the pros and cons of the two sides: a mono repo (all in one 
big repo) or multiple small repos (one for each language client).

To make things clear, currently we have four language libraries under 
development:


  1.  Java: in main repo (https://github.com/apache/iceberg)
  2.  Python: in main repo (https://github.com/apache/iceberg)
  3.  Go: in main repo (https://github.com/apache/iceberg)
  4.  Rust: in standalone repo (https://github.com/apache/iceberg-rust/)

Currently I mainly contribute to the Rust client, and I can share my thoughts 
on why I voted for a standalone repo:


  1.  Easier project setup. Iceberg is a complex project with several 
components, written mainly in Java. As someone not very familiar with its 
project structure, I find it easier to start a new project than to fit into an 
existing one.
  2.  Faster CI workflow. In the early days of the Rust client’s development, 
we only need to touch Rust-related code. If everything lived in one mono repo, 
our changes would trigger unnecessary CI runs for other components.

I admit that these reasons may not hold for long-term maintenance, but they are 
good for fast-paced development in the early days.

After reviewing some discussions on the web, here is my summary of the pros and 
cons of the two sides:

Mono Repo

Pros

  *   Visibility and transparency. It would be easier to follow the progress of 
all clients, and PRs can get more reviews and attention.
  *   Easier sharing of resources. It would be easier to share resources for 
integration tests.
Cons

  *   Increased complexity of project structure. The project structure would be 
more complex when different languages and toolchain setups are coupled together.
  *   Longer build/CI time. Unnecessary CI checks may be triggered for small 
PRs in different languages.

Multi Repo

Pros

  *   Simplified project structure. Different languages may have different 
toolchains and project setups; one repo per language makes the project 
structure easier to understand and follow.
  *   Independent versioning and releases. Different languages may have 
different versioning and release processes. This is also possible in a mono 
repo, but I guess it would be easier with standalone repos.
  *   Improved build/CI time. No unnecessary CI checks will be triggered.
Cons

  *   Difficult to track the overall progress. Multiple repos make it harder to 
track what’s happening in different teams.
  *   Difficult to share common resources. It may be more difficult to share 
resources and run integration tests across different languages.


Everyone is welcome to share ideas and thoughts in this discussion!

References


  1.  https://www.coforge.com/blog/mono-repo-vs.-multi-repo-in-git-unravelling-the-key-differences


--
Ryan Blue
Tabular
