Hi all,
first off, thanks Brian for starting the conversation and thanks Renjie
for the write up.
I'm also in the multi-repo camp because of the benefits already mentioned.
One point I would like to add is that the potential drawback of having
less visibility with multi-repos can be mitigated to some extent. I
think that if the different repos are clearly and visibly presented on
the Iceberg website, people should be able to find the desired
implementation.
Best wishes,
Jan
On 10.08.23 13:43, Brian Olsen wrote:
Renjie, you're amazing.
I think you summarized this better than I could, so thank you for that.
I'd like to pull in a user's feedback from Slack:
FWIW, I’m personally a fan of separate repos for the client libraries.
It keeps things a bit more isolated (in a good way) and
explorable (rather than overwhelming). GitHub search is a bit
easier to use. And I think it generally lowers the bar to
contributing. Independent versioning, and GitHub releases are a
big win too, I think.
Right now, I don’t actually know where to find PyIceberg release
notes. Would love to see release notes in the GitHub releases for
them.
IMO, the most important measure of success when choosing between
these options is making the contributor experience as smooth as
possible.
Monorepo has the advantages of one place to look, the ability to model
all changes across core/clients in a single PR, and shared resources.
At first, I considered managing the build to be a problem only for
Iceberg committers, but ultimately this is setting us up for a longer
build and running unnecessary infrastructure for unrelated tasks.
There are definitely ways that we can verify which parts of the code
have been changed and which code should be run, but it will not always
be clear or simple to know if we tested too much or not enough.
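As a rough illustration of the changed-path idea, here is a minimal
sketch that maps changed files to the CI jobs that would run. The
directory names, job names, and the jobs_for_changes helper are all
hypothetical and not taken from the actual Iceberg build.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from top-level directory to CI job; these names
# are illustrative and not taken from the Iceberg build.
JOBS_BY_DIR = {
    "python": "python-ci",
    "go": "go-ci",
    "rust": "rust-ci",
}
# Assumption: anything outside the client directories belongs to the Java build.
DEFAULT_JOB = "java-ci"


def jobs_for_changes(changed_paths):
    """Return the sorted list of CI jobs to run for the changed file paths."""
    jobs = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        # Files at the repo root have no top-level directory.
        top = parts[0] if len(parts) > 1 else ""
        jobs.add(JOBS_BY_DIR.get(top, DEFAULT_JOB))
    return sorted(jobs)
```

Even in this toy version the difficulty shows up: a change to a shared
file at the repo root would only trigger the default job here, which
may be too little testing for the clients that depend on it.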
For that reason, I am also in the multi-repo camp (for clients).
Despite having to manage different repos for each client, I generally
consider the work of each client to be independent of the work
happening in the main repo. In this view, it's possibly better that
the work be independent and seen on its own. The biggest win IMO is
the intentional separation of testing and deployment infrastructure.
This will make for a better experience when folks are contributing,
testing, and looking for release notes.
But I also really don't care as long as we do the same things across
clients. ;)
Bits
On Thu, Aug 10, 2023 at 2:38 AM Renjie Liu <liurenjie2...@gmail.com>
wrote:
Hi, all:
In yesterday’s community sync we talked about the location of
different language clients, and I think we all agree that there
should be consistent behavior for these clients, but the decision
has not been made yet. I want to continue the discussion here on
the pros and cons of the two sides: mono repo (all in one big
repo) or multiple small repos (one for each language client).
To make things clear, currently we have four language libraries
under development:
1. Java: in main repo (https://github.com/apache/iceberg)
2. Python: in main repo (https://github.com/apache/iceberg)
3. Go: in main repo (https://github.com/apache/iceberg)
4. Rust: in standalone repo (https://github.com/apache/iceberg-rust/)
Currently I mainly contribute to the Rust client, and I can share my
thoughts on why I voted for a standalone repo:
1. Easier project setup. Iceberg is a complex project with
several components, mainly written in Java. As someone not
very familiar with this project structure, I find it easier to
start a new repo than to fit into an existing one.
2. Faster CI workflow. In the early days of the Rust client’s
development, we only need to touch Rust-related code. If we
all lived in one mono repo, each change would trigger
unnecessary CI runs for other components.
I admit that these reasons may not hold for long-term maintenance,
but they are good for fast-paced development in the early days.
After reviewing some discussions on the web, I have put together a
summary of the pros and cons of the two sides:
Mono Repo
Pros
* *Visibility and transparency*. It would be easier to follow the
progress of all clients, and PRs can get more reviews and
attention.
* *Easier sharing of resources*. It would be easier to share
resources for integration tests.
Cons
* *Increases complexity of project structure*. The project
structure would be more complex when coupling different
languages and toolchain setups.
* *Longer build/CI time*. Unnecessary CI checks may be triggered
for small PRs in different languages.
Multi Repo
Pros
* *Simplifies project structure*. Different languages may have
different toolchains and project setups; one repo per language
makes the project structure easier to understand and follow.
* *Independent versioning and releases*. Different languages may
have different versioning and release processes. This is also
possible in a monorepo, but I guess it would be easier with
standalone repos.
* *Improved build/CI time*. No unnecessary CI checks will be
triggered.
Cons
* *Difficult to track the overall progress*. Multiple repos make
it harder to track what’s happening in different teams.
* *Difficult to share common resources*. It may be more difficult
to share resources and do integration tests across different
languages.
You are welcome to share your ideas and thoughts in this discussion!
References
1. https://www.coforge.com/blog/mono-repo-vs.-multi-repo-in-git-unravelling-the-key-differences