Hi all,

First off, thanks, Brian, for starting the conversation, and thanks, Renjie, for the write-up.

I'm also in the multi-repo camp because of the benefits already mentioned.

One point I would like to add is that the potential drawback of having less visibility with multi-repos can be mitigated to some extent. I think that if the different repos are clearly and visibly presented on the Iceberg website, people should be able to find the desired implementation.

Best wishes,

Jan

On 10.08.23 13:43, Brian Olsen wrote:
Renjie, you're amazing.

I think you summarized this better than I could, so thank you for that.

I'd like to pull in a user's feedback from Slack:

    FWIW, I’m personally a fan of separate repos for the client libraries.
    It keeps things a bit more isolated (in a good way) and
    explorable (rather than overwhelming). GitHub search is a bit
    easier to use. And I think it generally lowers the bar to
    contributing. Independent versioning, and GitHub releases are a
    big win too, I think.


    Right now, I don’t actually know where to find PyIceberg release
    notes. Would love to see release notes in the GitHub releases for
    them.


IMO, the most important measure of success for either of these options is how smooth it makes the contributor experience.

A monorepo has the advantages of one place to look, the ability to model all changes across core/clients in a single PR, and shared resources. At first, I considered managing the build to be a problem only for Iceberg committers, but ultimately this sets us up for longer builds and for running unnecessary infrastructure for unrelated tasks. There are definitely ways to detect which parts of the code have changed and which checks should run, but it will not always be clear or simple to know whether we tested too much or not enough.

For that reason, I am also in the multi-repo camp (for clients). Despite having to manage a separate repo for each client, I generally consider the work on each client to be independent of the work happening in the main repo. In this view, it's possibly better that the work be independent and seen on its own. The biggest win IMO is the intentional separation of testing and deployment infrastructure. This will make for a better experience when folks are contributing, testing, and looking for release notes.

But I also really don't care as long as we do the same things across clients. ;)

Bits


On Thu, Aug 10, 2023 at 2:38 AM Renjie Liu <liurenjie2...@gmail.com> wrote:

    Hi, all:

    In yesterday’s community sync we talked about the location of
    different language clients, and I think we all agree that there
    should be consistent behavior for these clients, but the decision
    has not been made yet. I want to continue the discussion here on
    the pros and cons of each side: mono repo (all in one big
    repo) or multiple small repos (one for each language client).

    To make things clear, currently we have four language libraries
    under development:

     1. Java: in main repo (https://github.com/apache/iceberg)
     2. Python: in main repo (https://github.com/apache/iceberg)
     3. Go: in main repo (https://github.com/apache/iceberg)
     4. Rust: in standalone repo (https://github.com/apache/iceberg-rust/)

    Currently I mainly contribute to the Rust client, and I can share my
    thoughts on why I voted for a standalone repo:

     1. Easier project setup. Iceberg is a complex project with
        several components, mainly written in Java. As someone not
        quite familiar with its project structure, I find it easier to
        start a new project than to fit into an existing one.
     2. Faster CI workflow. In the early days of the Rust client’s
        development, we only need to touch Rust-related code. If
        everything lives in one mono repo, every change will trigger
        unnecessary CI runs for other components.

    I admit that these reasons may not hold for long-term maintenance,
    but they’re good for fast-paced development in the early days.

    After reviewing some discussions on the web, I have summarized
    the pros and cons of the two sides:

    Mono Repo

    Pros

      * *Visibility and transparency*. It would be easier to follow
        the progress of all clients, and PRs can get more reviews and
        attention.
      * *Easier sharing of resources*. It would be easier to share
        resources for integration tests.

    Cons

      * *Increases complexity of project structure*. The project
        structure would be more complex when coupling different
        languages and toolchain setups.
      * *Longer build/CI time*. Unnecessary CI checks may be triggered
        for small PRs in different languages.

    Multi Repo

    Pros

      * *Simplifies project structure*. Different languages may have
        different toolchains and project setups; one repo per language
        makes the project structure easier to understand and follow.
      * *Independent versioning and releases*. Different languages may
        have different versioning and release processes. This is also
        possible in a monorepo, but I guess it would be easier with
        standalone repos.
      * *Improved build/CI time*. No unnecessary CI checks will be
        triggered.

    Cons

      * *Difficult to track the overall progress*. Multiple repos make
        it harder to track what’s happening in different teams.
      * *Difficult to share common resources*. It may be more difficult
        to share resources and run integration tests across different
        languages.

    Feel free to share your ideas and thoughts in this discussion!

    References

     1. https://www.coforge.com/blog/mono-repo-vs.-multi-repo-in-git-unravelling-the-key-differences
