Thanks, everyone, for the nice discussion. +1 for multi-repo while keeping the core spec and the Java implementation in apache/iceberg. Java is currently still the most widely adopted and sophisticated implementation. We only need to help people find the other implementations by providing links on the web page.

From: Ryan Blue <b...@tabular.io>
Date: Friday, August 11, 2023 at 05:23
To: dev@iceberg.apache.org <dev@iceberg.apache.org>
Subject: Re: Discussion about the location of language clients

I wasn't at the discussion on Wednesday, but it sounds like there is support for moving to separate repos. Does anyone strongly object?

I also agree with Steven on not renaming to iceberg-java. That's the repo where we keep the spec and Java is the reference implementation. Plus we don't want to break a ton of links.

Ryan

On Thu, Aug 10, 2023 at 1:05 PM Steven Wu <stevenz...@gmail.com> wrote:

I am also on the side of separate repos for different languages. Otherwise, the main repo can grow too big. The iceberg.apache.org website can provide proper links to the repos for different languages.

I would be -1 on renaming apache/iceberg to apache/iceberg-java, as it can break external links to the main/original GitHub repo. The tradeoff may not be worth it.

On Thu, Aug 10, 2023 at 8:16 AM Fokko Driesprong <fo...@apache.org> wrote:

Hi everyone,

Today I took a stab at the generation of wheels in Python (here's the PR if anyone is interested: https://github.com/apache/iceberg/pull/8287), and when testing this it would also kick off many unrelated CI jobs. This is just for two languages, and I'm not convinced that it will scale to many languages. Also, having a different release cycle for each of the languages will clutter up the tags, releases, etc.

I'm convinced that separate repositories are more scalable in the future; we just have to make sure that they can be found easily (rename apache/iceberg to apache/iceberg-java?).

Cheers, Fokko

On Thu, Aug 10, 2023 at 14:18, Jan Kaul <jank...@mailbox.org.invalid> wrote:

Hi all,

First off, thanks Brian for starting the conversation and thanks Renjie for the write-up. I'm also in the multi-repo camp because of the already mentioned benefits.

One point I would like to add is that the potential drawback of having less visibility with multi-repos can be mitigated to some extent. I think that if the different repos are clearly and visibly presented on the Iceberg website, people should be able to find the desired implementation.

Best wishes,
Jan

On 10.08.23 13:43, Brian Olsen wrote:

Renjie, you're amazing. I think you summarized this better than I could, so thank you for that.

I'd like to pull in a user's feedback from Slack:

FWIW, I’m personally a fan of separate repos for the client libraries. It keeps things a bit more isolated (in a good way) and explorable (rather than overwhelming). GitHub search is a bit easier to use. And I think it generally lowers the bar to contributing. Independent versioning and GitHub releases are a big win too, I think. Right now, I don’t actually know where to find PyIceberg release notes. Would love to see release notes in the GitHub releases for them.

IMO, the most important measure of success for choosing either of these options is making the contributor experience as smooth as possible.

Monorepo has the advantages of one place to look, modeling all changes across core/clients in a single PR, and sharing resources. At first, I considered build management to be a problem only for the Iceberg committers who manage the build, but ultimately this sets us up for longer builds and for running unnecessary infrastructure for unrelated tasks. There are definitely ways to verify which parts of the code have changed and which code should be run, but it will not always be clear or simple to know whether we tested too much or not enough.

For that reason, I am also in the multi-repo camp (for clients). Despite having to manage a different repo for each client, I generally consider the work on each client to be independent of the work happening in the main repo. In this view, it's possibly better that the work be independent and seen on its own.
The biggest win IMO is the intentional separation of testing and deployment infrastructure. This will make for a better experience when folks are contributing, testing, and looking for release notes.

But I also really don't care as long as we do the same things across clients. ;)

Bits

On Thu, Aug 10, 2023 at 2:38 AM Renjie Liu <liurenjie2...@gmail.com> wrote:

Hi, all:

In yesterday’s community sync we talked about the location of the different language clients. I think we all agree that these clients should be handled consistently, but the decision has not been made yet. I want to continue the discussion here on the pros and cons of the two sides: mono repo (all in one big repo) or multiple small repos (one for each language client).

To make things clear, we currently have four language libraries under development:

1. Java: in main repo (https://github.com/apache/iceberg)
2. Python: in main repo (https://github.com/apache/iceberg)
3. Go: in main repo (https://github.com/apache/iceberg)
4. Rust: in standalone repo (https://github.com/apache/iceberg-rust/)

Currently I mainly contribute to the Rust client, and I can share my thoughts on why I voted for a standalone repo:

1. Easier project setup. Iceberg is a complex project with several components, mainly written in Java. As someone not quite familiar with this project structure, I find it easier to start a new project than to fit into an existing one.
2. Faster CI workflow. In the early days of the Rust client’s development, we only need to touch Rust-related code. If we all lived in one mono repo, changes would trigger unnecessary CI runs for other components.

I admit that these reasons may not hold for long-term maintenance, but they are good for fast-paced development in the early days.

After reviewing some discussions on the web, I have a summary of the pros and cons of the two sides:

Mono Repo

Pros

* Visibility and transparency.
It would be easier to follow the progress of all clients, and PRs can get more reviews and attention.
* Easier sharing of resources. It would be easier to share resources for integration tests.

Cons

* Increased complexity of project structure. The project structure would be more complex when coupling different languages and toolchain setups.
* Longer build/CI time. Unnecessary CI checks may be triggered for small PRs in different languages.

Multi Repo

Pros

* Simplified project structure. Different languages may have different toolchains and project setups; one repo per language makes the project structure easier to understand and follow.
* Independent versioning and releases. Different languages may have different versioning and release processes. This is also possible in a monorepo, but I guess it would be easier with standalone repos.
* Improved build/CI time. No unnecessary CI checks will be triggered.

Cons

* Difficult to track the overall progress. Multiple repos make it harder to track what’s happening in different teams.
* Difficult to share common resources. It may be more difficult to share resources and run integration tests across different languages.

You are welcome to share your ideas and thoughts in this discussion!

References

1. https://www.coforge.com/blog/mono-repo-vs.-multi-repo-in-git-unravelling-the-key-differences

--
Ryan Blue
Tabular