Thanks for picking up this task. The decision to move Pulsar SQL out of apache/pulsar repository has been made over 2 years ago in April 2020 with PIP-62, https://github.com/apache/pulsar/wiki/PIP-62%3A-Move-connectors%2C-adapters-and-Pulsar-Presto-to-separate-repositories . It's not only about moving Pulsar SQL out of apache/pulsar repository, but also includes Pulsar IO connectors and Pulsar adapters (already moved to https://github.com/apache/pulsar-adapters).
> 1. Moving out the source code. > 2. Adapt Pulsar distribution logic so that we no longer includes Pulsar SQL > libs. > 3. Adapt Docker build logic for the same purpose. > 4. Define a release strategy for Pulsar SQL especially Pulsar compatibility > policy. > 5. Build Pulsar SQL release tarball and Docker image. > 6. Wire integration tests with the new repo model. regarding "6. Wire integration tests with the new repo model.", the integration tests referencing Pulsar SQL should also be moved out of apache/pulsar repository so that there's no dependency on Pulsar SQL in apache/pulsar. This also applies to "3. Adapt Docker build logic for the same purpose.", nothing in apache/pulsar repository docker images should depend on Pulsar SQL. I guess that the challenge is releasing something that is equivalent to the current "pulsar-all" docker image. It should not be handled in the apache/pulsar repository. We would need a new repository "pulsar-all-distribution" (or "pulsar-distribution" to make the name shorter). That repository could include the docker building logic and the integration tests that require the pulsar-all docker image that includes both Pulsar SQL and Pulsar IO Connectors. It might not be an optimal solution to run the integration tests against pulsar-all image only in the pulsar-all-distribution. It would be better if Pulsar SQL integration tests could run in the pulsar-sql repository and run with a docker image that can be quickly built to support Pulsar SQL integration tests. A possible solution could be that the same integration tests are run against the actual pulsar-all docker image in the "pulsar-distribution" repository to ensure that the tests also pass in full integration. This might be useful for validating the pulsar-all docker image release. We also would like to remove Pulsar IO from apache/pulsar and move it to a separate repository. This goal should be considered when we start making the changes. To summarize: when we are removing Pulsar SQL from apache/pulsar, it also means that the pulsar-all docker image is no more built as part of apache/pulsar builds and it requires a replacement (or dropping pulsar-all docker image completely). -Lari On 2022/07/15 12:11:32 tison wrote: > Hi devs, > > *Backgorund* > > In the past few weeks I pay some time on Pulsar SQL or name it Pulsar Trino > connector. > > I noticed in Trino community our committer Marvin (@xxc) ever submitted a > PR to contribute the connector to upstream[1]. However, due to the huge > version gap and lack of time to spend on that topic, it stalls about ten > months ago. > > At the moment the latest Pulsar version was 2.8.0 while we're preparing > 2.11.0 now. Besides, in Pulsar main repo we make several changes for the > connector, especially add Decimal support incompatible with upstream > changes. > > Then I have an idea to bump the version of Trino to migrate code from > Pulsar side[2]. Matteo reached out to me that it can be a worthy work > to remove the hard-dependency on Presto and if it’s going to take longer > time to get Trino to accept and merge it, move the Presto connector to a > separate repository, still within the Pulsar project. > > I try to prototype the work of this idea this week[3] and get more insights > about it. So I'd like to start this discussion to find other contributors > interested in this topic, figure out if moving Pulsar SQL to a separated > repository is a good idea, and discuss a few concrete challenges. > > *Motivation* > > The strongest motivation to move Pulsar SQL to a separated repository is > that even if a Pulsar user never uses Pulsar SQL, those libs are included > in the release tarball and take over 50% space of the distribution. This is > similar to the Pulsar Docker image. > > Another motivation is that we can simplify the codebase of main repo with > this movement. I found repo pulsar-connectors[4] and pulsar-presto[5] > exists but we didn't push them forward. > > Pulsar SQL modules are relatively stable and deserved for its own life. > > *Scope* > > The technique part for moving Pulsar SQL to a separated repository includes: > > 1. Moving out the source code. > 2. Adapt Pulsar distribution logic so that we no longer includes Pulsar SQL > libs. > 3. Adapt Docker build logic for the same purpose. > 4. Define a release strategy for Pulsar SQL especially Pulsar compatibility > policy. > 5. Build Pulsar SQL release tarball and Docker image. > 6. Wire integration tests with the new repo model. > > It contains several non-trivial works. I'll show you one by one in the next > section. > > *Challenges* > > Fortunately, moving out the source code and generate a distribution can be > done trivially based on our existing codebase. The prototype[3] achieves > this goal, while it will take some efforts to use the distribution > out-of-box. > > > Adapt Pulsar distribution logic so that we no longer includes Pulsar SQL > libs. > > We can remove the modules and related packaging logic. However, we may > retain the "sql-worker" and "sql" command in `pulsar` script so that if > users download Pulsar SQL tarball and extract them under lib/presto, the > experience won't change. This is similar to how users download connectors > and place them under connectors folder. > > > Adapt Docker build logic for the same purpose. > > I'm unfamiliar with Docker so I don't even know current logics. The most > challenging part should be the test images (tests/docker-images). > > > Define a release strategy for Pulsar SQL especially Pulsar compatibility > policy. > > We can release Pulsar SQL (may be renamed to Pulsar Trino Connector) from > 1.0. It's quite clumsy to do simultaneous release among multiple repos. > > ***Also, this proposal can require a major version bump as it makes > breaking changes.*** > > In the prototype, I have to compile Pulsar 2.11.0-SNAPSHOT instead of setup > Pulsar version to 2.10.1, because latest Pulsar SQL > uses LedgerOffloaderStats which is unreleased yet. > > This is biggest challenge that stops me from moving out Pulsar SQL. If the > development is tightly coupled, developing and debugging cross repositories > would be a nightmare :( > > > Wire integration tests with the new repo model. > > So far, the prototype cannot pass all tests. Failed tests include: > > * Tests depends on MockedPulsarServiceBaseTest. > > Besides, integration tests in Pulsar (CI - System - SQL) will break after > moving out Pulsar SQL. As described above, I don't know how to adjust the > test Docker image to overcome this issue. > > Do you think it's a good idea to move Pulsar SQL to a separated repository? > Did you meet something less than awesome because Pulsar SQL is inside the > main repo? > > Looking forward to your feedback on this topic. And perhaps you can help > figure out why tests depends on MockedPulsarServiceBaseTest failed on the > prototype[3]. > > Best, > tison. > > [1] https://github.com/trinodb/trino/pull/8020 > [2] https://github.com/apache/pulsar/pull/16494 > [3] https://github.com/tisonkun/pulsar-trino > [4] https://github.com/apache/pulsar-connectors > [5] https://github.com/apache/pulsar-presto >