Hi devs,

*Background*
In the past few weeks I have spent some time on Pulsar SQL, also known as the Pulsar Trino connector. I noticed that in the Trino community our committer Marvin (@xxc) once submitted a PR to contribute the connector upstream[1]. However, due to the huge version gap and a lack of time to spend on the topic, it stalled about ten months ago. At that time the latest Pulsar version was 2.8.0, while we are now preparing 2.11.0. Besides, we have made several changes to the connector in the Pulsar main repo, notably Decimal support, which is incompatible with the upstream changes.

So I had the idea of bumping the Trino version in order to migrate the code from the Pulsar side[2]. Matteo reached out to me and suggested that it would be worthwhile to remove the hard dependency on Presto and, if getting Trino to accept and merge the connector is going to take a long time, to move the Presto connector to a separate repository that is still within the Pulsar project.

I prototyped this idea this week[3] and gained more insight into it. So I'd like to start this discussion to find other contributors interested in this topic, figure out whether moving Pulsar SQL to a separate repository is a good idea, and discuss a few concrete challenges.

*Motivation*

The strongest motivation for moving Pulsar SQL to a separate repository is that even if a Pulsar user never uses Pulsar SQL, its libs are included in the release tarball and take up over 50% of the distribution's size. The same applies to the Pulsar Docker image.

Another motivation is that the move simplifies the codebase of the main repo. I found that the pulsar-connectors[4] and pulsar-presto[5] repos exist, but we did not push them forward. The Pulsar SQL modules are relatively stable and deserve a life of their own.

*Scope*

The technical work of moving Pulsar SQL to a separate repository includes:

1. Moving out the source code.
2. Adapting the Pulsar distribution logic so that we no longer include the Pulsar SQL libs.
3. Adapting the Docker build logic for the same purpose.
4. Defining a release strategy for Pulsar SQL, especially the Pulsar compatibility policy.
5. Building the Pulsar SQL release tarball and Docker image.
6. Wiring the integration tests into the new repo model.

Several of these items are non-trivial. I'll go through them one by one in the next section.

*Challenges*

Fortunately, moving out the source code and generating a distribution can be done trivially based on our existing codebase. The prototype[3] achieves this goal, although it will take some effort to make the distribution usable out of the box.

> Adapt Pulsar distribution logic so that we no longer include Pulsar SQL libs.

We can remove the modules and the related packaging logic. However, we may want to retain the "sql-worker" and "sql" commands in the `pulsar` script, so that if users download the Pulsar SQL tarball and extract it under lib/presto, the experience does not change. This is similar to how users download connectors and place them under the connectors folder.

> Adapt Docker build logic for the same purpose.

I'm unfamiliar with Docker, so I don't know how the current logic works. The most challenging part is probably the test images (tests/docker-images).

> Define a release strategy for Pulsar SQL especially Pulsar compatibility policy.

We can start releasing Pulsar SQL (possibly renamed to Pulsar Trino Connector) from 1.0. Doing simultaneous releases across multiple repos is quite clumsy. ***Also, this proposal may require a major version bump, as it makes breaking changes.***

In the prototype, I had to compile Pulsar 2.11.0-SNAPSHOT instead of setting the Pulsar version to 2.10.1, because the latest Pulsar SQL uses LedgerOffloaderStats, which has not been released yet. This is the biggest challenge that stops me from moving out Pulsar SQL. If the development is tightly coupled, developing and debugging across repositories would be a nightmare :(

> Wire integration tests with the new repo model.

So far, the prototype cannot pass all tests. The failing tests include:

* Tests that depend on MockedPulsarServiceBaseTest.

Besides, the integration tests in Pulsar (CI - System - SQL) will break after Pulsar SQL is moved out. As described above, I don't know how to adjust the test Docker image to overcome this issue.

Do you think it's a good idea to move Pulsar SQL to a separate repository? Have you run into anything less than awesome because Pulsar SQL lives in the main repo?

Looking forward to your feedback on this topic. And perhaps you can help figure out why the tests that depend on MockedPulsarServiceBaseTest fail in the prototype[3].

Best,
tison.

[1] https://github.com/trinodb/trino/pull/8020
[2] https://github.com/apache/pulsar/pull/16494
[3] https://github.com/tisonkun/pulsar-trino
[4] https://github.com/apache/pulsar-connectors
[5] https://github.com/apache/pulsar-presto