Here[1] I have prepared a patch to upgrade the PrestoSQL dependency to Trino. If several PMC members can pay attention to this part of the work, then regardless of whether a PIP is needed, we can use the opportunity of renaming presto to trino to move `conf/presto` to `trino/conf`, and `lib/presto/*` to `trino/*`.
This gathers all the Pulsar SQL pieces scattered across the PULSAR_HOME folder into one place, so that we can later move out the Pulsar SQL components smoothly.

Best,
tison.

[1] https://github.com/apache/pulsar/pull/16683

tison <wander4...@gmail.com> wrote on Tue, Jul 19, 2022 at 16:14:

> Hi Lari,
>
> Thanks for your feedback!
>
> A quick update: the unit tests now pass in the prototype repo[1].
>
> Reading the content above, I would phrase the major work in this
> direction as:
>
> * Package Pulsar and Pulsar SQL correctly, both combined and separately.
> * Upgrade the PrestoSQL version so that we can get rid of old versions of
>   dependencies.
>
> Although the prototype suggests that simply moving out the codebase
> doesn't break the code, we still face challenges in retaining the
> integration tests and packaging the artifacts correctly.
>
> Instead of first creating a new repository and leaving it incomplete
> later, and since I don't see a blocker to improving the packaging logic
> and bumping the version in the main repo, I will try to push this work in
> the main repo first. When we reach a point where the packaging process is
> already loosely coupled, we can move out that code trivially.
>
> Besides that, the Pulsar Presto distribution is baked into the "pulsar"
> image; "pulsar-all" adds the builtin connectors and offloaders.
>
> With respect to the integration tests, I cannot run them locally yet :/
>
> Best,
> tison.
>
> [1] https://github.com/tisonkun/pulsar-trino
>
>
> Lari Hotari <lhot...@apache.org> wrote on Mon, Jul 18, 2022 at 15:11:
>
>> Thanks for picking up this task. The decision to move Pulsar SQL out of
>> the apache/pulsar repository was made over 2 years ago, in April 2020,
>> with PIP-62,
>> https://github.com/apache/pulsar/wiki/PIP-62%3A-Move-connectors%2C-adapters-and-Pulsar-Presto-to-separate-repositories
>> . It's not only about moving Pulsar SQL out of the apache/pulsar
>> repository, but also includes Pulsar IO connectors and Pulsar adapters
>> (already moved to https://github.com/apache/pulsar-adapters).
>>
>> > 1. Moving out the source code.
>> > 2. Adapt Pulsar distribution logic so that we no longer include Pulsar
>> > SQL libs.
>> > 3. Adapt Docker build logic for the same purpose.
>> > 4. Define a release strategy for Pulsar SQL, especially the Pulsar
>> > compatibility policy.
>> > 5. Build Pulsar SQL release tarball and Docker image.
>> > 6. Wire integration tests with the new repo model.
>>
>> Regarding "6. Wire integration tests with the new repo model.", the
>> integration tests referencing Pulsar SQL should also be moved out of the
>> apache/pulsar repository so that there's no dependency on Pulsar SQL in
>> apache/pulsar. This also applies to "3. Adapt Docker build logic for the
>> same purpose.": nothing in the apache/pulsar repository docker images
>> should depend on Pulsar SQL.
>>
>> I guess that the challenge is releasing something that is equivalent to
>> the current "pulsar-all" docker image. It should not be handled in the
>> apache/pulsar repository. We would need a new repository
>> "pulsar-all-distribution" (or "pulsar-distribution" to make the name
>> shorter). That repository could include the docker building logic and the
>> integration tests that require the pulsar-all docker image that includes
>> both Pulsar SQL and Pulsar IO Connectors.
>>
>> It might not be an optimal solution to run the integration tests against
>> the pulsar-all image only in pulsar-all-distribution. It would be better
>> if the Pulsar SQL integration tests could run in the pulsar-sql repository
>> with a docker image that can be quickly built to support Pulsar SQL
>> integration tests. A possible solution could be that the same integration
>> tests are also run against the actual pulsar-all docker image in the
>> "pulsar-distribution" repository to ensure that the tests also pass in
>> full integration. This might be useful for validating the pulsar-all
>> docker image release.
>>
>> We would also like to remove Pulsar IO from apache/pulsar and move it to
>> a separate repository. This goal should be considered when we start
>> making the changes.
>>
>> To summarize: when we remove Pulsar SQL from apache/pulsar, it also
>> means that the pulsar-all docker image is no longer built as part of
>> apache/pulsar builds and requires a replacement (or dropping the
>> pulsar-all docker image completely).
>>
>> -Lari
>>
>>
>> On 2022/07/15 12:11:32 tison wrote:
>> > Hi devs,
>> >
>> > *Background*
>> >
>> > In the past few weeks I have spent some time on Pulsar SQL, or call it
>> > the Pulsar Trino connector.
>> >
>> > I noticed that in the Trino community our committer Marvin (@xxc) once
>> > submitted a PR to contribute the connector upstream[1]. However, due to
>> > the huge version gap and lack of time to spend on that topic, it stalled
>> > about ten months ago.
>> >
>> > At that time the latest Pulsar version was 2.8.0, while we're preparing
>> > 2.11.0 now. Besides, in the Pulsar main repo we have made several changes
>> > to the connector, especially adding Decimal support, which is
>> > incompatible with the upstream changes.
>> >
>> > Then I had the idea of bumping the Trino version in order to migrate the
>> > code from the Pulsar side[2]. Matteo reached out to me to say that it
>> > would be worthwhile to remove the hard dependency on Presto and, if it is
>> > going to take a long time to get Trino to accept and merge it, to move
>> > the Presto connector to a separate repository, still within the Pulsar
>> > project.
>> >
>> > I prototyped this idea this week[3] and gained more insight into it. So
>> > I'd like to start this discussion to find other contributors interested
>> > in this topic, figure out whether moving Pulsar SQL to a separate
>> > repository is a good idea, and discuss a few concrete challenges.
>> >
>> > *Motivation*
>> >
>> > The strongest motivation to move Pulsar SQL to a separate repository is
>> > that even if a Pulsar user never uses Pulsar SQL, those libs are included
>> > in the release tarball and take up over 50% of the distribution's space.
>> > The same is true of the Pulsar Docker image.
>> >
>> > Another motivation is that this move lets us simplify the codebase of the
>> > main repo. I found that the repos pulsar-connectors[4] and
>> > pulsar-presto[5] exist, but we didn't push them forward.
>> >
>> > The Pulsar SQL modules are relatively stable and deserve a life of their
>> > own.
>> >
>> > *Scope*
>> >
>> > The technical part of moving Pulsar SQL to a separate repository
>> > includes:
>> >
>> > 1. Moving out the source code.
>> > 2. Adapt Pulsar distribution logic so that we no longer include Pulsar
>> > SQL libs.
>> > 3. Adapt Docker build logic for the same purpose.
>> > 4. Define a release strategy for Pulsar SQL, especially the Pulsar
>> > compatibility policy.
>> > 5. Build Pulsar SQL release tarball and Docker image.
>> > 6. Wire integration tests with the new repo model.
>> >
>> > This involves several non-trivial pieces of work. I'll go through them
>> > one by one in the next section.
>> >
>> > *Challenges*
>> >
>> > Fortunately, moving out the source code and generating a distribution can
>> > be done trivially based on our existing codebase. The prototype[3]
>> > achieves this goal, though it will take some effort to make the
>> > distribution usable out of the box.
>> >
>> > > Adapt Pulsar distribution logic so that we no longer include Pulsar
>> > > SQL libs.
>> >
>> > We can remove the modules and the related packaging logic. However, we
>> > may retain the "sql-worker" and "sql" commands in the `pulsar` script so
>> > that if users download the Pulsar SQL tarball and extract it under
>> > lib/presto, the experience won't change. This is similar to how users
>> > download connectors and place them under the connectors folder.
>> >
>> > > Adapt Docker build logic for the same purpose.
>> >
>> > I'm unfamiliar with Docker, so I don't even know the current logic. The
>> > most challenging part should be the test images (tests/docker-images).
>> >
>> > > Define a release strategy for Pulsar SQL, especially the Pulsar
>> > > compatibility policy.
>> >
>> > We can release Pulsar SQL (possibly renamed to Pulsar Trino Connector)
>> > starting from 1.0. Doing simultaneous releases across multiple repos is
>> > quite clumsy.
>> >
>> > ***Also, this proposal may require a major version bump as it makes
>> > breaking changes.***
>> >
>> > In the prototype, I have to compile Pulsar 2.11.0-SNAPSHOT instead of
>> > setting the Pulsar version to 2.10.1, because the latest Pulsar SQL uses
>> > LedgerOffloaderStats, which is not yet released.
>> >
>> > This is the biggest challenge that stops me from moving out Pulsar SQL.
>> > If the development is tightly coupled, developing and debugging across
>> > repositories would be a nightmare :(
>> >
>> > > Wire integration tests with the new repo model.
>> >
>> > So far, the prototype cannot pass all tests. Failing tests include:
>> >
>> > * Tests that depend on MockedPulsarServiceBaseTest.
>> >
>> > Besides, the integration tests in Pulsar (CI - System - SQL) will break
>> > after moving out Pulsar SQL. As described above, I don't know how to
>> > adjust the test Docker image to overcome this issue.
>> >
>> > Do you think it's a good idea to move Pulsar SQL to a separate
>> > repository? Did you run into something less than awesome because Pulsar
>> > SQL is inside the main repo?
>> >
>> > Looking forward to your feedback on this topic. And perhaps you can help
>> > figure out why the tests that depend on MockedPulsarServiceBaseTest fail
>> > on the prototype[3].
>> >
>> > Best,
>> > tison.
>> >
>> > [1] https://github.com/trinodb/trino/pull/8020
>> > [2] https://github.com/apache/pulsar/pull/16494
>> > [3] https://github.com/tisonkun/pulsar-trino
>> > [4] https://github.com/apache/pulsar-connectors
>> > [5] https://github.com/apache/pulsar-presto
>> >
>> >