Hi Lari,

Thanks for your feedback!

A quick update is that unit test now passed in the prototype repo[1].

Reading the content above, I may phrase the major work in this direction as:

* Package Pulsar and Pulsar SQL correctly both in all and separated.
* Upgrade PrestoSQL version so that we can get rid of old version of
dependencies.

Although with the prototype, it seems simply moving out the codebase
doesn't break code, but we still meet challenges to retain integration
tests and to package artifacts correctly.

Instead of firstly creating a new repository and leave it uncompleted
later, since I don't see a block to improve packaging logics and bumping
version in the main repo, I will try to push this work in the main repo at
first. And when we reach a situation that the packaging process are already
loose coupled, we can move out those code trivially.

Besides that, pulsar presto distribution is baked in image "pulsar",
"pulsar-all" adds builtin connectors and offloaders.

Respect to the integration tests, I cannot run integration tests locally
yet :/

Best,
tison.

[1] https://github.com/tisonkun/pulsar-trino


Lari Hotari <lhot...@apache.org> 于2022年7月18日周一 15:11写道:

> Thanks for picking up this task. The decision to move Pulsar SQL out of
> apache/pulsar repository has been made over 2 years ago in April 2020 with
> PIP-62,
> https://github.com/apache/pulsar/wiki/PIP-62%3A-Move-connectors%2C-adapters-and-Pulsar-Presto-to-separate-repositories
> . It's not only about moving Pulsar SQL out of apache/pulsar repository,
> but also includes Pulsar IO connectors and Pulsar adapters (already moved
> to https://github.com/apache/pulsar-adapters).
>
> > 1. Moving out the source code.
> > 2. Adapt Pulsar distribution logic so that we no longer includes Pulsar
> SQL
> > libs.
> > 3. Adapt Docker build logic for the same purpose.
> > 4. Define a release strategy for Pulsar SQL especially Pulsar
> compatibility
> > policy.
> > 5. Build Pulsar SQL release tarball and Docker image.
> > 6. Wire integration tests with the new repo model.
>
> regarding "6. Wire integration tests with the new repo model.", the
> integration tests referencing Pulsar SQL should also be moved out of
> apache/pulsar repository so that there's no dependency on Pulsar SQL in
> apache/pulsar. This also applies to "3. Adapt Docker build logic for the
> same purpose.", nothing in apache/pulsar repository docker images should
> depend on Pulsar SQL.
>
> I guess that the challenge is releasing something that is equivalent to
> the current "pulsar-all" docker image. It should not be handled in the
> apache/pulsar repository. We would need a new repository
> "pulsar-all-distribution" (or "pulsar-distribution" to make the name
> shorter). That repository could include the docker building logic and the
> integration tests that require the pulsar-all docker image that includes
> both Pulsar SQL and Pulsar IO Connectors.
>
> It might not be an optimal solution to run the integration tests against
> pulsar-all image only in the pulsar-all-distribution. It would be better if
> Pulsar SQL integration tests could run in the pulsar-sql repository and run
> with a docker image that can be quickly built to support Pulsar SQL
> integration tests. A possible solution could be that the same integration
> tests are run against the actual pulsar-all docker image in the
> "pulsar-distribution" repository to ensure that the tests also pass in full
> integration. This might be useful for validating the pulsar-all docker
> image release.
>
> We also would like to remove Pulsar IO from apache/pulsar and move it to a
> separate repository. This goal should be considered when we start making
> the changes.
>
> To summarize: when we are removing Pulsar SQL from apache/pulsar, it also
> means that the pulsar-all docker image is no more built as part of
> apache/pulsar builds and it requires a replacement (or dropping pulsar-all
> docker image completely).
>
> -Lari
>
>
> On 2022/07/15 12:11:32 tison wrote:
> > Hi devs,
> >
> > *Backgorund*
> >
> > In the past few weeks I pay some time on Pulsar SQL or name it Pulsar
> Trino
> > connector.
> >
> > I noticed in Trino community our committer Marvin (@xxc) ever submitted a
> > PR to contribute the connector to upstream[1]. However, due to the huge
> > version gap and lack of time to spend on that topic, it stalls about ten
> > months ago.
> >
> > At the moment the latest Pulsar version was 2.8.0 while we're preparing
> > 2.11.0 now. Besides, in Pulsar main repo we make several changes for the
> > connector, especially add Decimal support incompatible with upstream
> > changes.
> >
> > Then I have an idea to bump the version of Trino to migrate code from
> > Pulsar side[2]. Matteo reached out to me that it can be a worthy work
> > to remove the hard-dependency on Presto and if it’s going to take longer
> > time to get Trino to accept and merge it, move the Presto connector to a
> > separate repository, still within the Pulsar project.
> >
> > I try to prototype the work of this idea this week[3] and get more
> insights
> > about it. So I'd like to start this discussion to find other contributors
> > interested in this topic, figure out if moving Pulsar SQL to a separated
> > repository is a good idea, and discuss a few concrete challenges.
> >
> > *Motivation*
> >
> > The strongest motivation to move Pulsar SQL to a separated repository is
> > that even if a Pulsar user never uses Pulsar SQL, those libs are included
> > in the release tarball and take over 50% space of the distribution. This
> is
> > similar to the Pulsar Docker image.
> >
> > Another motivation is that we can simplify the codebase of main repo with
> > this movement. I found repo pulsar-connectors[4] and pulsar-presto[5]
> > exists but we didn't push them forward.
> >
> > Pulsar SQL modules are relatively stable and deserved for its own life.
> >
> > *Scope*
> >
> > The technique part for moving Pulsar SQL to a separated repository
> includes:
> >
> > 1. Moving out the source code.
> > 2. Adapt Pulsar distribution logic so that we no longer includes Pulsar
> SQL
> > libs.
> > 3. Adapt Docker build logic for the same purpose.
> > 4. Define a release strategy for Pulsar SQL especially Pulsar
> compatibility
> > policy.
> > 5. Build Pulsar SQL release tarball and Docker image.
> > 6. Wire integration tests with the new repo model.
> >
> > It contains several non-trivial works. I'll show you one by one in the
> next
> > section.
> >
> > *Challenges*
> >
> > Fortunately, moving out the source code and generate a distribution can
> be
> > done trivially based on our existing codebase. The prototype[3] achieves
> > this goal, while it will take some efforts to use the distribution
> > out-of-box.
> >
> > > Adapt Pulsar distribution logic so that we no longer includes Pulsar
> SQL
> > libs.
> >
> > We can remove the modules and related packaging logic. However, we may
> > retain the "sql-worker" and "sql" command in `pulsar` script so that if
> > users download Pulsar SQL tarball and extract them under lib/presto, the
> > experience won't change. This is similar to how users download connectors
> > and place them under connectors folder.
> >
> > > Adapt Docker build logic for the same purpose.
> >
> > I'm unfamiliar with Docker so I don't even know current logics. The most
> > challenging part should be the test images (tests/docker-images).
> >
> > > Define a release strategy for Pulsar SQL especially Pulsar
> compatibility
> > policy.
> >
> > We can release Pulsar SQL (may be renamed to Pulsar Trino Connector) from
> > 1.0. It's quite clumsy to do simultaneous release among multiple repos.
> >
> > ***Also, this proposal can require a major version bump as it makes
> > breaking changes.***
> >
> > In the prototype, I have to compile Pulsar 2.11.0-SNAPSHOT instead of
> setup
> > Pulsar version to 2.10.1, because latest Pulsar SQL
> > uses LedgerOffloaderStats which is unreleased yet.
> >
> > This is biggest challenge that stops me from moving out Pulsar SQL. If
> the
> > development is tightly coupled, developing and debugging cross
> repositories
> > would be a nightmare :(
> >
> > > Wire integration tests with the new repo model.
> >
> > So far, the prototype cannot pass all tests. Failed tests include:
> >
> > * Tests depends on MockedPulsarServiceBaseTest.
> >
> > Besides, integration tests in Pulsar (CI - System - SQL) will break after
> > moving out Pulsar SQL. As described above, I don't know how to adjust the
> > test Docker image to overcome this issue.
> >
> > Do you think it's a good idea to move Pulsar SQL to a separated
> repository?
> > Did you meet something less than awesome because Pulsar SQL is inside the
> > main repo?
> >
> > Looking forward to your feedback on this topic. And perhaps you can help
> > figure out why tests depends on MockedPulsarServiceBaseTest failed on the
> > prototype[3].
> >
> > Best,
> > tison.
> >
> > [1] https://github.com/trinodb/trino/pull/8020
> > [2] https://github.com/apache/pulsar/pull/16494
> > [3] https://github.com/tisonkun/pulsar-trino
> > [4] https://github.com/apache/pulsar-connectors
> > [5] https://github.com/apache/pulsar-presto
> >
>

Reply via email to