I have prepared a patch[1] that upgrades the PrestoSQL dependency to Trino.

If several PMC members can pay attention to this part of the work, then
regardless of whether a PIP is needed, we can use the presto-to-trino rename
as an opportunity to move `conf/presto` to `trino/conf`, and `lib/presto/*`
to `trino/*`.

This gathers all the Pulsar SQL pieces scattered across the PULSAR_HOME
folder into one place, so that we can later move out the Pulsar SQL
components smoothly.
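
A minimal Python sketch of that relocation might look like the following. It
is only an illustration of the proposed layout, not part of the patch: the
function name `relocate_sql_layout` is hypothetical, and it assumes that
`conf/presto` and `lib/presto` sit directly under PULSAR_HOME and that
nothing already occupies the target paths under `trino/`.

```python
import shutil
from pathlib import Path


def relocate_sql_layout(pulsar_home: Path) -> list:
    """Move conf/presto -> trino/conf and lib/presto/* -> trino/*.

    Hypothetical helper (not part of Pulsar). Returns the
    (source, destination) pairs that were moved.
    """
    trino_root = pulsar_home / "trino"
    trino_root.mkdir(exist_ok=True)
    moves = []

    old_conf = pulsar_home / "conf" / "presto"
    if old_conf.is_dir():
        moves.append((old_conf, trino_root / "conf"))

    old_lib = pulsar_home / "lib" / "presto"
    if old_lib.is_dir():
        # Each entry directly under lib/presto lands directly under trino/.
        for entry in sorted(old_lib.iterdir()):
            moves.append((entry, trino_root / entry.name))

    for src, dst in moves:
        # Refuse to overwrite or merge: fail loudly if the target exists.
        if dst.exists():
            raise FileExistsError(f"refusing to overwrite {dst}")
        shutil.move(str(src), str(dst))

    if old_lib.is_dir():
        old_lib.rmdir()  # empty once every entry has been moved
    return moves
```

Calling it with the PULSAR_HOME path as a `Path` performs the move in place.
The sketch stops at the first conflicting target, so a failure may leave the
folder partially moved.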

Best,
tison.

[1] https://github.com/apache/pulsar/pull/16683


tison <wander4...@gmail.com> wrote on Tue, Jul 19, 2022 at 16:14:

> Hi Lari,
>
> Thanks for your feedback!
>
> A quick update: the unit tests now pass in the prototype repo[1].
>
> Reading the content above, I would summarize the major work in this
> direction as:
>
> * Package Pulsar and Pulsar SQL correctly, both bundled together and as
> separate artifacts.
> * Upgrade the PrestoSQL version so that we can get rid of old versions of
> dependencies.
>
> Although the prototype suggests that simply moving out the codebase
> doesn't break the code, we still face challenges in retaining the
> integration tests and packaging the artifacts correctly.
>
> Rather than first creating a new repository and then leaving it
> incomplete, and since I don't see anything blocking us from improving the
> packaging logic and bumping the version in the main repo, I will try to
> push this work in the main repo first. Once the packaging process is
> loosely coupled, we can move that code out trivially.
>
> Besides that, the Pulsar Presto distribution is baked into the "pulsar"
> image; "pulsar-all" adds the built-in connectors and offloaders.
>
> With respect to the integration tests, I cannot run them locally yet :/
>
> Best,
> tison.
>
> [1] https://github.com/tisonkun/pulsar-trino
>
>
> Lari Hotari <lhot...@apache.org> wrote on Mon, Jul 18, 2022 at 15:11:
>
>> Thanks for picking up this task. The decision to move Pulsar SQL out of
>> the apache/pulsar repository was made over 2 years ago, in April 2020,
>> with PIP-62,
>> https://github.com/apache/pulsar/wiki/PIP-62%3A-Move-connectors%2C-adapters-and-Pulsar-Presto-to-separate-repositories
>> . It's not only about moving Pulsar SQL out of apache/pulsar repository,
>> but also includes Pulsar IO connectors and Pulsar adapters (already moved
>> to https://github.com/apache/pulsar-adapters).
>>
>> > 1. Moving out the source code.
>> > 2. Adapt Pulsar distribution logic so that we no longer include Pulsar SQL
>> > libs.
>> > 3. Adapt Docker build logic for the same purpose.
>> > 4. Define a release strategy for Pulsar SQL, especially the Pulsar
>> > compatibility policy.
>> > 5. Build Pulsar SQL release tarball and Docker image.
>> > 6. Wire integration tests with the new repo model.
>>
>> regarding "6. Wire integration tests with the new repo model.", the
>> integration tests referencing Pulsar SQL should also be moved out of
>> apache/pulsar repository so that there's no dependency on Pulsar SQL in
>> apache/pulsar. This also applies to "3. Adapt Docker build logic for the
>> same purpose.", nothing in apache/pulsar repository docker images should
>> depend on Pulsar SQL.
>>
>> I guess that the challenge is releasing something that is equivalent to
>> the current "pulsar-all" docker image. It should not be handled in the
>> apache/pulsar repository. We would need a new repository
>> "pulsar-all-distribution" (or "pulsar-distribution" to make the name
>> shorter). That repository could include the docker building logic and the
>> integration tests that require the pulsar-all docker image that includes
>> both Pulsar SQL and Pulsar IO Connectors.
>>
>> It might not be an optimal solution to run the integration tests against
>> pulsar-all image only in the pulsar-all-distribution. It would be better if
>> Pulsar SQL integration tests could run in the pulsar-sql repository and run
>> with a docker image that can be quickly built to support Pulsar SQL
>> integration tests. A possible solution could be that the same integration
>> tests are run against the actual pulsar-all docker image in the
>> "pulsar-distribution" repository to ensure that the tests also pass in full
>> integration. This might be useful for validating the pulsar-all docker
>> image release.
>>
>> We also would like to remove Pulsar IO from apache/pulsar and move it to
>> a separate repository. This goal should be considered when we start making
>> the changes.
>>
>> To summarize: when we are removing Pulsar SQL from apache/pulsar, it also
>> means that the pulsar-all docker image is no longer built as part of
>> apache/pulsar builds and it requires a replacement (or dropping pulsar-all
>> docker image completely).
>>
>> -Lari
>>
>>
>> On 2022/07/15 12:11:32 tison wrote:
>> > Hi devs,
>> >
>> > *Background*
>> >
>> > Over the past few weeks I have spent some time on Pulsar SQL, which
>> > could also be called the Pulsar Trino connector.
>> >
>> > I noticed that in the Trino community our committer Marvin (@xxc) once
>> > submitted a PR to contribute the connector upstream[1]. However, due to
>> > the huge version gap and a lack of time to spend on the topic, it
>> > stalled about ten months ago.
>> >
>> > At that time the latest Pulsar version was 2.8.0, while we're now
>> > preparing 2.11.0. Besides, in the Pulsar main repo we have made several
>> > changes to the connector, notably adding Decimal support, that are
>> > incompatible with the upstream changes.
>> >
>> > So I had the idea of bumping the Trino version and migrating the code
>> > from the Pulsar side[2]. Matteo reached out to me to say that it would
>> > be worthwhile to remove the hard dependency on Presto and, if it is
>> > going to take a long time for Trino to accept and merge the connector,
>> > to move the Presto connector to a separate repository, still within the
>> > Pulsar project.
>> >
>> > I tried prototyping this idea this week[3] and gained more insight into
>> > it. So I'd like to start this discussion to find other contributors
>> > interested in this topic, figure out if moving Pulsar SQL to a separate
>> > repository is a good idea, and discuss a few concrete challenges.
>> >
>> > *Motivation*
>> >
>> > The strongest motivation to move Pulsar SQL to a separate repository is
>> > that even if a Pulsar user never uses Pulsar SQL, those libs are
>> > included in the release tarball and take up over 50% of the
>> > distribution's size. The same applies to the Pulsar Docker image.
>> >
>> > Another motivation is that this move lets us simplify the codebase of
>> > the main repo. I found that the repos pulsar-connectors[4] and
>> > pulsar-presto[5] exist, but we didn't push them forward.
>> >
>> > The Pulsar SQL modules are relatively stable and deserve a life of
>> > their own.
>> >
>> > *Scope*
>> >
>> > The technical work of moving Pulsar SQL to a separate repository
>> > includes:
>> >
>> > 1. Moving out the source code.
>> > 2. Adapt Pulsar distribution logic so that we no longer include Pulsar SQL
>> > libs.
>> > 3. Adapt Docker build logic for the same purpose.
>> > 4. Define a release strategy for Pulsar SQL, especially the Pulsar
>> > compatibility policy.
>> > 5. Build Pulsar SQL release tarball and Docker image.
>> > 6. Wire integration tests with the new repo model.
>> >
>> > This involves several non-trivial pieces of work. I'll go through them
>> > one by one in the next section.
>> >
>> > *Challenges*
>> >
>> > Fortunately, moving out the source code and generating a distribution
>> > can be done trivially based on our existing codebase. The prototype[3]
>> > achieves this goal, although it will take some effort to make the
>> > distribution usable out of the box.
>> >
>> > > Adapt Pulsar distribution logic so that we no longer include Pulsar SQL
>> > > libs.
>> >
>> > We can remove the modules and the related packaging logic. However, we
>> > may retain the "sql-worker" and "sql" commands in the `pulsar` script so
>> > that if users download the Pulsar SQL tarball and extract it under
>> > lib/presto, the experience won't change. This is similar to how users
>> > download connectors and place them under the connectors folder.
>> >
>> > > Adapt Docker build logic for the same purpose.
>> >
>> > I'm unfamiliar with Docker, so I don't know how the current logic
>> > works. The most challenging part will probably be the test images
>> > (tests/docker-images).
>> >
>> > > Define a release strategy for Pulsar SQL, especially the Pulsar
>> > > compatibility policy.
>> >
>> > We can start releasing Pulsar SQL (perhaps renamed to Pulsar Trino
>> > Connector) from 1.0. It would be quite clumsy to do simultaneous
>> > releases across multiple repos.
>> >
>> > ***Also, this proposal can require a major version bump as it makes
>> > breaking changes.***
>> >
>> > In the prototype, I have to compile against Pulsar 2.11.0-SNAPSHOT
>> > instead of setting the Pulsar version to 2.10.1, because the latest
>> > Pulsar SQL uses LedgerOffloaderStats, which has not been released yet.
>> >
>> > This is the biggest challenge that stops me from moving out Pulsar SQL.
>> > If the development is tightly coupled, developing and debugging across
>> > repositories would be a nightmare :(
>> >
>> > > Wire integration tests with the new repo model.
>> >
>> > So far, the prototype cannot pass all tests. Failing tests include:
>> >
>> > * Tests that depend on MockedPulsarServiceBaseTest.
>> >
>> > Besides, integration tests in Pulsar (CI - System - SQL) will break
>> > after moving out Pulsar SQL. As described above, I don't know how to
>> > adjust the test Docker image to overcome this issue.
>> >
>> > Do you think it's a good idea to move Pulsar SQL to a separate
>> > repository? Have you run into anything less than awesome because Pulsar
>> > SQL is inside the main repo?
>> >
>> > Looking forward to your feedback on this topic. And perhaps you can
>> > help figure out why the tests that depend on MockedPulsarServiceBaseTest
>> > fail on the prototype[3].
>> >
>> > Best,
>> > tison.
>> >
>> > [1] https://github.com/trinodb/trino/pull/8020
>> > [2] https://github.com/apache/pulsar/pull/16494
>> > [3] https://github.com/tisonkun/pulsar-trino
>> > [4] https://github.com/apache/pulsar-connectors
>> > [5] https://github.com/apache/pulsar-presto
>> >
>>
>
