Hi devs,

*Background*

Over the past few weeks I have spent some time on Pulsar SQL, also known as
the Pulsar Trino connector.

I noticed that our committer Marvin (@xxc) once submitted a PR to the Trino
community to contribute the connector upstream[1]. However, due to the
large version gap and a lack of time to spend on it, the PR stalled about
ten months ago.

At that time the latest Pulsar version was 2.8.0, while we are now
preparing 2.11.0. Besides, we have made several changes to the connector in
the Pulsar main repo, notably adding Decimal support that is incompatible
with the upstream changes.

So I had the idea of bumping the Trino version so that the code can be
migrated from the Pulsar side[2]. Matteo reached out to tell me that it
would be worthwhile to remove the hard dependency on Presto and, if it is
going to take a long time for Trino to accept and merge the connector, to
move the Presto connector to a separate repository, still within the Pulsar
project.

I prototyped this idea this week[3] and gained more insight into it. So I'd
like to start this discussion to find other contributors interested in this
topic, figure out whether moving Pulsar SQL to a separate repository is a
good idea, and discuss a few concrete challenges.

*Motivation*

The strongest motivation for moving Pulsar SQL to a separate repository is
that even if a Pulsar user never uses Pulsar SQL, its libs are included in
the release tarball and take up over 50% of the distribution's size. The
same applies to the Pulsar Docker image.

Another motivation is that this move simplifies the codebase of the main
repo. I found that the pulsar-connectors[4] and pulsar-presto[5] repos
already exist, but we never pushed them forward.

The Pulsar SQL modules are relatively stable and deserve a life of their own.

*Scope*

The technical work of moving Pulsar SQL to a separate repository includes:

1. Move out the source code.
2. Adapt Pulsar distribution logic so that it no longer includes Pulsar SQL
libs.
3. Adapt Docker build logic for the same purpose.
4. Define a release strategy for Pulsar SQL, especially the Pulsar
compatibility policy.
5. Build the Pulsar SQL release tarball and Docker image.
6. Wire integration tests with the new repo model.

Several of these items are non-trivial. I'll go through them one by one in
the next section.

*Challenges*

Fortunately, moving out the source code and generating a distribution is
straightforward with our existing codebase. The prototype[3] achieves this,
although it will take some effort to make the distribution usable out of
the box.

> Adapt Pulsar distribution logic so that it no longer includes Pulsar SQL
> libs.

We can remove the modules and the related packaging logic. However, we may
retain the "sql-worker" and "sql" commands in the `pulsar` script so that if
users download the Pulsar SQL tarball and extract it under lib/presto, the
experience won't change. This is similar to how users download connectors
and place them under the connectors folder.

> Adapt Docker build logic for the same purpose.

I'm unfamiliar with Docker, so I don't know how the current logic works. The
most challenging part is likely the test images (tests/docker-images).

> Define a release strategy for Pulsar SQL, especially the Pulsar
> compatibility policy.

We can release Pulsar SQL (possibly renamed to Pulsar Trino Connector)
starting from 1.0. Doing simultaneous releases across multiple repos would
be quite clumsy.

***Also, this proposal may require a major version bump, as it makes
breaking changes.***

In the prototype, I have to compile Pulsar 2.11.0-SNAPSHOT instead of
setting the Pulsar version to 2.10.1, because the latest Pulsar SQL uses
LedgerOffloaderStats, which has not been released yet.

This is the biggest challenge that stops me from moving out Pulsar SQL. If
the development is tightly coupled, developing and debugging across
repositories would be a nightmare :(

> Wire integration tests with the new repo model.

So far, the prototype cannot pass all tests. Failing tests include:

* Tests that depend on MockedPulsarServiceBaseTest.

Besides, integration tests in Pulsar (CI - System - SQL) will break after
moving out Pulsar SQL. As described above, I don't know how to adjust the
test Docker image to overcome this issue.

Do you think it's a good idea to move Pulsar SQL to a separate repository?
Have you run into anything less than awesome because Pulsar SQL is inside
the main repo?

Looking forward to your feedback on this topic. And perhaps you can help
figure out why the tests that depend on MockedPulsarServiceBaseTest fail on
the prototype[3].

Best,
tison.

[1] https://github.com/trinodb/trino/pull/8020
[2] https://github.com/apache/pulsar/pull/16494
[3] https://github.com/tisonkun/pulsar-trino
[4] https://github.com/apache/pulsar-connectors
[5] https://github.com/apache/pulsar-presto
