>
> I think we could shrink the connectors a lot by removing from the NAR
> archives dependencies that are already present in the
> pulsar-functions-instance.
>
I mean java-instance.jar

On Wed, Oct 19, 2022 at 12:29, Christophe Bornet <bornet.ch...@gmail.com>
wrote:

>> The pulsar-all docker image is pretty big. I assume we will continue
>> to build and package additional connectors. It would be great to
>> figure out how to make it smaller at some point.
>>
> I think we could shrink the connectors a lot by removing from the NAR
> archives dependencies that are already present in the
> pulsar-functions-instance.
>
> On Tue, Oct 18, 2022 at 16:51, Michael Marshall <mmarsh...@apache.org>
> wrote:
>
>> Great discussion. I have one minor comment that is tangentially related.
>>
>> > On building the project `pulsar-io-http-{version}.nar` will be built and
>> > added to the `pulsar-all` distribution.
>>
>> The pulsar-all docker image is pretty big. I assume we will continue
>> to build and package additional connectors. It would be great to
>> figure out how to make it smaller at some point.
>>
>> Thanks,
>> Michael
>>
>>
>> On Tue, Sep 27, 2022 at 9:27 AM Christophe Bornet
>> <bornet.ch...@gmail.com> wrote:
>> >
>> > Sure you can test with the Sink of my PR branch.
>> > Otherwise I'll do the test after ApacheCon.
>> >
>> > On Tue, Sep 27, 2022 at 12:57, tison <wander4...@gmail.com> wrote:
>> >
>> > > Yes. It's a potential use case for validating the implementation. If you
>> > > don't have time to try it out, I can schedule some time to demo it with a
>> > > prototype HTTP sink or after the patch gets merged :)
>> > >
>> > > Best,
>> > > tison.
>> > >
>> > >
>> > > On Tue, Sep 27, 2022 at 18:51, Christophe Bornet <bornet.ch...@gmail.com> wrote:
>> > >
>> > > > Hi Tison,
>> > > >
>> > > > Very interesting and shows the value of such an HTTP Sink.
>> > > > The Pulsar HTTP Sink should work OOTB with ClickHouse. I don't have time to
>> > > > do the test right now, so would someone want to do it?
>> > > >
>> > > > Best regards.
>> > > >
>> > > > Christophe Bornet
>> > > >
>> > > > On Tue, Sep 27, 2022 at 12:31, tison <wander4...@gmail.com> wrote:
>> > > >
>> > > > > Hi Christophe,
>> > > > >
>> > > > > Thanks for starting this proposal. It looks cool.
>> > > > >
>> > > > > I'd suggest one real-world integration test you can make use of:
>> > > > > https://clickhouse.com/docs/en/integrations/kafka/kafka-connect-http
>> > > > > (replace source kafka with pulsar).
>> > > > >
>> > > > > Best,
>> > > > > tison.
>> > > > >
>> > > > >
>> > > > > On Tue, Sep 27, 2022 at 18:04, Enrico Olivelli <eolive...@gmail.com> wrote:
>> > > > >
>> > > > > > Thanks for your answers.
>> > > > > > I am fine with the current proposal.
>> > > > > > We can enhance it as follow-up work.
>> > > > > >
>> > > > > > Enrico
>> > > > > >
>> > > > > > On Fri, Sep 23, 2022 at 19:20, Christophe Bornet
>> > > > > > <bornet.ch...@gmail.com> wrote:
>> > > > > > >
>> > > > > > > Thanks for your feedback, Enrico.
>> > > > > > > My answers to your comments are below.
>> > > > > > >
>> > > > > > > BR
>> > > > > > >
>> > > > > > > Christophe
>> > > > > > >
>> > > > > > > On Tue, Sep 20, 2022 at 14:16, Enrico Olivelli <eolive...@gmail.com> wrote:
>> > > > > > >
>> > > > > > > > Christophe,
>> > > > > > > > very good initiative!
>> > > > > > > >
>> > > > > > > > I support it
>> > > > > > > > Some comments inline below
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Enrico
>> > > > > > > >
>> > > > > > > > On Mon, Sep 19, 2022 at 19:10, Christophe Bornet
>> > > > > > > > <bornet.ch...@gmail.com> wrote:
>> > > > > > > > >
>> > > > > > > > > Hi all,
>> > > > > > > > >
>> > > > > > > > > I have drafted PIP-208: HTTP Sink
>> > > > > > > > >
>> > > > > > > > > PIP link:
>> > > > > > > > > https://github.com/apache/pulsar/issues/17719
>> > > > > > > > >
>> > > > > > > > > Here's a copy of the contents of the GH issue for your reference:
>> > > > > > > > >
>> > > > > > > > > ### Motivation
>> > > > > > > > >
>> > > > > > > > > Currently, when you want to consume from Pulsar topics in applications
>> > > > > > > > > written in languages that don't have a supported Pulsar driver, you need to
>> > > > > > > > > run some type of proxy like the WebSocket Proxy or Pulsar Beam. In
>> > > > > > > > > production this needs additional effort to deploy, scale, load balance,
>> > > > > > > > > monitor, and so on.
>> > > > > > > > > Pulsar IO is a framework that deals with all these operational subjects and
>> > > > > > > > > can be leveraged to provide a way to push messages to external systems
>> > > > > > > > > using HTTP, a protocol supported by every existing language and OS.
>> > > > > > > > >
>> > > > > > > > > ### Goal
>> > > > > > > > >
>> > > > > > > > > This proposal defines an HTTP Sink that sends the messages to a configured
>> > > > > > > > > URL.
>> > > > > > > > > It takes inspiration from [Pulsar Beam](https://github.com/kafkaesque-io/pulsar-beam)
>> > > > > > > > > and the [Confluent HTTP Sink connector](https://docs.confluent.io/kafka-connectors/http/current/overview.html).
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > ### Implementation
>> > > > > > > > >
>> > > > > > > > > A `pulsar-io-http` module will be added to `pulsar-io`.
>> > > > > > > > > On building the project, `pulsar-io-http-{version}.nar` will be built and
>> > > > > > > > > added to the `pulsar-all` distribution.
>> > > > > > > > > The name of the Sink will be `http`.
>> > > > > > > > >
>> > > > > > > > > The HTTP Sink pushes records to any HTTP server with the record value in
>> > > > > > > > > the body of a POST request.
>> > > > > > > > > The body of the HTTP request is the JSON representation of the record
>> > > > > > > > > value.
>> > > > > > > >
>> > > > > > > > What do you mean ?
>> > > > > > > > I think that this should depend on the Schema.
>> > > > > > > >
>> > > > > > > > BYTES SCHEMA -> I would push the raw message payload
>> > > > > > > > PRIMITIVE VALUES (long, integer, string) -> I would push the JSON
>> > > > > > > > representation
>> > > > > > > > JSON SCHEMA -> push the raw message payload
>> > > > > > > > AVRO -> ? convert to JSON ?
>> > > > > > > > PROTOBUF -> ? convert to JSON ?
>> > > > > > > > KEY-VALUE ?
>> > > > > > > >
>> > > > > > > > Probably we need some flag to define the behaviour for the non-trivial
>> > > > > > > > cases.
>> > > > > > > >
>> > > > > > > The current impl chooses to serialize as JSON because it's a well
>> > > > > > > supported content-type on the server frameworks.
>> > > > > > > It's also to be consistent with existing HTTP Sinks such as Pulsar Beam and
>> > > > > > > Confluent HTTP Sink Connector.
>> > > > > > > The possibility to adapt the content-type to the schema is elegant and will
>> > > > > > > probably result in shorter payloads (but less readable) and I think it
>> > > > > > > could be done as a follow-up option.
>> > > > > > > It has indeed the problem of being difficult to do for KV schema.
>> > > > > > > For the content-type mappings I would do:
>> > > > > > > BYTES SCHEMA -> application/octet-stream (raw bytes)
>> > > > > > > PRIMITIVE VALUES (long, integer, string) -> text/plain
>> > > > > > > JSON -> application/json
>> > > > > > > AVRO -> avro/binary
>> > > > > > > PROTOBUF -> probably application/octet-stream ?
>> > > > > > > KEY-VALUE ?
>> > > > > > >
>> > > > > > > Would also need to indicate the Schema-Type in the HTTP headers.
>> > > > > > >
>> > > > > > >
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Some headers are added to the HTTP request:
>> > > > > > > > > * `PulsarTopic`: the topic of the record
>> > > > > > > > > * `PulsarKey`: the key of the record
>> > > > > > > > > * `PulsarEventTime`: the event time of the record
>> > > > > > > > > * `PulsarPublishTime`: the publish time of the record
>> > > > > > > > > * `PulsarMessageId`: the ID of the message contained in the record
>> > > > > > > > > * `PulsarProperties-*`: each record property is passed with the property
>> > > > > > > > >   name prefixed by `PulsarProperties-`
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > Can we make the "Content-Type" configurable ?
>> > > > > > > >
>> > > > > > > Yes we can. But do we do it for the first iteration?
>> > > > > > > If we do it, I would have an option to add some fixed headers and the user
>> > > > > > > can override the content-type.
>> > > > > > > If we go for a variable content-type depending on the schema, then we could
>> > > > > > > have a map<SchemaType, content-type>
>> > > > > > >
>> > > > > > > > Can we make the HTTP METHOD configurable ?
>> > > > > > > >
>> > > > > > > Yes we can. But do we do it for the first iteration ?
>> > > > > > >
>> > > > > > > >
>> > > > > > > > > ### Alternatives
>> > > > > > > > >
>> > > > > > > > > Creating a separate project for this Sink is rejected since:
>> > > > > > > > > * this Sink is very useful for developers to test the Pulsar IO framework,
>> > > > > > > > >   transform functions, and to make demos.
>> > > > > > > > > * the code has a very small footprint with no external dependencies.
>> > > > > > > > > * it should be visible at the same level as other sinks
>> > > > > > > >
>> > > > > > > > 100% agreed !
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > I'm looking forward to the discussion.
>> > > > > > > > >
>> > > > > > > > > Best regards,
>> > > > > > > > >
>> > > > > > > > > Christophe Bornet
>> > > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>>
>
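
The request described in the quoted PIP (a POST whose body is the JSON representation of the record value, plus the `Pulsar*` headers) can be sketched as follows. This is only an illustration and not the connector's code: the `Record` class, `build_headers`, and `build_request` are hypothetical names, the timestamps are assumed to be epoch milliseconds rendered as strings, and the `Content-Type: application/json` header is an assumption drawn from the discussion rather than something the PIP text spells out.

```python
import json
import urllib.request
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Record:
    """Hypothetical stand-in for a Pulsar IO record (not the real Pulsar API)."""
    topic: str
    value: Any
    message_id: str
    key: Optional[str] = None
    event_time: Optional[int] = None    # assumed: epoch milliseconds
    publish_time: Optional[int] = None  # assumed: epoch milliseconds
    properties: Dict[str, str] = field(default_factory=dict)


def build_headers(record: Record) -> Dict[str, str]:
    """Map record metadata to the header names listed in the PIP."""
    headers = {
        "Content-Type": "application/json",  # assumption, see lead-in
        "PulsarTopic": record.topic,
        "PulsarMessageId": record.message_id,
    }
    # Optional metadata is only sent when present (an assumption of this sketch).
    if record.key is not None:
        headers["PulsarKey"] = record.key
    if record.event_time is not None:
        headers["PulsarEventTime"] = str(record.event_time)
    if record.publish_time is not None:
        headers["PulsarPublishTime"] = str(record.publish_time)
    # Each record property becomes a header prefixed by "PulsarProperties-".
    for name, value in record.properties.items():
        headers["PulsarProperties-" + name] = value
    return headers


def build_request(url: str, record: Record) -> urllib.request.Request:
    """Build (but do not send) the POST request for one record."""
    body = json.dumps(record.value).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers=build_headers(record), method="POST"
    )
```

Passing the result to `urllib.request.urlopen` would send it; that step is left out here because it needs a live HTTP endpoint.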
