> > I think we could shrink the connectors a lot by removing from the NAR > archives dependencies that are already present in the > pulsar-functions-instance. > I mean java-instance.jar
Le mer. 19 oct. 2022 à 12:29, Christophe Bornet <bornet.ch...@gmail.com> a écrit : > The pulsar-all docker image is pretty big. I assume we will continue >> to build and package additional connectors. It would be great to >> figure out how to make it smaller at some point. >> > I think we could shrink the connectors a lot by removing from the NAR > archives dependencies that are already present in the > pulsar-functions-instance. > > Le mar. 18 oct. 2022 à 16:51, Michael Marshall <mmarsh...@apache.org> a > écrit : > >> Great discussion. I have one minor comment that is tangentially related. >> >> > On building the project `pulsar-io-http-{version}.nar` will be built and >> > added to the `pulsar-all` distribution. >> >> The pulsar-all docker image is pretty big. I assume we will continue >> to build and package additional connectors. It would be great to >> figure out how to make it smaller at some point. >> >> Thanks, >> Michael >> >> >> On Tue, Sep 27, 2022 at 9:27 AM Christophe Bornet >> <bornet.ch...@gmail.com> wrote: >> > >> > Sure you can test with the Sink of my PR branch. >> > Otherwise I'll do the test after ApacheCon. >> > >> > Le mar. 27 sept. 2022 à 12:57, tison <wander4...@gmail.com> a écrit : >> > >> > > Yes. It's a potential use case for validating the implementation. If >> you >> > > don't have time to try it out, I can schedule some time to demo it >> with a >> > > prototype HTTP sink or after the patch gets merged :) >> > > >> > > Best, >> > > tison. >> > > >> > > >> > > Christophe Bornet <bornet.ch...@gmail.com> 于2022年9月27日周二 18:51写道: >> > > >> > > > Hi Tison, >> > > > >> > > > Very interesting and shows the value of such a HTTP Sink. >> > > > The Pulsar HTTP Sink should work OOTB with ClickHouse. I don't have >> time >> > > to >> > > > do the test right now, so would someone want to do it ? >> > > > >> > > > Best regards. >> > > > >> > > > Christophe Bornet >> > > > >> > > > Le mar. 27 sept. 2022 à 12:31, tison <wander4...@gmail.com> a >> écrit : >> > > > >> > > > > Hi Christophe, >> > > > > >> > > > > Thanks for starting this proposal. It looks cool. >> > > > > >> > > > > I'd suggest one real-world integration test you can make use of: >> > > > > >> https://clickhouse.com/docs/en/integrations/kafka/kafka-connect-http >> > > > > (replace source kafka with pulsar). >> > > > > >> > > > > Best, >> > > > > tison. >> > > > > >> > > > > >> > > > > Enrico Olivelli <eolive...@gmail.com> 于2022年9月27日周二 18:04写道: >> > > > > >> > > > > > Thanks for your answers. >> > > > > > I am fine with the current proposal. >> > > > > > We can enhance it as follow up work >> > > > > > >> > > > > > Enrico >> > > > > > >> > > > > > Il giorno ven 23 set 2022 alle ore 19:20 Christophe Bornet >> > > > > > <bornet.ch...@gmail.com> ha scritto: >> > > > > > > >> > > > > > > Thanks for your feedback Enrico. >> > > > > > > My answers to your comments below >> > > > > > > >> > > > > > > BR >> > > > > > > >> > > > > > > Christophe >> > > > > > > >> > > > > > > Le mar. 20 sept. 2022 à 14:16, Enrico Olivelli < >> > > eolive...@gmail.com> >> > > > a >> > > > > > > écrit : >> > > > > > > >> > > > > > > > Christophe, >> > > > > > > > very good initiative! >> > > > > > > > >> > > > > > > > I support it >> > > > > > > > Some comments inline below >> > > > > > > > >> > > > > > > > >> > > > > > > > Enrico >> > > > > > > > >> > > > > > > > Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet >> > > > > > > > <bornet.ch...@gmail.com> ha scritto: >> > > > > > > > > >> > > > > > > > > Hi all, >> > > > > > > > > >> > > > > > > > > I have drafted PIP-208: HTTP Sink >> > > > > > > > > >> > > > > > > > > PIP link: >> > > > > > > > > https://github.com/apache/pulsar/issues/17719 >> > > > > > > > > >> > > > > > > > > Here's a copy of the contents of the GH issue for your >> > > > references: >> > > > > > > > > >> > > > > > > > > ### Motivation >> > > > > > > > > >> > > > > > > > > Currently, when you want to consume from Pulsar topics in >> > > > > > applications >> > > > > > > > > written in languages that don't have a Pulsar driver >> supported, >> > > > you >> > > > > > need >> > > > > > > > to >> > > > > > > > > run some type of proxy like the WebSocket Proxy or Pulsar >> Beam. >> > > > In >> > > > > > > > > production this needs additional effort to deploy, scale, >> load >> > > > > > balance, >> > > > > > > > > monitor, and so on... >> > > > > > > > > Pulsar IO is a framework that deals with all these >> operational >> > > > > > subjects >> > > > > > > > and >> > > > > > > > > can be leveraged to provide a way to push messages to >> external >> > > > > > systems >> > > > > > > > > using HTTP, a protocol supported by every existing >> language and >> > > > OS. >> > > > > > > > > >> > > > > > > > > ### Goal >> > > > > > > > > >> > > > > > > > > This proposal defines an HTTP Sink that sends the >> messages to a >> > > > > > > > configured >> > > > > > > > > URL. >> > > > > > > > > It takes inspiration from [Pulsar Beam]( >> > > > > > > > > https://github.com/kafkaesque-io/pulsar-beam) and the >> > > [Confluent >> > > > > > HTTP >> > > > > > > > Sink >> > > > > > > > > connector]( >> > > > > > > > > >> > > > > > >> > > https://docs.confluent.io/kafka-connectors/http/current/overview.html >> > > > ). >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > ### Implementation >> > > > > > > > > >> > > > > > > > > A `pulsar-io-http` module will be added to `pulsar-io`. >> > > > > > > > > On building the project `pulsar-io-http-{version}.nar` >> will be >> > > > > built >> > > > > > and >> > > > > > > > > added to the `pulsar-all` distribution. >> > > > > > > > > The name of the Sink will be `http`. >> > > > > > > > > >> > > > > > > > > The HTTP Sink pushes records to any HTTP server with the >> record >> > > > > > value in >> > > > > > > > > the body of a POST method. >> > > > > > > > > The body of the HTTP request is the JSON representation >> of the >> > > > > record >> > > > > > > > value. >> > > > > > > > >> > > > > > > > What do you mean ? >> > > > > > > > I think that this should depend on the Schema. >> > > > > > > > >> > > > > > > > BYTES SCHEMA -> I would push the raw message payload >> > > > > > > > PRIMITIVE VALUES (long, integer, string) - > I would push >> the >> > > JSON >> > > > > > > > represantation >> > > > > > > > JSON SCHEMA -> push the raw message payload >> > > > > > > > AVRO -> ? convert to JSON ? >> > > > > > > > PROTOBUF -> ? convert to JSON ? >> > > > > > > > KEY-VALUE ? >> > > > > > > > >> > > > > > > > Probably we need some flag to define the behaviour for the >> non >> > > > > trivial >> > > > > > > > cases. >> > > > > > > > >> > > > > > > > The current impl chooses to serialize as JSON because it's >> a well >> > > > > > > supported content-type on the server frameworks. >> > > > > > > It's also to be consistent with existing HTTP Sinks such as >> Pulsar >> > > > Bean >> > > > > > and >> > > > > > > Confluent HTTP Sink Connector. >> > > > > > > The possibility to adapt the content-type to the schema is >> elegant >> > > > and >> > > > > > will >> > > > > > > probably result in shorter payloads (but less readable) and I >> think >> > > > it >> > > > > > > could be done as a follow-up option. >> > > > > > > It has indeed the problem of being difficult to do for KV >> schema. >> > > > > > > For the content-type mappings I would do: >> > > > > > > BYTES SCHEMA -> application/octet-stream (raw bytes) >> > > > > > > PRIMITIVE VALUES (long, integer, string) - > text/plain >> > > > > > > JSON -> application/json >> > > > > > > AVRO -> avro/binary >> > > > > > > PROTOBUF -> probably application/octet-stream ? >> > > > > > > KEY-VALUE ? >> > > > > > > >> > > > > > > Would also need to indicate the Schema-Type in the HTTP >> headers. >> > > > > > > >> > > > > > > >> > > > > > > > >> > > > > > > > > >> > > > > > > > > Some headers are added to the HTTP request: >> > > > > > > > > * `PulsarTopic`: the topic of the record >> > > > > > > > > * `PulsarKey`: the key of the record >> > > > > > > > > * `PulsarEventTime`: the event time of the record >> > > > > > > > > * `PulsarPublishTime`: the publish time of the record >> > > > > > > > > * `PulsarMessageId`: the ID of the message contained in >> the >> > > > record >> > > > > > > > > * `PulsarProperties-*`: each record property is passed >> with the >> > > > > > property >> > > > > > > > > name prefixed by `PulsarProperties-` >> > > > > > > > > >> > > > > > > > >> > > > > > > > Can we make the "Content-Type" configurable ? >> > > > > > > > >> > > > > > > Yes we can. But do we do it for the first iteration ? >> > > > > > > If we do it, I would have an option to add some fix headers >> and the >> > > > > user >> > > > > > > can override the content-type. >> > > > > > > If we go for a variable content-type depending on the schema, >> then >> > > we >> > > > > > could >> > > > > > > have a map<SchemaType, content-type> >> > > > > > > >> > > > > > > > Can we make the HTTP METHOD configurable ? >> > > > > > > > >> > > > > > > Yes we can. But do we do it for the first iteration ? >> > > > > > > >> > > > > > > > >> > > > > > > > > ### Alternatives >> > > > > > > > > >> > > > > > > > > Creating a separated project for this Sink is rejected >> since: >> > > > > > > > > * this Sink is very useful for developers to test the >> Pulsar IO >> > > > > > > > framework, >> > > > > > > > > transform functions, and to make demos. >> > > > > > > > > * the code has a very small footprint with no external >> > > > > dependencies. >> > > > > > > > > * it should be visible at the same level as other sinks >> > > > > > > > >> > > > > > > > 100% agreed ! >> > > > > > > > >> > > > > > > > > >> > > > > > > > > I'm looking forward the discussion. >> > > > > > > > > >> > > > > > > > > Best regards, >> > > > > > > > > >> > > > > > > > > Christophe Bornet >> > > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > > Le mar. 18 oct. 2022 à 16:51, Michael Marshall <mmarsh...@apache.org> a > écrit : > >> Great discussion. I have one minor comment that is tangentially related. >> >> > On building the project `pulsar-io-http-{version}.nar` will be built and >> > added to the `pulsar-all` distribution. >> >> The pulsar-all docker image is pretty big. I assume we will continue >> to build and package additional connectors. It would be great to >> figure out how to make it smaller at some point. >> >> Thanks, >> Michael >> >> >> On Tue, Sep 27, 2022 at 9:27 AM Christophe Bornet >> <bornet.ch...@gmail.com> wrote: >> > >> > Sure you can test with the Sink of my PR branch. >> > Otherwise I'll do the test after ApacheCon. >> > >> > Le mar. 27 sept. 2022 à 12:57, tison <wander4...@gmail.com> a écrit : >> > >> > > Yes. It's a potential use case for validating the implementation. If >> you >> > > don't have time to try it out, I can schedule some time to demo it >> with a >> > > prototype HTTP sink or after the patch gets merged :) >> > > >> > > Best, >> > > tison. >> > > >> > > >> > > Christophe Bornet <bornet.ch...@gmail.com> 于2022年9月27日周二 18:51写道: >> > > >> > > > Hi Tison, >> > > > >> > > > Very interesting and shows the value of such a HTTP Sink. >> > > > The Pulsar HTTP Sink should work OOTB with ClickHouse. I don't have >> time >> > > to >> > > > do the test right now, so would someone want to do it ? >> > > > >> > > > Best regards. >> > > > >> > > > Christophe Bornet >> > > > >> > > > Le mar. 27 sept. 2022 à 12:31, tison <wander4...@gmail.com> a >> écrit : >> > > > >> > > > > Hi Christophe, >> > > > > >> > > > > Thanks for starting this proposal. It looks cool. >> > > > > >> > > > > I'd suggest one real-world integration test you can make use of: >> > > > > >> https://clickhouse.com/docs/en/integrations/kafka/kafka-connect-http >> > > > > (replace source kafka with pulsar). >> > > > > >> > > > > Best, >> > > > > tison. >> > > > > >> > > > > >> > > > > Enrico Olivelli <eolive...@gmail.com> 于2022年9月27日周二 18:04写道: >> > > > > >> > > > > > Thanks for your answers. >> > > > > > I am fine with the current proposal. >> > > > > > We can enhance it as follow up work >> > > > > > >> > > > > > Enrico >> > > > > > >> > > > > > Il giorno ven 23 set 2022 alle ore 19:20 Christophe Bornet >> > > > > > <bornet.ch...@gmail.com> ha scritto: >> > > > > > > >> > > > > > > Thanks for your feedback Enrico. >> > > > > > > My answers to your comments below >> > > > > > > >> > > > > > > BR >> > > > > > > >> > > > > > > Christophe >> > > > > > > >> > > > > > > Le mar. 20 sept. 2022 à 14:16, Enrico Olivelli < >> > > eolive...@gmail.com> >> > > > a >> > > > > > > écrit : >> > > > > > > >> > > > > > > > Christophe, >> > > > > > > > very good initiative! >> > > > > > > > >> > > > > > > > I support it >> > > > > > > > Some comments inline below >> > > > > > > > >> > > > > > > > >> > > > > > > > Enrico >> > > > > > > > >> > > > > > > > Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet >> > > > > > > > <bornet.ch...@gmail.com> ha scritto: >> > > > > > > > > >> > > > > > > > > Hi all, >> > > > > > > > > >> > > > > > > > > I have drafted PIP-208: HTTP Sink >> > > > > > > > > >> > > > > > > > > PIP link: >> > > > > > > > > https://github.com/apache/pulsar/issues/17719 >> > > > > > > > > >> > > > > > > > > Here's a copy of the contents of the GH issue for your >> > > > references: >> > > > > > > > > >> > > > > > > > > ### Motivation >> > > > > > > > > >> > > > > > > > > Currently, when you want to consume from Pulsar topics in >> > > > > > applications >> > > > > > > > > written in languages that don't have a Pulsar driver >> supported, >> > > > you >> > > > > > need >> > > > > > > > to >> > > > > > > > > run some type of proxy like the WebSocket Proxy or Pulsar >> Beam. >> > > > In >> > > > > > > > > production this needs additional effort to deploy, scale, >> load >> > > > > > balance, >> > > > > > > > > monitor, and so on... >> > > > > > > > > Pulsar IO is a framework that deals with all these >> operational >> > > > > > subjects >> > > > > > > > and >> > > > > > > > > can be leveraged to provide a way to push messages to >> external >> > > > > > systems >> > > > > > > > > using HTTP, a protocol supported by every existing >> language and >> > > > OS. >> > > > > > > > > >> > > > > > > > > ### Goal >> > > > > > > > > >> > > > > > > > > This proposal defines an HTTP Sink that sends the >> messages to a >> > > > > > > > configured >> > > > > > > > > URL. >> > > > > > > > > It takes inspiration from [Pulsar Beam]( >> > > > > > > > > https://github.com/kafkaesque-io/pulsar-beam) and the >> > > [Confluent >> > > > > > HTTP >> > > > > > > > Sink >> > > > > > > > > connector]( >> > > > > > > > > >> > > > > > >> > > https://docs.confluent.io/kafka-connectors/http/current/overview.html >> > > > ). >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > ### Implementation >> > > > > > > > > >> > > > > > > > > A `pulsar-io-http` module will be added to `pulsar-io`. >> > > > > > > > > On building the project `pulsar-io-http-{version}.nar` >> will be >> > > > > built >> > > > > > and >> > > > > > > > > added to the `pulsar-all` distribution. >> > > > > > > > > The name of the Sink will be `http`. >> > > > > > > > > >> > > > > > > > > The HTTP Sink pushes records to any HTTP server with the >> record >> > > > > > value in >> > > > > > > > > the body of a POST method. >> > > > > > > > > The body of the HTTP request is the JSON representation >> of the >> > > > > record >> > > > > > > > value. >> > > > > > > > >> > > > > > > > What do you mean ? >> > > > > > > > I think that this should depend on the Schema. >> > > > > > > > >> > > > > > > > BYTES SCHEMA -> I would push the raw message payload >> > > > > > > > PRIMITIVE VALUES (long, integer, string) - > I would push >> the >> > > JSON >> > > > > > > > represantation >> > > > > > > > JSON SCHEMA -> push the raw message payload >> > > > > > > > AVRO -> ? convert to JSON ? >> > > > > > > > PROTOBUF -> ? convert to JSON ? >> > > > > > > > KEY-VALUE ? >> > > > > > > > >> > > > > > > > Probably we need some flag to define the behaviour for the >> non >> > > > > trivial >> > > > > > > > cases. >> > > > > > > > >> > > > > > > > The current impl chooses to serialize as JSON because it's >> a well >> > > > > > > supported content-type on the server frameworks. >> > > > > > > It's also to be consistent with existing HTTP Sinks such as >> Pulsar >> > > > Bean >> > > > > > and >> > > > > > > Confluent HTTP Sink Connector. >> > > > > > > The possibility to adapt the content-type to the schema is >> elegant >> > > > and >> > > > > > will >> > > > > > > probably result in shorter payloads (but less readable) and I >> think >> > > > it >> > > > > > > could be done as a follow-up option. >> > > > > > > It has indeed the problem of being difficult to do for KV >> schema. >> > > > > > > For the content-type mappings I would do: >> > > > > > > BYTES SCHEMA -> application/octet-stream (raw bytes) >> > > > > > > PRIMITIVE VALUES (long, integer, string) - > text/plain >> > > > > > > JSON -> application/json >> > > > > > > AVRO -> avro/binary >> > > > > > > PROTOBUF -> probably application/octet-stream ? >> > > > > > > KEY-VALUE ? >> > > > > > > >> > > > > > > Would also need to indicate the Schema-Type in the HTTP >> headers. >> > > > > > > >> > > > > > > >> > > > > > > > >> > > > > > > > > >> > > > > > > > > Some headers are added to the HTTP request: >> > > > > > > > > * `PulsarTopic`: the topic of the record >> > > > > > > > > * `PulsarKey`: the key of the record >> > > > > > > > > * `PulsarEventTime`: the event time of the record >> > > > > > > > > * `PulsarPublishTime`: the publish time of the record >> > > > > > > > > * `PulsarMessageId`: the ID of the message contained in >> the >> > > > record >> > > > > > > > > * `PulsarProperties-*`: each record property is passed >> with the >> > > > > > property >> > > > > > > > > name prefixed by `PulsarProperties-` >> > > > > > > > > >> > > > > > > > >> > > > > > > > Can we make the "Content-Type" configurable ? >> > > > > > > > >> > > > > > > Yes we can. But do we do it for the first iteration ? >> > > > > > > If we do it, I would have an option to add some fix headers >> and the >> > > > > user >> > > > > > > can override the content-type. >> > > > > > > If we go for a variable content-type depending on the schema, >> then >> > > we >> > > > > > could >> > > > > > > have a map<SchemaType, content-type> >> > > > > > > >> > > > > > > > Can we make the HTTP METHOD configurable ? >> > > > > > > > >> > > > > > > Yes we can. But do we do it for the first iteration ? >> > > > > > > >> > > > > > > > >> > > > > > > > > ### Alternatives >> > > > > > > > > >> > > > > > > > > Creating a separated project for this Sink is rejected >> since: >> > > > > > > > > * this Sink is very useful for developers to test the >> Pulsar IO >> > > > > > > > framework, >> > > > > > > > > transform functions, and to make demos. >> > > > > > > > > * the code has a very small footprint with no external >> > > > > dependencies. >> > > > > > > > > * it should be visible at the same level as other sinks >> > > > > > > > >> > > > > > > > 100% agreed ! >> > > > > > > > >> > > > > > > > > >> > > > > > > > > I'm looking forward the discussion. >> > > > > > > > > >> > > > > > > > > Best regards, >> > > > > > > > > >> > > > > > > > > Christophe Bornet >> > > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> >