Re: [DISCUSS] PIP-208: HTTP Sink

Christophe Bornet Wed, 19 Oct 2022 03:25:47 -0700

I just tested and the HTTP Sink works fine with the ClickHouse HTTP ingest
API.


Create table in ClickHouse:
CREATE TABLE test
(
    col1 String
) ENGINE = MergeTree()
PRIMARY KEY (col1)

Create Sink:
bin/pulsar-admin sinks create -a
pulsar-io/http/target/pulsar-io-http-2.11.0-SNAPSHOT.nar -i my-topic --name
my-sink --sink-config '{"url": "
http://localhost:8123?query=INSERT%20INTO%20default.test%20FORMAT%20JSONEachRow
"}'

Produce message:
bin/pulsar-client produce -vs
'json:{"name":"test","type":"record","fields":[{"name":"col1","type":"string"}]}'
my-topic -m '{"col1":"some-value"}'

Best regards.

Christophe

Le mar. 27 sept. 2022 à 12:31, tison <wander4...@gmail.com> a écrit :

> Hi Christophe,
>
> Thanks for starting this proposal. It looks cool.
>
> I'd suggest one real-world integration test you can make use of:
> https://clickhouse.com/docs/en/integrations/kafka/kafka-connect-http
> (replace source kafka with pulsar).
>
> Best,
> tison.
>
>
> Enrico Olivelli <eolive...@gmail.com> 于2022年9月27日周二 18:04写道：
>
> > Thanks for your answers.
> > I am fine with the current proposal.
> > We can enhance it as follow up work
> >
> > Enrico
> >
> > Il giorno ven 23 set 2022 alle ore 19:20 Christophe Bornet
> > <bornet.ch...@gmail.com> ha scritto:
> > >
> > > Thanks for your feedback Enrico.
> > > My answers to your comments below
> > >
> > > BR
> > >
> > > Christophe
> > >
> > > Le mar. 20 sept. 2022 à 14:16, Enrico Olivelli <eolive...@gmail.com> a
> > > écrit :
> > >
> > > > Christophe,
> > > > very good initiative!
> > > >
> > > > I support it
> > > > Some comments inline below
> > > >
> > > >
> > > > Enrico
> > > >
> > > > Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
> > > > <bornet.ch...@gmail.com> ha scritto:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > I have drafted PIP-208: HTTP Sink
> > > > >
> > > > > PIP link:
> > > > > https://github.com/apache/pulsar/issues/17719
> > > > >
> > > > > Here's a copy of the contents of the GH issue for your references:
> > > > >
> > > > > ### Motivation
> > > > >
> > > > > Currently, when you want to consume from Pulsar topics in
> > applications
> > > > > written in languages that don't have a Pulsar driver supported, you
> > need
> > > > to
> > > > > run some type of proxy like the WebSocket Proxy or Pulsar Beam. In
> > > > > production this needs additional effort to deploy, scale, load
> > balance,
> > > > > monitor, and so on...
> > > > > Pulsar IO is a framework that deals with all these operational
> > subjects
> > > > and
> > > > > can be leveraged to provide a way to push messages to external
> > systems
> > > > > using HTTP, a protocol supported by every existing language and OS.
> > > > >
> > > > > ### Goal
> > > > >
> > > > > This proposal defines an HTTP Sink that sends the messages to a
> > > > configured
> > > > > URL.
> > > > > It takes inspiration from [Pulsar Beam](
> > > > > https://github.com/kafkaesque-io/pulsar-beam) and the [Confluent
> > HTTP
> > > > Sink
> > > > > connector](
> > > > >
> > https://docs.confluent.io/kafka-connectors/http/current/overview.html).
> > > > >
> > > > >
> > > > > ### Implementation
> > > > >
> > > > > A `pulsar-io-http` module will be added to `pulsar-io`.
> > > > > On building the project `pulsar-io-http-{version}.nar` will be
> built
> > and
> > > > > added to the `pulsar-all` distribution.
> > > > > The name of the Sink will be `http`.
> > > > >
> > > > > The HTTP Sink pushes records to any HTTP server with the record
> > value in
> > > > > the body of a POST method.
> > > > > The body of the HTTP request is the JSON representation of the
> record
> > > > value.
> > > >
> > > > What do you mean ?
> > > > I think that this should depend on the Schema.
> > > >
> > > > BYTES SCHEMA -> I would push the raw message payload
> > > > PRIMITIVE VALUES (long, integer, string) - > I would push the JSON
> > > > represantation
> > > > JSON SCHEMA ->  push the raw message payload
> > > > AVRO -> ?  convert to JSON ?
> > > > PROTOBUF -> ? convert to JSON ?
> > > > KEY-VALUE ?
> > > >
> > > > Probably we need some flag to define the behaviour for the non
> trivial
> > > > cases.
> > > >
> > > > The current impl chooses to serialize as JSON because it's a well
> > > supported content-type on the server frameworks.
> > > It's also to be consistent with existing HTTP Sinks such as Pulsar Bean
> > and
> > > Confluent HTTP Sink Connector.
> > > The possibility to adapt the content-type to the schema is elegant and
> > will
> > > probably result in shorter payloads (but less readable) and I think it
> > > could be done as a follow-up option.
> > > It has indeed the problem of being difficult to do for KV schema.
> > > For the content-type mappings I would do:
> > > BYTES SCHEMA -> application/octet-stream (raw bytes)
> > > PRIMITIVE VALUES (long, integer, string) - > text/plain
> > > JSON ->  application/json
> > > AVRO -> avro/binary
> > > PROTOBUF -> probably application/octet-stream ?
> > > KEY-VALUE ?
> > >
> > > Would also need to indicate the Schema-Type in the HTTP headers.
> > >
> > >
> > > >
> > > > >
> > > > > Some headers are added to the HTTP request:
> > > > > * `PulsarTopic`: the topic of the record
> > > > > * `PulsarKey`: the key of the record
> > > > > * `PulsarEventTime`: the event time of the record
> > > > > * `PulsarPublishTime`: the publish time of the record
> > > > > * `PulsarMessageId`: the ID of the message contained in the record
> > > > > * `PulsarProperties-*`: each record property is passed with the
> > property
> > > > > name prefixed by `PulsarProperties-`
> > > > >
> > > >
> > > > Can we make the "Content-Type" configurable ?
> > > >
> > > Yes we can. But do we do it for the first iteration ?
> > > If we do it, I would have an option to add some fix headers and the
> user
> > > can override the content-type.
> > > If we go for a variable content-type depending on the schema, then we
> > could
> > > have a map<SchemaType, content-type>
> > >
> > > > Can we make the HTTP METHOD configurable ?
> > > >
> > > Yes we can. But do we do it for the first iteration ?
> > >
> > > >
> > > > > ### Alternatives
> > > > >
> > > > > Creating a separated project for this Sink is rejected since:
> > > > > * this Sink is very useful for developers to test the Pulsar IO
> > > > framework,
> > > > > transform functions, and to make demos.
> > > > > * the code has a very small footprint with no external
> dependencies.
> > > > > * it should be visible at the same level as other sinks
> > > >
> > > > 100% agreed !
> > > >
> > > > >
> > > > > I'm looking forward the discussion.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Christophe Bornet
> > > >
> >
>

Re: [DISCUSS] PIP-208: HTTP Sink

Reply via email to