2020-01-26 11:42:32 UTC - Rodrigo Batista: @Rodrigo Batista has joined the 
channel
----
2020-01-26 11:49:25 UTC - Nikita Mathur: @Nikita Mathur has joined the channel
----
2020-01-27 00:24:14 UTC - Eugen: Redundant Pulsar Producers
----
2020-01-27 00:24:28 UTC - Eugen: There is a use case, not uncommon in the 
financial world, that I think is not yet satisfactorily handled by Pulsar. In 
this scenario, there is a single high-throughput source of data which, for 
reasons of redundancy, sends two copies of the exact same messages via 2 
different routes, in our case two different (sets of) UDP multicast groups. The 
messages on the different groups are identical and share a contiguous sequence 
id. That sequence id is reset every day before the stock exchange's market 
open.

The stock exchange client can completely redundantly (different network 
routers/switches/dedicated line to the exchange etc.) receive the data, and 
under normal circumstances basically discard half of the data received, as it 
is redundant. But as this is UDP, packets may arrive out of order, and there 
may be the occasional loss of a packet on either or even both streams. It would 
be great if we could have two redundant Pulsar producers that just feed the 
messages from the exchange into the Pulsar cluster, and Pulsar would take care 
of deduplication and ordering.

There are, as far as I can tell, 3 (in fact completely orthogonal) features 
missing in Pulsar for this to work:
1. a "deduplicate across producers" option, which would make sure that the last 
seen SeqId is stored and handled per topic, not per producer
2. a deduplication option that reorders incoming messages with out-of-order SeqIds
3. an admin (?) function to reset `HighestSequenceId`

So I would like to hear Pulsar developers' opinions about this, and if this is 
perhaps worth a PIP.

N.B. I have implemented the above ordering and deduplication logic twice before, 
so I'm familiar with the task; however, I'm still pretty unfamiliar with Pulsar, 
as I have only just started evaluating whether it is a good fit for us.
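The deduplication and reordering described above (features 1 and 2, plus the daily reset of feature 3) can be sketched as a small buffer that merges both redundant streams. This is my own illustration of the logic, not any existing Pulsar API; all names here are made up:

```python
from typing import Dict, Iterator, Tuple


class DedupReorderBuffer:
    """Merges redundant feeds that share one contiguous sequence id:
    drops duplicates, holds back out-of-order messages, and emits
    messages strictly in sequence order."""

    def __init__(self, first_seq_id: int = 0):
        self.next_seq = first_seq_id          # next sequence id to emit
        self.pending: Dict[int, bytes] = {}   # out-of-order messages held back

    def offer(self, seq_id: int, payload: bytes) -> Iterator[Tuple[int, bytes]]:
        """Feed one message from either stream; yields every message
        that has now become deliverable in order."""
        if seq_id < self.next_seq or seq_id in self.pending:
            return  # duplicate: already emitted or already buffered
        self.pending[seq_id] = payload
        while self.next_seq in self.pending:
            yield self.next_seq, self.pending.pop(self.next_seq)
            self.next_seq += 1

    def reset(self, first_seq_id: int = 0) -> None:
        """Daily sequence-id reset before market open (feature 3)."""
        self.next_seq = first_seq_id
        self.pending.clear()
```

For example, with stream A delivering seq ids 1, 3 and stream B delivering 1, 2:

```python
buf = DedupReorderBuffer(first_seq_id=1)
out = []
for seq, data in [(1, b"a"), (3, b"c"), (1, b"a"), (2, b"b")]:
    out.extend(buf.offer(seq, data))
# out == [(1, b"a"), (2, b"b"), (3, b"c")]
```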
----
2020-01-27 05:55:05 UTC - Antti Kaikkonen: @Antti Kaikkonen has joined the 
channel
----
2020-01-27 07:21:02 UTC - Vladimir Shchur: Can you use an intermediate topic? 
All messages would first go there without deduplication, and then a third 
producer would write from it to the final topic with deduplication enabled.
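For this route, Pulsar's broker-side deduplication can be enabled per namespace with pulsar-admin; the tenant/namespace name below is a placeholder:

```shell
# Enable message deduplication for the namespace that holds the
# final, deduplicated topic (my-tenant/market-data is a placeholder)
bin/pulsar-admin namespaces set-deduplication my-tenant/market-data --enable
```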
----
2020-01-27 07:48:47 UTC - Eugen: That was my _Plan B_. Plan B, because it would 
be much more resource-intensive than deduplicating at the outermost ingestion 
step. A single one of my input streams averages several tens of thousands of 
msgs/sec, with peaks of 250k msgs/sec. If we don't handle 2 such streams at the 
ingestion step, consumption of both bookie disk and broker CPU would more than 
double with the indirection of Plan B.
----