Slack digest for #dev - 2020-01-28

Apache Pulsar Slack Tue, 28 Jan 2020 01:11:34 -0800

2020-01-27 12:38:00 UTC - Antti Kaikkonen: Do queries in Pulsar SQL always read 
all messages of a topic or is it possible to use the message ordering to 
quickly query messages before/after/between specific publish times or sequence 
ids?
----
2020-01-27 14:30:27 UTC - Atri Sharma: @Atri Sharma has joined the channel
----
2020-01-27 16:52:45 UTC - John Pradeep: @John Pradeep has joined the channel
----
2020-01-27 17:09:01 UTC - Sijie Guo: For (3), it should be fairly straightforward 
to add an admin REST API to support this operation. It is independent of whether 
the sequence id is based on producer or topic.


(1) and (2) are more closely related.
The challenging part of (1) is how to define the last seen sequence id 
when two competing producers produce the same set of sequence ids.
The question in (2) is similar to a problem in stream processing: “how 
to handle out-of-order event times”. The challenge is to define which events 
are out of order and, when that happens, how the broker should wait for the 
“out-of-order” events and for how long.

At first glance, you need a general approach for (2) to support (1). In order 
to implement (2) in a semantically correct way, you most likely have to require 
producers to produce monotonically increasing sequence ids (with no gaps, so 
brokers can identify which events are out of order and how to “wait”/“sort”). 
This puts a big constraint on producers, which might make this approach work 
only for certain workflows.
----
2020-01-27 17:10:21 UTC - Paul Danckaert: @Paul Danckaert has joined the channel
----
2020-01-27 17:10:35 UTC - Sijie Guo: It does predicate push down to reduce the 
amount of data to process. The predicate push down currently is based on 
publish time (and message id if I remember correctly).
----
2020-01-27 17:13:34 UTC - Antti Kaikkonen: Ok, thanks! So simply using publish 
time in the where clause should work?
----
2020-01-27 17:48:27 UTC - Sijie Guo: yes
100 : Antti Kaikkonen
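
A minimal sketch of what such a query could look like, built as a string in Python. It assumes the `__publish_time__` metadata column that Pulsar SQL exposes and the `pulsar."tenant/namespace"."topic"` table naming; the helper name `publish_time_query` and the topic `my-topic` are hypothetical, and whether a given predicate is actually pushed down depends on the Pulsar version.

```python
from datetime import datetime


def publish_time_query(tenant: str, namespace: str, topic: str,
                       start: datetime, end: datetime) -> str:
    """Build a Pulsar SQL (Presto) query bounded by publish time.

    Hypothetical helper for illustration only. It interpolates its
    arguments directly, so it must not be fed untrusted input.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        f'SELECT * FROM pulsar."{tenant}/{namespace}"."{topic}" '
        f"WHERE __publish_time__ >= timestamp '{start.strftime(fmt)}' "
        f"AND __publish_time__ < timestamp '{end.strftime(fmt)}'"
    )


query = publish_time_query(
    "public", "default", "my-topic",
    datetime(2020, 1, 27), datetime(2020, 1, 28),
)
print(query)
```

The half-open range (`>=` start, `<` end) avoids reading the same message twice when consecutive time windows are queried.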
----
2020-01-27 20:32:26 UTC - Eugen: I agree, the gapless sequence ids (similar to 
how tcp works) are a prerequisite for this, and it will limit this feature's 
utility to a small set of use cases. This will of course have to be enabled via 
topic or namespace settings. Regarding the development process - is this 
something that I should just start working on and create PRs, or would a PIP be 
in order for this? (In terms of priority for my current project, this is not 
the most urgent, as I have a plan B and a plan C.)
----
2020-01-27 20:53:28 UTC - Eugen: In terms of configuration for out-of-order 
sorting there would have to be at least a max-wait-interval setting, so that 
after that interval has elapsed, the gap will manifest in the topic, and if the 
missing seq id(s) appear subsequently, they will get silently discarded
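
A rough sketch of the behavior described above, in Python rather than broker code. The class `ReorderBuffer` and its methods are hypothetical and not part of Pulsar. It assumes a single gapless sequence per topic and measures the wait from the moment a gap is first noticed. Time is passed in explicitly so the logic is easy to follow.

```python
from typing import Dict, List, Optional, Tuple


class ReorderBuffer:
    """Hypothetical buffer that releases messages in gapless sequence-id order."""

    def __init__(self, max_wait_interval: float, first_seq_id: int = 0) -> None:
        self.max_wait_interval = max_wait_interval
        self.next_seq_id = first_seq_id          # next id expected in the topic
        self.pending: Dict[int, str] = {}        # out-of-order messages held back
        self.gap_since: Optional[float] = None   # when the current gap was noticed

    def offer(self, seq_id: int, payload: str, now: float) -> List[Tuple[int, str]]:
        """Accept a message at time `now`; return messages ready to be appended."""
        if seq_id < self.next_seq_id or seq_id in self.pending:
            return []  # late arrival or duplicate: silently discarded
        self.pending[seq_id] = payload
        return self._drain(now)

    def tick(self, now: float) -> List[Tuple[int, str]]:
        """Called periodically so that a gap can time out without new traffic."""
        return self._drain(now)

    def _drain(self, now: float) -> List[Tuple[int, str]]:
        released: List[Tuple[int, str]] = []
        while self.pending:
            if self.next_seq_id in self.pending:
                released.append((self.next_seq_id, self.pending.pop(self.next_seq_id)))
                self.next_seq_id += 1
                self.gap_since = None
                continue
            # The expected id is missing: start or continue waiting.
            if self.gap_since is None:
                self.gap_since = now
            if now - self.gap_since < self.max_wait_interval:
                break
            # Waited long enough: let the gap manifest and skip ahead.
            self.next_seq_id = min(self.pending)
            self.gap_since = None
        return released


buf = ReorderBuffer(max_wait_interval=5.0)
in_order = buf.offer(0, "a", now=0.0)     # released immediately
held = buf.offer(2, "c", now=1.0)         # id 1 is missing, so "c" is held back
filled = buf.offer(1, "b", now=2.0)       # gap filled, "b" and "c" are released
waiting = buf.offer(4, "e", now=3.0)      # id 3 is missing, wait starts at t=3
timed_out = buf.tick(now=8.0)             # 5 seconds elapsed, gap manifests
too_late = buf.offer(3, "d", now=9.0)     # id 3 arrives after the gap, discarded
```

Memory use grows with the number of held-back messages, so a real implementation would presumably also need a bound on the buffer size in addition to the wait interval.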
----
2020-01-28 01:53:38 UTC - Sijie Guo: @Eugen for small improvements and tasks, 
you can just go ahead and file pull requests. For large tasks, changes to 
user-facing APIs, or changes that can break compatibility, a PIP is recommended 
before starting the actual work.
+1 : Eugen
----
2020-01-28 01:58:35 UTC - Eugen: A question about PRs - do you want them once a 
feature is ready, or can I create one while I'm working on it, perhaps with 
"WIP" added to the commit messages? I'm specifically asking for the "feature 
compatibility matrix", which takes either a) document perusal / experimentation 
or b) experience. So I thought I create a PR with as much of the matrix as I 
know filled in, and reviewers experienced with Pulsar could fill in the blanks.
----
2020-01-28 01:59:33 UTC - Sijie Guo: a WIP PR for filling in the matrix should 
be fine.
+1 : Eugen
----
2020-01-28 03:23:37 UTC - Rahul: @Rahul has joined the channel
----
