2020-01-27 12:38:00 UTC - Antti Kaikkonen: Do queries in Pulsar SQL always read all messages of a topic, or is it possible to use the message ordering to quickly query messages before/after/between specific publish times or sequence ids? ---- 2020-01-27 14:30:27 UTC - Atri Sharma: @Atri Sharma has joined the channel ---- 2020-01-27 16:52:45 UTC - John Pradeep: @John Pradeep has joined the channel ---- 2020-01-27 17:09:01 UTC - Sijie Guo: For (3), it should be fairly straightforward to add an admin REST API to support this operation. It is unrelated to whether the sequence id is based on producer or topic.
(1) and (2) are more closely related. The challenging part of (1) is how to define the last seen sequence id if there are two competing producers producing the same set of sequence ids. The question in (2) is more similar to the problem in stream processing - “how to handle out-of-order event times”. The challenge is to define what the out-of-order events are. If “out-of-order” happens, how should the broker wait for the “out-of-order” events, and for how long? At first glance, you need a general approach for (2) in order to support (1). To implement (2) in a semantically correct way, you most likely have to require producers to produce monotonically increasing sequence ids (with no gaps, so that brokers can identify the out-of-order events and know how to “wait”/“sort”). This puts a big constraint on producers, which might make this approach work only for certain workflows. ---- 2020-01-27 17:10:21 UTC - Paul Danckaert: @Paul Danckaert has joined the channel ---- 2020-01-27 17:10:35 UTC - Sijie Guo: It does predicate pushdown to reduce the amount of data to process. The predicate pushdown is currently based on publish time (and message id, if I remember correctly). ---- 2020-01-27 17:13:34 UTC - Antti Kaikkonen: Ok, thanks! So simply using publish time in the where clause should work? ---- 2020-01-27 17:48:27 UTC - Sijie Guo: yes 100 : Antti Kaikkonen ---- 2020-01-27 20:32:26 UTC - Eugen: I agree, gapless sequence ids (similar to how TCP works) are a prerequisite for this, and that will limit this feature's utility to a small set of use cases. This will of course have to be enabled via topic or namespace settings. Regarding the development process - is this something that I should just start working on and create PRs for, or would a PIP be in order for this? (In terms of priority for my current project, this is not the most urgent, as I have a plan B and a plan C.)
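[Editor's sketch] The publish-time filter confirmed above can be illustrated with a small helper that builds such a query. This is only a sketch: the `__publish_time__` metadata column and the `pulsar."tenant/namespace"."topic"` table naming come from Pulsar SQL, while the helper name `publish_time_query` and the namespace and topic values are hypothetical. Whether the predicate is actually pushed down may depend on the Pulsar version in use.

```python
from datetime import datetime


def publish_time_query(namespace: str, topic: str,
                       start: datetime, end: datetime) -> str:
    """Build a Pulsar SQL query limited to a publish-time window.

    `namespace` is "tenant/namespace"; both it and `topic` are
    placeholder values supplied by the caller (hypothetical here).
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        f'SELECT * FROM pulsar."{namespace}"."{topic}" '
        f"WHERE __publish_time__ >= timestamp '{start.strftime(fmt)}' "
        f"AND __publish_time__ < timestamp '{end.strftime(fmt)}'"
    )


if __name__ == "__main__":
    print(publish_time_query(
        "public/default", "my-topic",            # hypothetical names
        datetime(2020, 1, 27, 12, 0, 0),
        datetime(2020, 1, 27, 13, 0, 0),
    ))
```

The resulting string can be pasted into the Pulsar SQL CLI or sent through any Presto-compatible client.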
---- 2020-01-27 20:53:28 UTC - Eugen: In terms of configuration for out-of-order sorting, there would have to be at least a max-wait-interval setting, so that after that interval has elapsed, the gap will manifest in the topic, and if the missing seq id(s) appear subsequently, they will get silently discarded ---- 2020-01-28 01:53:38 UTC - Sijie Guo: @Eugen for small improvements and tasks, you can just go ahead with filing pull requests. For large tasks, changes to user-facing APIs, or changes that can break compatibility, a PIP is recommended before starting the actual work. +1 : Eugen ---- 2020-01-28 01:58:35 UTC - Eugen: A question about PRs - do you want them once a feature is ready, or can I create one while I'm working on it, perhaps with "WIP" added to the commit messages? I'm specifically asking about the "feature compatibility matrix", which takes either a) document perusal / experimentation or b) experience. So I thought I would create a PR with as much of the matrix as I know filled in, and reviewers experienced with Pulsar could fill in the blanks. ---- 2020-01-28 01:59:33 UTC - Sijie Guo: A WIP PR for filling in the matrix should be good. +1 : Eugen ---- 2020-01-28 03:23:37 UTC - Rahul: @Rahul has joined the channel ----
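[Editor's sketch] The out-of-order handling discussed in this thread (gapless sequence ids, a max-wait-interval, and silent discarding of late arrivals) could look roughly like the reorder buffer below. This is not Pulsar code and does not reflect any existing broker feature; the class `ReorderBuffer`, its methods, and the `max_wait` parameter are all hypothetical names chosen for illustration. Time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ReorderBuffer:
    max_wait: float                       # hypothetical max-wait-interval (seconds)
    next_seq: int = 0                     # next sequence id expected in order
    pending: Dict[int, bytes] = field(default_factory=dict)
    gap_since: Optional[float] = None     # when the current gap was first seen

    def offer(self, seq: int, payload: bytes, now: float) -> List[Tuple[int, bytes]]:
        """Accept one message; return the messages that are now deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []                     # late or duplicate: silently discarded
        self.pending[seq] = payload
        return self._drain(now)

    def tick(self, now: float) -> List[Tuple[int, bytes]]:
        """Call periodically so that an expired gap releases buffered messages."""
        return self._drain(now)

    def _drain(self, now: float) -> List[Tuple[int, bytes]]:
        out: List[Tuple[int, bytes]] = []
        while self.pending:
            if self.next_seq in self.pending:
                out.append((self.next_seq, self.pending.pop(self.next_seq)))
                self.next_seq += 1
                self.gap_since = None
            elif self.gap_since is None:
                self.gap_since = now      # start waiting for the missing id
                break
            elif now - self.gap_since >= self.max_wait:
                # Give up: the gap becomes permanent, skip to the oldest buffered id.
                self.next_seq = min(self.pending)
                self.gap_since = None
            else:
                break                     # still within the wait interval
        return out


if __name__ == "__main__":
    buf = ReorderBuffer(max_wait=5.0)
    print(buf.offer(0, b"a", now=0.0))    # delivered immediately
    print(buf.offer(2, b"c", now=1.0))    # held back, id 1 is missing
    print(buf.offer(1, b"b", now=2.0))    # releases ids 1 and 2 in order
    print(buf.offer(4, b"e", now=3.0))    # held back, id 3 is missing
    print(buf.tick(now=8.0))              # wait expired, id 4 released
    print(buf.offer(3, b"d", now=9.0))    # arrives too late, discarded
```

The sketch tracks a single producer. With competing producers that reuse the same sequence ids, the "last seen sequence id" question raised earlier in the thread would still need a separate answer.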