Bump? 🙏
The output is a bunch of files in Parquet format. The thing reading them
would be Presto, so I can't really tell it to ignore some rows but not
others. Not to mention that the files would keep piling up, making SQL
queries super slow.
On Fri, Jan 18, 2019, 10:01 AM Jamie Grier wrote:
> Sorry my earlier comment…
Hmm.. I would have to look into the code for the StreamingFileSink more
closely to understand the concern, but typically you should not be concerned
at all with *when* checkpoints happen. They are meant to be a completely
asynchronous background process that has absolutely no bearing on
application…
Sorry my earlier comment should read: "It would just read all the files in
order and NOT worry about which data rows are in which files"
On Fri, Jan 18, 2019 at 10:00 AM Jamie Grier wrote:
> Hmm.. I would have to look into the code for the StreamingFileSink more
> closely to understand the concern…
Well, the problem is that, conceptually, the way I'm trying to approach
this is OK. But in practice, it has some edge cases.
So back to my original premise: if both the trigger and a checkpoint happen
around the same time, there is a chance that the streaming file sink rolls
the bucket BEFORE it has…
Oh sorry.. Logically, since the ContinuousProcessingTimeTrigger never
PURGES but only FIRES, what I said is semantically true. The window
contents are never cleared.
What I missed is that in this case, since you're using a function that
incrementally reduces on the fly rather than processing all the…
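Concretely, I believe the pattern you're describing looks something like the
sketch below (changeStream, primaryKey, and parquetSink are assumed names,
and I'm guessing at GlobalWindows for your setup). The trigger FIREs every
interval without purging, but because a ReduceFunction folds elements as
they arrive, the retained window state is just one element per key:

  import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
  import org.apache.flink.streaming.api.windowing.time.Time;
  import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger;

  changeStream
      .keyBy(change -> change.primaryKey)   // one logical row per key
      .window(GlobalWindows.create())
      .trigger(ContinuousProcessingTimeTrigger.of(Time.minutes(5)))
      // Incremental reduce: only the latest change per key is kept as
      // window state, and it is re-emitted (not cleared) on every FIRE,
      // which gives you a full per-key snapshot each interval.
      .reduce((older, newer) -> newer)
      .addSink(parquetSink);  // the S3/Parquet StreamingFileSink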
Hello Jamie.
Thanks for taking a look at this. So, yes, I want to write only the latest
data for each key every X minutes. In other words, I want a snapshot of the
whole database every X minutes.
> The issue is that the window never gets PURGED so the data just
> continues to accumulate in the window…
If I'm understanding you correctly, you're just trying to do some data
reduction so that you write data for each key once every five minutes
rather than for every CDC update.. Is that correct? You also want to keep
the most recent state for each key you've ever seen so you don't apply
writes out of order…
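If so, the reduce function itself can carry that ordering guard. A sketch
(ChangeEvent and its position field are made up, standing in for something
like the Debezium LSN or Kafka offset):

  import org.apache.flink.api.common.functions.ReduceFunction;

  // Keep whichever change carries the higher CDC position, so a
  // late-arriving older change can never overwrite a newer one.
  ReduceFunction<ChangeEvent> keepLatest =
      (a, b) -> a.position >= b.position ? a : b;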
Hello there.
So we have some Postgres tables that are mutable, and we want to create a
snapshot of them in S3 every X minutes. So we plan to use Debezium to send a
CDC log of every row change into a Kafka topic, and then have Flink keep the
latest state of each row to save that data into S3 subsequently…
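The rough shape of the job would be something like this (a sketch with
placeholder names; the topic, broker address, and group id are all made up,
and the middle of the pipeline is elided):

  import java.util.Properties;
  import org.apache.flink.api.common.serialization.SimpleStringSchema;
  import org.apache.flink.streaming.api.datastream.DataStream;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
  import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

  StreamExecutionEnvironment env =
      StreamExecutionEnvironment.getExecutionEnvironment();
  env.enableCheckpointing(60_000);  // checkpoints also drive Parquet file rolls

  Properties props = new Properties();
  props.setProperty("bootstrap.servers", "kafka:9092");
  props.setProperty("group.id", "pg-snapshots");

  // Debezium publishes one CDC record per row change to this topic.
  DataStream<String> cdc = env.addSource(
      new FlinkKafkaConsumer<>("postgres.changes", new SimpleStringSchema(), props));

  // parse JSON -> keyBy primary key -> keep latest row per key
  // -> fire every X minutes -> StreamingFileSink (Parquet on S3), as above.

  env.execute("pg-snapshot-to-s3");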