Re: Dump snapshot of big table in real time using StreamingFileSink

2019-01-24 Thread knur
Bump? 🙏

Re: Dump snapshot of big table in real time using StreamingFileSink

2019-01-18 Thread Cristian C
The output is a bunch of files in Parquet format. The thing reading them would be Presto, so I can't really tell it to ignore some rows but not others. Not to mention that the files would keep piling up, making SQL queries super slow.

Re: Dump snapshot of big table in real time using StreamingFileSink

2019-01-18 Thread Jamie Grier
Hmm.. I would have to look into the code for the StreamingFileSink more closely to understand the concern but typically you should not be concerned at all with *when* checkpoints happen. They are meant to be a completely asynchronous background process that has absolutely no bearing on applicatio

Re: Dump snapshot of big table in real time using StreamingFileSink

2019-01-18 Thread Jamie Grier
Sorry my earlier comment should read: "It would just read all the files in order and NOT worry about which data rows are in which files"

Re: Dump snapshot of big table in real time using StreamingFileSink

2019-01-18 Thread Cristian C
Well, the problem is that, conceptually, the way I'm trying to approach this is ok. But in practice, it has some edge cases. So back to my original premise: if both the trigger and the checkpoint happen around the same time, there is a chance that the streaming file sink rolls the bucket BEFORE it ha

Re: Dump snapshot of big table in real time using StreamingFileSink

2019-01-18 Thread Jamie Grier
Oh sorry.. Logically, since the ContinuousProcessingTimeTrigger never PURGES but only FIRES what I said is semantically true. The window contents are never cleared. What I missed is that in this case since you're using a function that incrementally reduces on the fly rather than processing all t

Re: Dump snapshot of big table in real time using StreamingFileSink

2019-01-17 Thread knur
Hello Jamie. Thanks for taking a look at this. So, yes, I want to write only the last data for each key every X minutes. In other words, I want a snapshot of the whole database every X minutes. > The issue is that the window never gets PURGED so the data just > continues to accumulate in the wi

Re: Dump snapshot of big table in real time using StreamingFileSink

2019-01-17 Thread Jamie Grier
If I'm understanding you correctly you're just trying to do some data reduction so that you write data for each key once every five minutes rather than for every CDC update.. Is that correct? You also want to keep the state for most recent key you've ever seen so you don't apply writes out of ord
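The reduction described above (keep only the newest state per key, ignore out-of-order writes, and emit the whole table once per interval) can be sketched in plain Python. This is not Flink code and not anyone's actual job from this thread. The `Change` record, its `version` field (standing in for something like a log position), and the `snapshots` function are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Change:
    key: str              # primary key of the row (hypothetical field)
    version: int          # increasing position in the change log (assumption)
    row: Optional[dict]   # new row contents, or None for a delete
    ts: float             # time in seconds at which the change was seen


def _dump(latest: Dict[str, Change]) -> Dict[str, dict]:
    # State is never purged: every key seen so far goes into the snapshot,
    # except keys whose newest change is a delete.
    return {k: c.row for k, c in latest.items() if c.row is not None}


def snapshots(changes: Iterable[Change], interval: float) -> List[Dict[str, dict]]:
    """Reduce a change stream to full-table snapshots, one per elapsed interval."""
    latest: Dict[str, Change] = {}
    out: List[Dict[str, dict]] = []
    next_fire: Optional[float] = None

    for change in changes:
        if next_fire is None:
            next_fire = change.ts + interval
        # Fire once for every interval boundary passed before this change.
        while change.ts >= next_fire:
            out.append(_dump(latest))
            next_fire += interval
        # Keep only the newest version per key; drop out-of-order updates.
        current = latest.get(change.key)
        if current is None or change.version > current.version:
            latest[change.key] = change
    return out
```

In this sketch a snapshot only fires when a later change arrives, which a real timer-driven trigger would not need. It also says nothing about how the snapshot files are rolled or committed, which is the part the rest of the thread is concerned with.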

Dump snapshot of big table in real time using StreamingFileSink

2019-01-17 Thread knur
Hello there. So we have some Postgres tables that are mutable, and we want to create a snapshot of them in S3 every X minutes. So we plan to use Debezium to send a CDC log of every row change into a Kafka topic, and then have Flink keep the latest state of each row to save that data into S3 subseq