Re: Flink streaming with 1+ TB of managed state

2016-11-22 Thread Gyula Fóra
Hi Steven, Let me try to address your questions :) 1. We take checkpoints approximately every hour for these large states to reduce the strain on our network. With incremental checkpoints we would go down to every couple of minutes. 2. We don't have anything additional and you a
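Gyula's hourly interval reflects that a full checkpoint has to copy the whole state every time. The back-of-the-envelope Python sketch below shows that trade-off. The bandwidth and change-rate figures are made up for illustration because the thread gives neither, and `checkpoint_bytes` / `upload_seconds` are hypothetical helpers:

```python
# Rough sketch with hypothetical numbers (not taken from the thread): how long
# does it take to ship one checkpoint of TB-scale state to remote storage?
TB = 10**12
GB = 10**9


def checkpoint_bytes(state_bytes: float, changed_fraction: float, incremental: bool) -> float:
    """Bytes copied per checkpoint.

    A full checkpoint copies the entire state every time; an incremental one
    copies only the fraction that changed since the previous checkpoint.
    """
    return state_bytes * changed_fraction if incremental else state_bytes


def upload_seconds(bytes_to_copy: float, bandwidth_bytes_per_s: float) -> float:
    """Time to copy one checkpoint at a sustained aggregate bandwidth."""
    return bytes_to_copy / bandwidth_bytes_per_s


state = 1 * TB
bandwidth = 1 * GB        # assumed aggregate throughput to the checkpoint store
changed_fraction = 0.02   # assumed share of state rewritten between checkpoints

full = upload_seconds(checkpoint_bytes(state, changed_fraction, incremental=False), bandwidth)
incr = upload_seconds(checkpoint_bytes(state, changed_fraction, incremental=True), bandwidth)
print(f"full checkpoint: {full / 60:.1f} min, incremental: {incr / 60:.1f} min")
```

With an LSM store such as RocksDB, the bytes an incremental checkpoint copies can exceed the logical change rate, because compaction rewrites existing data into new files. `changed_fraction` should therefore be read as "share of on-disk bytes that are new".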

Re: Flink streaming with 1+ TB of managed state

2016-11-21 Thread Steven Ruppert
Some responses inline below: On Sat, Nov 19, 2016 at 4:07 PM, Gyula Fóra <gyula.f...@gmail.com> wrote: Hi Steven, As Robert said, some of our jobs have state sizes of around a TB or more. We use the RocksDB state backend with some configs tuned to perform well on SSDs (you can get some tips

Re: Flink streaming with 1+ TB of managed state

2016-11-21 Thread Stephan Ewen
Some background on incremental checkpointing: it is not in the system yet, but we have a fairly advanced design, and some committers/contributors are currently starting the effort. My personal estimate is that it would be available in a few months (Q1 next year). Best, Stephan On Sat, Nov 19, 2016
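Stephan's message does not describe the design itself, so the sketch below is a generic illustration only and not Flink's implementation. One common way to make checkpoints incremental with an LSM store like RocksDB relies on its on-disk SST files being immutable once written. Each checkpoint then uploads only the files the previous checkpoint does not already hold and references the rest. The `Checkpoint` class, the `incremental_checkpoint` helper, and the file names are all hypothetical:

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Checkpoint:
    checkpoint_id: int
    files: FrozenSet[str]     # every file this checkpoint needs for a restore
    uploaded: FrozenSet[str]  # files actually copied to remote storage for it


def incremental_checkpoint(
    checkpoint_id: int,
    local_files: Iterable[str],
    previous: Optional[Checkpoint] = None,
) -> Checkpoint:
    """Copy only files the previous checkpoint does not already hold.

    Assumes files are immutable once written (true of RocksDB SST files),
    so a file name already in remote storage still has the same contents.
    """
    current = frozenset(local_files)
    already_stored = previous.files if previous is not None else frozenset()
    return Checkpoint(checkpoint_id, current, current - already_stored)


# Hypothetical file names: the second checkpoint only adds one new SST file.
cp1 = incremental_checkpoint(1, ["000001.sst", "000002.sst"])
cp2 = incremental_checkpoint(2, ["000001.sst", "000002.sst", "000003.sst"], previous=cp1)
print(sorted(cp1.uploaded), sorted(cp2.uploaded))
```

A real system would also need some form of reference counting, so that remote files are deleted only once no retained checkpoint refers to them. That part is omitted here.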

Re: Flink streaming with 1+ TB of managed state

2016-11-19 Thread Gyula Fóra
Hi Steven, As Robert said, some of our jobs have state sizes of around a TB or more. We use the RocksDB state backend with some configs tuned to perform well on SSDs (you can get some tips here: https://www.youtube.com/watch?v=pvUqbIeoPzM). We checkpoint our state to Ceph (similar to HDFS but this is

Re: Flink streaming with 1+ TB of managed state

2016-11-19 Thread Robert Metzger
Hi Steven, According to this presentation, King.com is using Flink with terabytes of state: http://flink-forward.org/wp-content/uploads/2016/07/Gyulo-Fo%CC%81ra-RBEA-Scalable-Real-Time-Analytics-at-King.compressed.pdf (see page 4 specifically). For the 90GB experiment, what is the expected time fo

Flink streaming with 1+ TB of managed state

2016-11-18 Thread Steven Ruppert
Hi, Is anybody currently running Flink streaming with north of a terabyte (TB) of managed state? If you are, can you share your experiences with hardware, tuning, recovery situations, etc.? I'm evaluating Flink for a use case that I estimate will need around 5 TB of state in total, but looking at the act
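For hardware and recovery questions like this one, a first-order estimate spreads the total state evenly over the cluster and assumes each node restores its share in parallel. The Python sketch below uses a hypothetical 200 MB/s per-node download rate and node counts chosen purely for illustration. `per_node_state` and `restore_seconds` are hypothetical helpers:

```python
# First-order sizing under stated assumptions: state is spread evenly across
# nodes and every node downloads its share in parallel during recovery.
MB = 10**6
GB = 10**9
TB = 10**12


def per_node_state(total_bytes: float, nodes: int) -> float:
    """State each node holds if keys are distributed evenly (no skew)."""
    return total_bytes / nodes


def restore_seconds(total_bytes: float, nodes: int, per_node_bytes_per_s: float) -> float:
    """Wall-clock time to pull all state back if nodes restore in parallel."""
    return per_node_state(total_bytes, nodes) / per_node_bytes_per_s


total_state = 5 * TB      # the estimate from the question above
per_node_rate = 200 * MB  # hypothetical sustained download rate per node

for nodes in (10, 20, 50):  # hypothetical cluster sizes
    share = per_node_state(total_state, nodes)
    minutes = restore_seconds(total_state, nodes, per_node_rate) / 60
    print(f"{nodes} nodes: {share / GB:.0f} GB each, about {minutes:.0f} min to restore")
```

This ignores key skew and the time to reopen the local RocksDB instances. It also ignores the aggregate limit of the checkpoint store: 50 nodes at 200 MB/s each means the store must sustain 10 GB/s.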