Hi Stephan,

I am using Flink 1.2.0 and running the job on YARN, on c3.4xlarge EC2 instances with 16 cores and 30GB RAM each.
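For reference, the relevant flink-conf.yaml entries for this setup would look roughly like this (the per-TM slot count is my assumption about how the 32 slots mentioned below are spread across the TMs):

    taskmanager.numberOfTaskSlots: 16          # assuming 2 TMs -> 32 slots in total
    taskmanager.network.numberOfBuffers: 1200
    taskmanager.heap.mb: 8192                  # 8GB heap per TM (see below)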
In total I have set 32 slots and allotted 1200 network buffers. I have attached the latest checkpointing snapshot, its configuration, the CPU load average, physical memory usage, and heap memory usage here:

http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n11759/Checkpoints_State.png
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n11759/Checkpoint_Configuration.png
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n11759/CPU_Load_Physical_memory.png
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n11759/Heap_Usage.png

Before I describe the topology, I want to let you know that when I enabled object reuse, 32M records (64M in total from the two Kafka sources) were processed in 17 minutes, and I did not see much of a halt in between. This is the best result I have got so far (earlier it took 1 hour 3 minutes), but I don't understand how it worked :) How does object reuse help here? I have also used the FLASH_SSD_OPTIMIZED predefined options; I have pasted the relevant enablement code below my signature.

The program uses the following operations (a stripped-down skeleton also follows below):

1. Consume data from two Kafka sources
2. Extract the information from the record (flatMap)
3. Write the as-is data to S3 (sink)
4. Union both streams and apply a tumbling window on the result to perform an outer join (keyBy -> window -> apply)
5. Some other operators downstream to enrich the data (map -> flatMap -> map)
6. Write the enriched data to S3 (sink)

I have allocated 8GB of heap space to each TM (see the 4th snapshot above). The final aim is to test with at least 100M records.

Let me know your inputs.

Regards,
Vinay Patil
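P.S. For completeness, this is roughly how I enable object reuse and the SSD-optimized RocksDB options (the checkpoint URI and interval below are placeholders, not my real values):

import java.io.IOException;

import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackendSetupSketch {
    public static void main(String[] args) throws IOException {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // reuse objects between chained operators instead of deep-copying every record
        env.getConfig().enableObjectReuse();

        // RocksDB state backend with predefined options tuned for flash SSDs
        RocksDBStateBackend backend =
                new RocksDBStateBackend("s3://my-bucket/checkpoints"); // placeholder URI
        backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);
        env.setStateBackend(backend);

        env.enableCheckpointing(60000); // placeholder interval (ms)
    }
}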
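And here is a stripped-down skeleton of the topology. Topic names, the window size, and the window logic are placeholders, and print() stands in for the real map -> flatMap -> map enrichment and the S3 sinks:

import java.util.Properties;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import org.apache.flink.util.Collector;

public class OuterJoinJobSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder
        props.setProperty("group.id", "outer-join-job");       // placeholder

        // 1 + 2: two Kafka sources, extract (key, payload) from each record
        DataStream<Tuple2<String, String>> left = env
                .addSource(new FlinkKafkaConsumer09<>("topic-a", new SimpleStringSchema(), props))
                .flatMap(new Extract());
        DataStream<Tuple2<String, String>> right = env
                .addSource(new FlinkKafkaConsumer09<>("topic-b", new SimpleStringSchema(), props))
                .flatMap(new Extract());

        // 4: union both streams, tumbling window per key, outer join in the WindowFunction
        left.union(right)
                .keyBy(0) // key on the join field
                .window(TumblingProcessingTimeWindows.of(Time.seconds(60))) // size is a placeholder
                .apply(new WindowFunction<Tuple2<String, String>, String, Tuple, TimeWindow>() {
                    @Override
                    public void apply(Tuple key, TimeWindow window,
                                      Iterable<Tuple2<String, String>> records,
                                      Collector<String> out) {
                        // the real job pairs up records from both sides here and also
                        // emits the unmatched ones (outer join); this just forwards payloads
                        for (Tuple2<String, String> record : records) {
                            out.collect(record.f1);
                        }
                    }
                })
                // 5 + 6: stand-in for the enrichment operators and the S3 sinks
                .print();

        env.execute("outer join sketch");
    }

    // 2: placeholder for the real extraction logic
    private static class Extract implements FlatMapFunction<String, Tuple2<String, String>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, String>> out) {
            out.collect(new Tuple2<>(value, value));
        }
    }
}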