Stephan: The links were in the other email from Vinay.
> On Feb 21, 2017, at 10:46 AM, Stephan Ewen <se...@apache.org> wrote:
>
> Hi!
>
> I cannot find the screenshots you attached.
> The Apache mailing lists sometimes don't support attachments, can you link to
> the screenshots some other way?
>
> Stephan
>
>
>> On Mon, Feb 20, 2017 at 8:36 PM, vinay patil <vinay18.pa...@gmail.com> wrote:
>> Hi Stephan,
>>
>> Just saw your mail while I was explaining the answer to your earlier
>> questions. I have attached some more screenshots which are taken from the
>> latest run today.
>> Yes, I will try to set it to a higher value and check if performance improves.
>>
>> Let me know your thoughts
>>
>> Regards,
>> Vinay Patil
>>
>>> On Tue, Feb 21, 2017 at 12:51 AM, Stephan Ewen [via Apache Flink User
>>> Mailing List archive.] <[hidden email]> wrote:
>>> @Vinay!
>>>
>>> Just saw the screenshot you attached to the first mail. The checkpoint that
>>> failed came after one that had an incredibly heavy alignment phase (14 GB).
>>> I think that working that off threw the next checkpoint because the workers
>>> were still working off the alignment backlog.
>>>
>>> I think you can for now fix this by setting the minimum pause between
>>> checkpoints a bit higher (it is probably set a bit too small for the state
>>> of your application).
>>>
>>> Also, can you describe what your sources are (Kafka / Kinesis or file
>>> system)?
>>>
>>> BTW: We are currently working on
>>> - incremental RocksDB checkpoints
>>> - the network stack to allow in the future for a new way of doing the
>>> alignment
>>>
>>> Both should help make the program more resilient to these situations.
>>>
>>> Best,
>>> Stephan
>>>
>>>
>>>
>>>> On Mon, Feb 20, 2017 at 7:51 PM, Stephan Ewen <[hidden email]> wrote:
>>>> Hi Vinay!
>>>>
>>>> Can you start by giving us a bit of an environment spec?
>>>>
>>>> - What Flink version are you using?
>>>> - What is your rough topology (what operations does the program use)
>>>> - Where is the state (windows, keyBy)?
>>>> - What is the rough size of your checkpoints and where does the time go?
>>>> Can you attach a screenshot from
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.2/monitoring/checkpoint_monitoring.html
>>>> - What is the size of the JVM?
>>>>
>>>> Those things would be helpful to know...
>>>>
>>>> Best,
>>>> Stephan
>>>>
>>>>
>>>>> On Mon, Feb 20, 2017 at 7:04 PM, vinay patil <[hidden email]> wrote:
>>>>> Hi Xiaogang,
>>>>>
>>>>> Thank you for your inputs.
>>>>>
>>>>> Yes, I have already tried setting MaxBackgroundFlushes and
>>>>> MaxBackgroundCompactions to higher values (tried with 2, 4, 8), still not
>>>>> getting expected results.
>>>>>
>>>>> System.getProperty("java.io.tmpdir") points to /tmp but I could not find
>>>>> the RocksDB logs there. Can you please let me know where I can find them?
>>>>>
>>>>> Regards,
>>>>> Vinay Patil
>>>>>
>>>>>> On Mon, Feb 20, 2017 at 7:32 AM, xiaogang.sxg [via Apache Flink User
>>>>>> Mailing List archive.] <[hidden email]> wrote:
>>>>>> Hi Vinay
>>>>>>
>>>>>> Can you provide the LOG file in RocksDB? It helps a lot to figure out
>>>>>> the problems because it records the options and the events that happened
>>>>>> during the execution. Unless configured otherwise, it should be located
>>>>>> at the path set in System.getProperty("java.io.tmpdir").
>>>>>>
>>>>>> Typically, a large amount of memory is consumed by RocksDB to store
>>>>>> necessary indices. To avoid the unlimited growth in the memory
>>>>>> consumption, you can put these indices into block cache (set
>>>>>> CacheIndexAndFilterBlock to true) and properly set the block cache size.
>>>>>>
>>>>>> You can also increase the number of background threads to improve the
>>>>>> performance of flushes and compactions (via MaxBackgroundFlushes and
>>>>>> MaxBackgroundCompactions).
>>>>>>
>>>>>> In YARN clusters, task managers will be killed if their memory
>>>>>> utilization exceeds the allocation size.
>>>>>> Currently Flink does not count
>>>>>> the memory used by RocksDB in the allocation. We are working on
>>>>>> fine-grained resource allocation (see FLINK-5131). It may help to avoid
>>>>>> such problems.
>>>>>>
>>>>>> Hope this information helps.
>>>>>>
>>>>>> Regards,
>>>>>> Xiaogang
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------
>>>>>> From: Vinay Patil <[hidden email]>
>>>>>> Sent: Friday, February 17, 2017, 21:19
>>>>>> To: user <[hidden email]>
>>>>>> Subject: Re: Checkpointing with RocksDB as statebackend
>>>>>>
>>>>>> Hi Guys,
>>>>>>
>>>>>> There seems to be some issue with RocksDB memory utilization.
>>>>>>
>>>>>> Within a few minutes of the job running, the physical memory usage
>>>>>> increases by 4-5 GB and it keeps on increasing.
>>>>>> I have tried different options for Max Buffer Size (30MB, 64MB, 128MB,
>>>>>> 512MB) and Min Buffer to Merge as 2, but the physical memory keeps on
>>>>>> increasing.
>>>>>>
>>>>>> According to RocksDB documentation, these are the main options on which
>>>>>> flushing to storage is based.
>>>>>>
>>>>>> Can you please point out where I am going wrong? I have tried different
>>>>>> configuration options but each time the Task Manager is getting killed
>>>>>> after some time :)
>>>>>>
>>>>>> Regards,
>>>>>> Vinay Patil
>>>>>>
>>>>>> On Thu, Feb 16, 2017 at 6:02 PM, Vinay Patil <[hidden email]> wrote:
>>>>>> I think it's more related to RocksDB. I am also not familiar with
>>>>>> RocksDB but am reading the tuning guide to understand the important
>>>>>> values that can be set
>>>>>>
>>>>>> Regards,
>>>>>> Vinay Patil
>>>>>>
>>>>>> On Thu, Feb 16, 2017 at 5:48 PM, Stefan Richter [via Apache Flink User
>>>>>> Mailing List archive.] <[hidden email]> wrote:
>>>>>> What kind of problem are we talking about? S3 related or RocksDB
>>>>>> related? I am not aware of problems with RocksDB per se. I think seeing
>>>>>> logs for this would be very helpful.
>>>>>>
>>>>>> On 16.02.2017 at 11:56, Aljoscha Krettek <[hidden email]> wrote:
>>>>>>
>>>>>> [hidden email] and [hidden email] could this be the same problem that
>>>>>> you recently saw when working with other people?
>>>>>>
>>>>>> On Wed, 15 Feb 2017 at 17:23 Vinay Patil <[hidden email]> wrote:
>>>>>> Hi Guys,
>>>>>>
>>>>>> Can anyone please help me with this issue
>>>>>>
>>>>>> Regards,
>>>>>> Vinay Patil
>>>>>>
>>>>>> On Wed, Feb 15, 2017 at 6:17 PM, Vinay Patil <[hidden email]> wrote:
>>>>>> Hi Ted,
>>>>>>
>>>>>> I have 3 boxes in my pipeline: the 1st and 2nd box contain a source and
>>>>>> an S3 sink, and the 3rd box is a window operator followed by chained
>>>>>> operators and an S3 sink
>>>>>>
>>>>>> So in the details link section I can see that the S3 sink is taking
>>>>>> time for the acknowledgement and it is not even going to the window
>>>>>> operator chain.
>>>>>>
>>>>>> But as shown in the snapshot, checkpoint id 19 did not get any
>>>>>> acknowledgement. Not sure what is causing the issue
>>>>>>
>>>>>> Regards,
>>>>>> Vinay Patil
>>>>>>
>>>>>> On Wed, Feb 15, 2017 at 5:51 PM, Ted Yu [via Apache Flink User Mailing
>>>>>> List archive.] <[hidden email]> wrote:
>>>>>> What did the More Details link say ?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> > On Feb 15, 2017, at 3:11 AM, vinay patil <[hidden email]> wrote:
>>>>>> >
>>>>>> > Hi,
>>>>>> >
>>>>>> > I have kept the checkpointing interval to 6 secs and minimum pause
>>>>>> > between checkpoints to 5 secs. While testing the pipeline I have
>>>>>> > observed that some checkpoints take a long time: as you can see in the
>>>>>> > attached snapshot, checkpoint id 19 took the maximum time before it
>>>>>> > failed, although it had not received any acknowledgements. During these
>>>>>> > 10 minutes the entire pipeline did not make any progress and no data
>>>>>> > was getting processed.
>>>>>> > (For example: in 13 minutes 20M records were processed, and when the
>>>>>> > checkpoint took time there was no progress for the next 10 minutes.)
>>>>>> >
>>>>>> > I have even tried to set max checkpoint timeout to 3 min, but in that
>>>>>> > case as well multiple checkpoints were failing.
>>>>>> >
>>>>>> > I have set RocksDB FLASH_SSD_OPTION
>>>>>> > What could be the issue ?
>>>>>> >
>>>>>> > P.S. I am writing to 3 S3 sinks
>>>>>> >
>>>>>> > checkpointing_issue.PNG
>>>>>> > <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n11640/checkpointing_issue.PNG>
>>>>>> >
>>>>>> > --
>>>>>> > View this message in context:
>>>>>> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-with-RocksDB-as-statebackend-tp11640.html
>>>>>> > Sent from the Apache Flink User Mailing List archive. mailing list
>>>>>> > archive at Nabble.com.
>>>>>> If you reply to this email, your message will be added to the discussion
>>>>>> below:
>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-with-RocksDB-as-statebackend-tp11640p11641.html
>>>>>> To start a new topic under Apache Flink User Mailing List archive.,
>>>>>> email [hidden email]
>>>>>> To unsubscribe from Apache Flink User Mailing List archive., click here.
>>>>>> NAML
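
For reference, here is a rough sketch of why raising the minimum pause, as suggested in the thread, gives the workers more room between checkpoints. It is a simplified timing model and not Flink's actual scheduler code: the class `CheckpointPacing` and its method names are made up for illustration, and the 10-minute checkpoint duration is taken loosely from the stall described in the original report. In a real job the value itself would be set through `CheckpointConfig.setMinPauseBetweenCheckpoints`.

```java
// Simplified model (an assumption, not Flink internals): the next checkpoint
// may start only when BOTH the checkpoint interval (measured from the last
// start) and the minimum pause (measured from the last completion) have passed.
final class CheckpointPacing {

    // Earliest start time of the next checkpoint, in milliseconds.
    static long earliestNextStartMs(long lastStartMs, long lastEndMs,
                                    long intervalMs, long minPauseMs) {
        return Math.max(lastStartMs + intervalMs, lastEndMs + minPauseMs);
    }

    // Time the job runs without any checkpoint in flight before the next one.
    static long quietTimeMs(long lastEndMs, long nextStartMs) {
        return nextStartMs - lastEndMs;
    }

    public static void main(String[] args) {
        long intervalMs = 6_000L;   // 6 s interval, as in the thread
        long minPauseMs = 5_000L;   // 5 s minimum pause, as in the thread

        // A fast checkpoint (1 s): the interval is the limiting factor.
        long fast = earliestNextStartMs(0L, 1_000L, intervalMs, minPauseMs);
        System.out.println("next start after fast checkpoint: " + fast + " ms");

        // A slow checkpoint (10 min): only the minimum pause separates it
        // from the next one, so the job gets just 5 s to catch up.
        long slow = earliestNextStartMs(0L, 600_000L, intervalMs, minPauseMs);
        System.out.println("quiet time after slow checkpoint: "
                + quietTimeMs(600_000L, slow) + " ms");

        // Hypothetical larger pause of 60 s: the catch-up window grows with it.
        long relaxed = earliestNextStartMs(0L, 600_000L, intervalMs, 60_000L);
        System.out.println("quiet time with 60 s pause: "
                + quietTimeMs(600_000L, relaxed) + " ms");
    }
}
```

Under this model, once a checkpoint runs longer than the interval, the minimum pause is the only thing that guarantees time for working off the alignment backlog, which is why a 5 s value can be too small for a job with large state.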