-1 Changing the memory consumption between minor releases should not happen.
The good news: Robert ran a test with the latest 1.1 branch that contains a fix for the changed RocksDB memory configuration and reported stable behaviour.

@Flavio: I agree, but since we're already very late with this bugfix release I would not like to wait for the PR to be merged. We can include it in 1.1.5, which can follow very soon imo. I hope that's OK for you.

On Fri, Dec 16, 2016 at 11:07 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:

> @Robert
>
> - I am not sure if the RocksDB problems are closely related to the version
>   upgrade, I have been experiencing similar problems for months. This is
>   usually not a huge problem on YARN I think, it mostly hurts in standalone
>   clusters.
> - Also the yarn memory limits are tricky to configure nicely as it depends a
>   lot on how rocks handles native memory. It seems to grow quite a lot over
>   time.
>
> Flavio Pompermaier <pomperma...@okkam.it> wrote (on Fri, Dec 16, 2016, 10:56):
>
>> I personally think that it should be quite important to have a fix also for
>> the ES connector (https://issues.apache.org/jira/browse/FLINK-5122).
>>
>> Best,
>> Flavio
>>
>> On Fri, Dec 16, 2016 at 10:43 AM, Robert Metzger <rmetz...@apache.org> wrote:
>>
>> > I'm not sure if we can release the release candidate like this, because I'm
>> > running into two issues probably related to a recent rocksdb version
>> > upgrade.
>> >
>> > This is my list of points so far:
>> >
>> > - Checked the staging repository. Quickstarts and Hadoop 1 / 2 are okay.
>> > - Build a job against the staging repository
>> > - Binaries deploy on a kerberized HA YARN / HDFS setup. Ran the KMeans and
>> >   WordCount batch jobs
>> > - Executed a heavy, misbehaved streaming job for a few hours.
>> >   While running that job, I found that:
>> >   - Not all checkpoint directories are cleaned up in HDFS (I use the async
>> >     rocksdb statebackend)
>> >   - segfaults from rocksdb (8 segfaults in ~3 hrs, but they were all
>> >     happening in the last minutes)
>> >   - "beyond physical memory limits" container killings from YARN (I know we
>> >     can configure this, I just wonder if we should change the default
>> >     value)
>> >   - the segfaults and memory limits caused the job to not run anymore in
>> >     the end because it was in a constant retry loop.
>> > - This is not a blocking issue I found during the testing:
>> >   https://issues.apache.org/jira/browse/FLINK-5345
>> > - This is also a non blocking issue for 1.1.4 (fixed for 1.2)
>> >   https://issues.apache.org/jira/browse/FLINK-4631
>> >
>> > Let me know if we should release anyways or fix these issues first.
>> >
>> > On Tue, Dec 13, 2016 at 11:04 PM, Ufuk Celebi <u...@apache.org> wrote:
>> >
>> > > Dear Flink community,
>> > >
>> > > Please vote on releasing the following candidate as Apache Flink version
>> > > 1.1.4.
>> > >
>> > > The commit to be voted on:
>> > > 2cd6579 (http://git-wip-us.apache.org/repos/asf/flink/commit/2cd6579)
>> > >
>> > > Branch:
>> > > release-1.1.4-rc3
>> > > (https://git1-us-west.apache.org/repos/asf/flink/repo?p=flink.git;a=shortlog;h=refs/heads/release-1.1.4-rc3)
>> > >
>> > > The release artifacts to be voted on can be found at:
>> > > http://people.apache.org/~uce/flink-1.1.4-rc3/
>> > >
>> > > The release artifacts are signed with the key with fingerprint 9D403309:
>> > > http://www.apache.org/dist/flink/KEYS
>> > >
>> > > The staging repository for this release can be found at:
>> > > https://repository.apache.org/content/repositories/orgapacheflink-1109
>> > >
>> > > -------------------------------------------------------------
>> > >
>> > > The voting time is at least three days and the vote passes if a
>> > > majority of at least three +1 PMC votes are cast. The vote ends earliest
>> > > on Friday, December 16th, 2016, at 11 PM (CET)/2 PM (PST).
>> > >
>> > > [ ] +1 Release this package as Apache Flink 1.1.4
>> > > [ ] -1 Do not release this package, because ...