We are seeing numerous HDFS test failures, which affect PR submissions and evaluations.
I'll first roll back PR #6664 and then proceed to run the full suite of unit tests on the server, comparing the unit test results before and after the surefire upgrade. Hopefully, the next upgrade will go smoothly.

Many thanks to Ayush for the detailed explanation, and also thanks to Xiaoqiao and Steve.

Best Regards,
Shilun Fan.

On Sat, Jun 8, 2024 at 2:33 PM slfan1989 <slfan1...@apache.org> wrote:

> For the issues introduced by #6664, first and foremost, I express my
> apologies and regrets. However, I don't quite understand why the
> surefire-upgrade would lead to CI crashes and failure to display results.
> This PR was updated on March 24th, and while we did observe some failures
> in HDFS unit tests, the inability of CI to display results seems to be a
> recent occurrence. I'm not sure if this is related to our configuration of
> Hadoop On Windows CI. Nonetheless, we still hope to upgrade the surefire
> version to a better one. The original 3.0.0-M1 version is quite ancient.
>
> If we confirm that the issue stems from #6664, I think we can roll back
> this PR. A better approach would be to clearly define the conditions
> required for us to upgrade the surefire version.
>
> Best Regards,
> Shilun Fan.
>
> > The build issues in trunk are because of that surefire-upgrade, the
> > build crashing & not posting the result on the PR, so reverting that
> > should make things work.
> >
> > Checking the PR that Steve mentioned, that is for branch-3.3 and most
> > of the failures are due to OOM "unable to create native thread", that
> > for some reason is quite random in nature and is with us like since
> > stone age and the only solution is retrigger or we can explore
> > increasing the memory like in HADOOP-18680
> >
> > -Ayush
> >
> > On Fri, 7 Jun 2024 at 13:51, Xiaoqiao He wrote:
> >
> > Thanks Ayush for your information.
> >
> >> So, technically it should be an HDFS induced mess, else I would have
> >> reverted by now.
> >
> > Do you mean revert https://github.com/apache/hadoop/pull/6664?
> > If true, +1 to revert it and recover the build flow runs normally.
> > cc @Shilun Fan
> >
> > Best Regards,
> > - He Xiaoqiao
> >
> > On Fri, Jun 7, 2024 at 1:07 PM Ayush Saxena wrote:
> >>
> >> There are two places where you can tune the memory [1] & [2]
> >>
> >> I haven't checked again, but I think it is the same old problem, I
> >> mentioned that in [3] in the last paragraph where there was some
> >> windows failure report. I did drop a comment [4] on that PR telling
> >> about those issues, there were some comments on the Jira but I didn't
> >> follow...
> >>
> >> So, technically it should be an HDFS induced mess, else I would have
> >> reverted by now :-)
> >>
> >> Good Luck!!!
> >>
> >> -Ayush
> >>
> >>
> >> [1] https://github.com/apache/hadoop/blob/2ee0bf953492b66765d3d2c902407fbf9bceddec/hadoop-project/pom.xml#L172
> >> [2] https://github.com/apache/hadoop/blob/2ee0bf953492b66765d3d2c902407fbf9bceddec/dev-support/docker/Dockerfile#L77
> >> [3] https://lists.apache.org/thread/hmzl61ow0sbs10p0hky17xxhsggbhc3g
> >> [4] https://github.com/apache/hadoop/pull/6664#issuecomment-2082356393
> >>
> >> On Fri, 7 Jun 2024 at 08:04, Xiaoqiao He wrote:
> >> >
> >> > Thanks Steve. Try to trigger CI manually and let's wait what it will say.
> >> > BTW, the flaky tests seem not related to UT logic itself, but most of them
> >> > throw OOM. Not sure if @Ayush Saxena knows how to re-config or tune
> >> > the memory of Yetus?
> >> >
> >> > Best Regards,
> >> > - He Xiaoqiao
> >> >
> >> > On Fri, Jun 7, 2024 at 3:59 AM Steve Loughran wrote:
> >> >>
> >> >> PR's which trigger hdfs builds seem to hit a lot of hdfs test failures
> >> >> https://github.com/apache/hadoop/pull/6675
> >> >>
> >> >> Are these regressions or are the tests flaky?
> >> >>
> >> >> I don't want commit patches which break things, yet hdfs tests seem
> >> >> unreliable and so I'm dangerously tempted to +1 anyway...