The root cause appears to be something other than the geoip database download in Elastic. By default, Elastic will stop writes when the disk usage goes over 90%. I've now added a setting to disable the disk usage threshold in the PR [1]. A similar setting is applied in elastic-github-actions [2]. Once the build passes for the PR [3], I'll proceed with merging it to unblock Pulsar CI.
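
For anyone who wants to try the same workaround outside the Pulsar test code, here is a minimal sketch of how the settings could be passed to an Elasticsearch container as environment variables. It is not the code from the PR. The helper name `docker_run_args` and the image name in the usage note are hypothetical, and I'm assuming the setting in question is `cluster.routing.allocation.disk.threshold_enabled`, which is the Elasticsearch setting that controls the disk-based allocation thresholds.

```python
import shlex

# Settings to pass to the container as environment variables.
# Assumption: the disk threshold is disabled with
# cluster.routing.allocation.disk.threshold_enabled=false.
# ingest.geoip.downloader.enabled is the geoip setting mentioned
# earlier in this thread.
ES_TEST_SETTINGS = {
    "cluster.routing.allocation.disk.threshold_enabled": "false",
    "ingest.geoip.downloader.enabled": "false",
}


def docker_run_args(image, settings):
    """Build a `docker run` argument list with one --env per setting.

    Hypothetical helper for illustration only.
    """
    args = ["docker", "run", "--rm"]
    for key, value in settings.items():
        args += ["--env", f"{key}={value}"]
    args.append(image)
    return args


def docker_run_command(image, settings):
    """Render the argument list as a shell-quoted command line."""
    return shlex.join(docker_run_args(image, settings))
```

Calling `docker_run_command("example/elasticsearch:test", ES_TEST_SETTINGS)` returns a command line that can be pasted into a shell; replace the placeholder image name with the image the test actually uses. Disabling the threshold only makes sense for throwaway CI containers, since it removes the protection against filling the disk.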

-Lari

[1] - https://github.com/lhotari/pulsar/commit/d959eb4929d4192fb56c140a8b590e0ba25d866b
[2] - https://github.com/elastic/elastic-github-actions/blob/562b8b6ae4677da97273ff6bc4d630ce96ecbaa5/elasticsearch/run-elasticsearch.sh#L41
[3] - https://github.com/apache/pulsar/pull/20671

On 2023/06/28 13:05:30 tison wrote:
> > I guess nobody proceeded in disabling the test.
>
> Yeah. I'm not in a hurry; I just wanted to bring up the case. It seems no
> one is blocked urgently and we have time to investigate it :D
>
> Thanks for your investigation and patch! Indeed.
>
> Best,
> tison.
>
>
> Lari Hotari <lhot...@apache.org> wrote on Wed, Jun 28, 2023 at 20:58:
>
> > I guess nobody proceeded in disabling the test.
> >
> > I have investigated the problem and written a short guide about
> > investigating integration tests in the real GitHub Actions VM
> > environment using ssh. This guide is a comment on the issue:
> > https://github.com/apache/pulsar/issues/20661#issuecomment-1611216464
> >
> > While I was investigating the failing test, the test suddenly started
> > passing and I couldn't reproduce the issue, so I haven't caught the
> > problem yet. This also means that the problem is transient.
> >
> > I suspect that it's the geoip database download that the Elastic
> > container does at startup time which is causing issues. There's also an
> > elastic issue #92335 about the default geoip download [1]. This can be
> > disabled by setting `ingest.geoip.downloader.enabled` to `false` in the
> > container environment.
> >
> > The geoip download might not be the root cause, but I'm now testing a
> > change that disables the geoip database download and enables logging
> > for the Elastic container's stdout and stderr output.
> >
> > The PR is https://github.com/apache/pulsar/pull/20671 .
> >
> > -Lari
> >
> > [1] https://github.com/elastic/elasticsearch/pull/92335
> >
> > On 2023/06/28 01:52:14 tison wrote:
> > > See also https://github.com/apache/pulsar/issues/20661
> > >
> > > Enrico and I both verified that it works well locally, so it may be an
> > > env issue or an unstable dependency. I checked that the ES image has
> > > not changed, though.
> > >
> > > If we cannot locate the cause quickly, perhaps disable the test to
> > > unblock other PRs first?
> > >
> > > I tried to read the code, but there is no trivial cause (the test even
> > > passed locally). The log indicates that statistics received one message
> > > instead of the 20 expected, but as other test cases passed, it may not
> > > be a kernel logic issue.
> > >
> > > Best,
> > > tison.