The root cause appears to be different from the geoip database download in
Elastic.
By default, Elastic will stop accepting writes when disk usage goes over 90%.
I've now added a setting to disable the disk usage threshold in the PR [1].
A similar setting is applied in elastic-github-actions [2].
Once the build passes for the PR [3], I'll proceed with merging it to unblock 
Pulsar CI.
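
As a rough illustration (this is a minimal sketch, not the exact change in
[1]; the image tag and class name are placeholders), both settings discussed
in this thread can be passed to the Elasticsearch test container as
environment variables, e.g. with Testcontainers:

    import org.testcontainers.elasticsearch.ElasticsearchContainer;

    public class ElasticContainerSettingsSketch {
        public static void main(String[] args) {
            try (ElasticsearchContainer elastic = new ElasticsearchContainer(
                    "docker.elastic.co/elasticsearch/elasticsearch:7.17.7")) {
                // Skip the geoip database download that the container runs at startup.
                elastic.withEnv("ingest.geoip.downloader.enabled", "false");
                // Disable the disk usage watermarks so Elasticsearch keeps accepting
                // writes even when the CI runner's disk is nearly full.
                elastic.withEnv("cluster.routing.allocation.disk.threshold_enabled", "false");
                elastic.start();
                // ... run the test against elastic.getHttpHostAddress() ...
            }
        }
    }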

-Lari

[1] - 
https://github.com/lhotari/pulsar/commit/d959eb4929d4192fb56c140a8b590e0ba25d866b
[2] - 
https://github.com/elastic/elastic-github-actions/blob/562b8b6ae4677da97273ff6bc4d630ce96ecbaa5/elasticsearch/run-elasticsearch.sh#L41
[3] - https://github.com/apache/pulsar/pull/20671

On 2023/06/28 13:05:30 tison wrote:
> > I guess nobody proceeded with disabling the test.
> 
> Yeah. I'm not in a hurry, but wanted to bring up the case. It seems no one
> is urgently blocked and we have time to investigate it :D
> 
> Thanks for your investigation and patch! Indeed.
> 
> Best,
> tison.
> 
> 
> Lari Hotari <lhot...@apache.org> wrote on Wed, Jun 28, 2023, at 20:58:
> 
> > I guess nobody proceeded with disabling the test.
> >
> > I have investigated the problem and written a short guide on
> > investigating integration tests in the real GitHub Actions VM
> > environment over SSH.
> > This guide is a comment on the issue:
> > https://github.com/apache/pulsar/issues/20661#issuecomment-1611216464
> >
> > While investigating the failing test, it suddenly started passing and I
> > couldn't reproduce the issue, so I haven't caught the problem yet. This
> > also means that the problem is transient.
> >
> > I suspect that the geoip database download that the Elastic container
> > performs at startup is causing the issues. There's also an Elasticsearch
> > pull request (#92335) about the default geoip download [1]. The download
> > can be disabled by setting `ingest.geoip.downloader.enabled` to `false`
> > in the container environment.
> >
> > The geoip download might not be the root cause, but I'm now testing a
> > change that disables the geoip database download and enables logging of
> > the Elastic container's stdout and stderr output.
> >
> > The PR is https://github.com/apache/pulsar/pull/20671 .
> >
> > -Lari
> >
> > [1] https://github.com/elastic/elasticsearch/pull/92335
> >
> > On 2023/06/28 01:52:14 tison wrote:
> > > See also https://github.com/apache/pulsar/issues/20661
> > >
> > > Enrico and I both verified that it works well locally, so it may be an
> > > environment issue or an unstable dependency - I checked that the ES
> > > image has not changed, though.
> > >
> > > If we cannot locate the cause quickly, perhaps we should disable the
> > > test to unblock other PRs first?
> > >
> > > I tried to read the code, but there is no obvious cause (and the test
> > > passes locally). The log indicates that the statistics received one
> > > message instead of the expected 20, but since other test cases pass, it
> > > may not be an issue in the core logic.
> > >
> > > Best,
> > > tison.
> > >
> >
> 
