On 21/10/2022 07:56, Daniel Henrique Barboza wrote:
Matheus,
I did some digging yesterday. There are 2 distinct things happening:
- the apparent problem with the avocado test. After more and more
testing, it seems that the test failure rate is lower than 10%. With a
simple script to exercise it on my laptop:
n=1
while true; do
    make -j check-avocado \
        AVOCADO_TESTS='tests/avocado/replay_kernel.py:ReplayKernelNormal.test_ppc64_e500'
    if [ $? -ne 0 ]; then
        echo "test failed after $n iterations"
        exit 1
    fi
    n=$((n + 1))
done
In master I managed to get up to 100+ runs without failure. Sometimes I
get 90, 50, 30 runs before failure and so on. This is an OK failure rate
in my opinion, so if any code contribution does not dramatically increase
this failure rate I'm fine with it. This also means that I'll not be
skipping the test.
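The loop above stops at the first failure, so each run of the script yields a single "runs before failure" sample. A minimal sketch of a complementary approach, assuming a fixed number of iterations is acceptable, counts failures instead of stopping. The helper name measure_failure_rate is hypothetical and not part of QEMU or avocado:

```shell
# Hypothetical helper: run a command a fixed number of times and report
# how many of those runs exited with a non-zero status.
# Usage: measure_failure_rate RUNS COMMAND [ARGS...]
measure_failure_rate() (
    runs=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Discard the command's output; only its exit status is counted.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures/$runs runs failed"
)
```

For example, `measure_failure_rate 100 make -j check-avocado AVOCADO_TESTS='tests/avocado/replay_kernel.py:ReplayKernelNormal.test_ppc64_e500'` would print a line of the form "F/100 runs failed", which makes it easier to compare two trees over the same number of runs.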
Thanks for this testing, I suspect we may have more than one bug that
causes this test failure.
- back to this series, I couldn't manage to get a single successful run
with patch 27 applied. On the other hand, running the aforementioned
script with patches 1-26 I just got 96 test runs before the first
failure. This is enough evidence for me to believe that, yeah, patch 27
is really doing something that is messing with the icount replay for
e500 one way or the other.
Patch 27 is definitely wrong - other places that write in special
registers and SPRs that may cause an interrupt (e.g.,
gen_helper_store_decr, gen_mtmsr[d]) call gen_io_start, so we also
should use it before helper_ppc_maybe_interrupt. Without that call, we
hit the cpu_abort in icount_handle_interrupt when using icount if
wrtee[i] unmasks a pending interrupt.
The current wrtee[i] may be wrong in not calling it too, as it may
cause an interrupt to be delivered. However, before the interrupt
rework, CPU_INTERRUPT_HARD was set somewhere else, so it wouldn't
trigger the abort.
That said, even after adding this call I still see failures after ~200
iterations of this test, so we may have more problems to tackle here.
However, it's not a CPU abort anymore; the second QEMU invocation exits
with status zero without writing anything to the console.
All that said, patches 1-26 are queued in ppc-next.
On 10/20/22 10:40, Matheus K. Ferst wrote:
On 20/10/2022 08:18, Daniel Henrique Barboza wrote:
On 10/19/22 18:55, Daniel Henrique Barboza wrote:
Matheus,
This series fails 'make check-avocado' in an e500 test. This is the
error output:
Scrap that.
This avocado test is also failing on master 10% of the time, give or
take. It might be the case that patch 27 makes the failure more
consistent, but I can't say it's the culprit.
I'll take a closer look and see if I can identify one particular commit
that is making the test fail 1 out of 10 times. It may be the case that
I need to skip the test altogether.
Nice catch. I guess we need a gen_icount_io_start before calling
helper_ppc_maybe_interrupt, so maybe it's better to make a
gen_ppc_maybe_interrupt that calls icount and the helper. I'll give it
a bit more testing and re-spin the series.
No need to re-spin everything (unless you need to make changes to the
earlier patches). Just resend patches 27+.
Ok, I'll send 27-29 based on ppc-next.
Thanks,
Matheus K. Ferst
Instituto de Pesquisas ELDORADO <http://www.eldorado.org.br/>
Software Analyst