Re: [ndctl PATCH v3 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace"

Marc Herbert Fri, 18 Jul 2025 17:22:58 -0700

Thanks Dan for the thorough review, appreciated and...  unfortunately
rare, especially for shell scripts!

On 2025-07-11 17:32, dan.j.willi...@intel.com wrote:
> marc.herbert@ wrote:
>> So, leverage log levels for the PASS/FAIL decision.  This catches all
>> issues and not just the ones printing Call Traces.
>>
>> Add a simple way to exclude expected warnings and errors, either on a
>> per-test basis or globally.
>>
>> Add a way for negative tests to fail when if some expected errors are
>> missing.
>
> Each subsequent "Add" in a changelog is usually an opportunity to split
> the patch into smaller digestable pieces.

As I explained in the cover letter, the problem is: testing. While
logically independent, it's in practice impossible to thoroughly test
these changes independently of each other. That's OK; I'll keep
performing thorough testing in my workspace and I will also make sure no
single, smaller commit introduces any regression in this particular
check_dmesg() area. The current bar is pretty low in this specific
check_dmsg() area so this shouldn't be a problem.

So, in v4 I will submit only the switch from SECONDS to NDTEST_START.
It's 1. logically coming first 2. low risk, low stakes and low
contention. 3. almost half the code: it will be nice to get all that out
of the way.

For that reason, I will for now answer only your comments below that are
related to this first NDTEST_START commit and (preciously) keep your
other comments for a later reply.

>> Stop relying on the magic and convenient but inaccurate $SECONDS bash
>> variable because its precision is (surprise!) about 1 second. In the
>> first version of this patch, $SECONDS was kept and working but it
>> required a 1++ second long "cooldown" between tests to isolate their
>> logs from each other. After dropping $SECONDS from journalctl, no
>> cooldown delay is required.

>>  
>> +time_init()
>> +{
>> +    test "$SECONDS" -le 1 || err 'test/common must be included first!'
>> +    # ... otherwise NDTEST_START is inaccurate
> 
> What is this protecting against... that the test makes sure that
> NDTEST_START happens before any error might have been produced?

Yes - and more generally speaking that no test log will be missing, and
also making sure the test anchors are accurate.

> I think the proposed anchors make this easily debuggable, there are no
> tests that include test/common late, as evidenced by no fixups for this
> in this patch.

There is none now but it's really not that hard to imagine someone
wanting to do random stuff before sourcing test/common. This was not an
issue with $SECONDS and without log anchors; it is one now.

Also, this is just one very simple line; costs practically nothing.

>> +    NDTEST_START=$(LC_TIME=C date '+%F %T.%3N')
>> +
>> +    # Log anchor, especially useful when running tests back to back
>> +    printf "<5>%s@%ds: sourcing test/common: NDTEST_START=%s\n" \
>> +            "$test_basename" "$SECONDS" "$NDTEST_START" > /dev/kmsg
> 
> Why is SECONDS here?

Because test duration is more user-friendly that manually substracting
timestamps; all sorts of people read logs when tests fail and they may
not automatically know that "sourcing common" implies SECONDS is 0 or 1;
it is consistent with other log statements using SECONDS too; it takes
almost no space in the logs; in case the check above gets dropped, this
can become even more useful.

Generally speaking, there is rarely ever "too much" information in
failure logs, only too much "volume" - and SECONDS takes practically
zero space.

> Note that there are some non-shell tests in the suite, not for CXL, but
> might want to make this consistent by following on with wrapping those
> tests in a script.

Good idea!

> Is there a mechanism to opt-out of errors and warnings. Sometimes
> upstream gets overzealous with chatty dmesg and it would be nice to
> quickly check if tests otherwise pass with ignoring messages. Then go
> spend the time to track down new messages tripping up the test.

Good idea (after NDTEST_START).

>> +# notice level to give some information without flooding the (single!)
>> +# testlog.txt file
>> +journalctl_notice()
>> +{
>> +    ( set +x;
>> +      printf ' ------------ More verbose logs at t=%ds ----------\n' 
>> "$SECONDS" )
>> +    journalctl -b --no-pager -o short-precise -p notice --since 
>> "-$((SECONDS*1000 + 1000)) ms"
> 
> Why is SECONDS here and not NDTEST_START?

Because this is approximative (+1000) to give a bit more background and
it's easier to do math on SECONDS in shell scripts than on timestamps.

>>  # $1: line number where this is called
>>  check_dmesg()
>>  {
>> -    # validate no WARN or lockdep report during the run
>> +    local _e_kmsg_no_fail_on=()
>> +    for re in "${kmsg_no_fail_on[@]}" "${kmsg_fail_if_missing[@]}"; do
>> +            _e_kmsg_no_fail_on+=('-e' "$re")
>> +    done
>> +
>> +    # Give some time for a complete kmsg->journalctl flush + any delayed 
>> test effect.
>>      sleep 1
> 
> Feels magical.

It's a "sleep" so it is magical. That does not automatically make it
bad.

> The sleep was only there to make sure that SECONDS rolls over.

The purpose is a bit different now, funny that git diff matched it.

> If a test has delayed effects there is no hard guarantee that 1
> second is sufficient.

There is no such hard guarantee but testing can only prove the existence
of bugs, not their absence. It's not an exact science. This sleep 1 does
not help with effects delayed more than 1s, but it does help with all
delayed effects shorter 1s and that is useful. So, why 1s specifically?
Because 1s is a short time for humans and for most tests in this suite
while being a very long time for computers - plenty enough time for many
timers to expire and for journalctl to get all kernel messages. I can
make it 0.5s, that should work too and save time when running the whole
suite.

>> +    # Log anchor, especially useful when running tests back to back
> 
> This comment is going to go stale, I do not think it helps with
> understanding the implementation.

Sorry, I don't see what specific part could go stale in this
one... Could you elaborate?

>> +    printf "<5>%s@%ds: test/common: check_dmesg() OK\n" "$test_basename" 
>> "$SECONDS" > /dev/kmsg
> 
> I like this, I have long wanted anchors in the log for coordinating
> messages.
> 
>> +
>> +    if "$NDTEST_LOG_DBG"; then
>> +        log_stress from_check_dmesg
>> +    fi
>> +}
>> +# Many tests don't use check_dmesg() (yet?) so double down here. Also, this
> 
> A comment like this belongs in the changelog, not the code. I don't want
> review later patches fixing up the "(yet?)", a comment should help
> understand the present, not comment on some future state.

I will rephrase.

>> +# runs later which is better. But before using this make sure there is
>> +# still no test defining its own EXIT trap.
> 
>> +if "$NDTEST_LOG_DBG"; then
>> +    trap 'log_stress from_trap' EXIT
>> +fi
>> +
>> +log_stress()
>> +{
>> +    printf '<3>%s@%ds: NDTEST_LOG_DBG; trying to break the next 
>> check_dmesg() %s\n' \
>> +            "$test_basename" "$SECONDS" "$1" > /dev/kmsg
>>  }

Re: [ndctl PATCH v3 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace"

Reply via email to