Hi,

On Thu, Feb 09, 2023 at 10:20:36AM +0000, Richard Purdie wrote:
> On Thu, 2023-02-09 at 10:09 +0200, Mikko Rapeli wrote:
> > When qemu machine hangs, the ssh commands done by tests
> > are not timing out. do_testimage() task has last logs like this:
> > 
> > DEBUG: time: 1673531086.3155053, endtime: 1673531686.315502
> > 
> > The test process is stuck for hours, or for ever if the
> > executing command or test case did not set a timeout correctly.
> > The default 300 second timeout is not working when target hangs.
> > Note that timeout is really a "inactive timeout" since data returned
> > by the process will reset the timeout.
> > 
> > Make the process stdout non-blocking so read() will always return
> > right away using os.set_blocking() available in python 3.5 and later.
> > 
> > Then change from python codec reader to plain read() and make
> > the ssh subprocess stdout non-blocking. Even with select()
> > making sure the file had input to be read, the codec reader was
> > trying to find more stuff and blocking for ever when process hangs.
> > 
> > While at it, add a small timeout to read data in larger chunks if
> > possible. This avoids reading data one or few characters at a time
> > and makes the debug logs more readable.
> > 
> > close() the stdout file in all cases after read loop is complete.
> > 
> > Then make sure to wait or kill the ssh subprocess in all cases.
> > Just reading the output stream and receiving EOF there does not mean
> > that the process exited, and wait() needs a timeout if the process
> > is hanging. In the end kill the process and return the return value
> > and captured output utf-8 encoded, just like before these changes.
> > 
> > This fixes ssh run() related deadlocks when a qemu target hangs
> > completely.
> > 
> > Signed-off-by: Mikko Rapeli <mikko.rap...@linaro.org>
> > ---
> >  meta/lib/oeqa/core/target/ssh.py | 33 +++++++++++++++++++++++++-------
> >  1 file changed, 26 insertions(+), 7 deletions(-)
> > 
> > diff --git a/meta/lib/oeqa/core/target/ssh.py 
> > b/meta/lib/oeqa/core/target/ssh.py
> > index 13fd5b2a49..466a795eb4 100644
> > --- a/meta/lib/oeqa/core/target/ssh.py
> > +++ b/meta/lib/oeqa/core/target/ssh.py
> > @@ -229,22 +229,28 @@ def SSHCall(command, logger, timeout=None, **opts):
> >          if timeout:
> >              endtime = starttime + timeout
> >              eof = False
> > +            os.set_blocking(process.stdout.fileno(), False)
> >              while time.time() < endtime and not eof:
> > -                logger.debug('time: %s, endtime: %s' % (time.time(), 
> > endtime))
> >                  try:
> > +                    logger.debug('Waiting for process output: time: %s, 
> > endtime: %s' % (time.time(), endtime))
> >                      if select.select([process.stdout], [], [], 5)[0] != []:
> > -                        reader = codecs.getreader('utf-8')(process.stdout, 
> > 'ignore')
> > -                        data = reader.read(1024, 4096)
> > +                        # wait a bit for more data, tries to avoid reading 
> > single characters
> > +                        time.sleep(0.2)
> > +                        data = process.stdout.read()
> >                          if not data:
> > -                            process.stdout.close()
> >                              eof = True
> >                          else:
> > -                            output += data
> > -                            logger.debug('Partial data from SSH call:\n%s' 
> > % data)
> > +                            # ignore errors to capture as much as possible
> > +                            string = data.decode('utf-8', errors='ignore')
> 
> I've been trying to remember why we have the reader here and  I think
> the issue was you can't call decode on partial data :(.

I get this. Sadly readers read more than file has available and they don't have
timeouts...

> If it happens that there is a multi byte character on the stream
> crossing the boundary then you'll get errors. "ignoring" them just
> means you'll corrupt data. The case is rare but with long running
> commands with utf8 data like some of the ptests, it can be an issue.
>
> That is why the code was written the way it was...

For logging purposes errors are fine. Should the raw data be captured to
output then? Or another conversion to utf-8 after all data has been
read?

Reading more than one character at a time for sure helps here but
granted there may be windows where we hit the in character bytes..

Cheers,

-Mikko
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#176940): 
https://lists.openembedded.org/g/openembedded-core/message/176940
Mute This Topic: https://lists.openembedded.org/mt/96849157/21656
Group Owner: openembedded-core+ow...@lists.openembedded.org
Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-

Reply via email to