Hello Guacamole community,

We recently encountered a crash in *guacamole-server 1.6.0* while running
at scale with *300+ concurrent SSH sessions*. The issue consistently
reproduced when the connection count grew, and after debugging core dumps
we traced the problem to the *tcp.c event loop implementation*.

Problem

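The existing code in src/libguac/tcp.c used select() with FD_SET to wait on
sockets. An fd_set can only represent descriptors numbered below FD_SETSIZE
(1024 on typical Linux builds), and calling FD_SET with a larger descriptor
is undefined behavior that writes outside the set. We believe this is what
happened once many connections were open; the sets also had to be rebuilt on
every wait. In our case, this eventually led to invalid memory access and
segmentation faults in the guacd process.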
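
To make the FD_SETSIZE constraint concrete, here is a minimal sketch of a guard that refuses descriptors an fd_set cannot hold. The names fd_fits_in_fd_set and checked_fd_set are hypothetical helpers written for this mail, not libguac functions, and this is not what guacamole-server does today.

```c
#include <stdbool.h>
#include <sys/select.h>

/* Hypothetical helper: true only if fd may be passed to FD_SET/FD_ISSET.
 * Descriptors at or above FD_SETSIZE do not fit in an fd_set. */
static bool fd_fits_in_fd_set(int fd) {
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Hypothetical wrapper around FD_SET that rejects out-of-range descriptors
 * instead of writing outside the set. Returns 0 on success, -1 otherwise. */
static int checked_fd_set(int fd, fd_set* set) {
    if (!fd_fits_in_fd_set(fd))
        return -1;
    FD_SET(fd, set);
    return 0;
}
```

A guard like this only turns the crash into a clean error; it does not lift the limit, which is why we moved away from select() entirely.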

Change

We replaced the select()-based implementation with poll(), which:

   - Removes the dependency on FD_SETSIZE limits.
   - Handles large numbers of file descriptors more efficiently.
   - Avoids the invalid FD_SET bookkeeping that was triggering crashes.
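
As an illustration of the points above, a single-descriptor wait with poll() could look like the sketch below. wait_readable is a hypothetical name used only for this example, and the return convention is our own assumption. poll() takes the descriptor by value in a struct pollfd, so the descriptor's numeric value does not matter.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper: waits until fd is readable.
 * Returns 1 if readable (or at end-of-file), 0 on timeout, -1 on error.
 * timeout_ms is in milliseconds; a negative value waits indefinitely. */
static int wait_readable(int fd, int timeout_ms) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int retval;

    /* Retry if interrupted by a signal. Note that this simple loop
     * restarts the full timeout on each retry. */
    do {
        retval = poll(&pfd, 1, timeout_ms);
    } while (retval < 0 && errno == EINTR);

    if (retval <= 0)
        return retval; /* 0 = timeout, -1 = poll() failed */

    /* POLLERR and POLLNVAL are reported even when not requested. */
    if (pfd.revents & (POLLERR | POLLNVAL))
        return -1;

    return 1;
}
```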

Patch Summary

In tcp.c, we changed the main loop from:

FD_ZERO(&fds);
FD_SET(fd, &fds);
if (select(fd + 1, &fds, NULL, NULL, NULL) <= 0)
    continue;
if (FD_ISSET(fd, &fds))
    handle_socket_event(...);

to:

/* Use poll() instead of select() */
struct pollfd pfd;
pfd.fd = fd;
pfd.events = POLLOUT;

retval = poll(&pfd, 1, timeout * 1000);

if (retval > 0) {
    int so_error = 0;
    socklen_t len = sizeof(so_error);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &so_error, &len) < 0
            || so_error != 0) {
        guac_error = GUAC_STATUS_REFUSED;
        guac_error_message = "Socket connect failed.";
        close(fd);
        continue;
    }
}

Result

After this change, we were able to sustain well beyond 300 concurrent
sessions without guacd crashing. The server remained stable under stress
tests and in production workloads.
------------------------------

We’d like to share this fix with the community and hear feedback:

   - Would it make sense to migrate all similar select() usage in
     guacamole-server to poll() (or even epoll on Linux) for better
     scalability?
   - Has anyone else observed similar stability issues with select() under
     load?

Thanks,
Dilip
