On Wed, Aug 20, 2025 at 8:49 AM Dilip Modi <[email protected]>
wrote:

> Hello Guacamole community,
>
> We recently encountered a crash in *guacamole-server 1.6.0* while running
> at scale with *300+ concurrent SSH sessions*. The issue consistently
> reproduced when the connection count grew, and after debugging core dumps
> we traced the problem to the *tcp.c event loop implementation*.
> Problem
>
> The existing code in src/libguac/tcp.c used select() with FD_SET to wait
> on sockets. Under heavy load with many connections, FD_SETSIZE
> limitations and the cost of repeatedly rebuilding FD sets caused
> instability. In our case, this eventually led to invalid memory access and
> segmentation faults in the guacd process.
> Change
>
> We replaced the select()-based implementation with poll(), which:
>
>    -
>
>    Removes the dependency on FD_SETSIZE limits.
>    -
>
>    Handles large numbers of file descriptors more efficiently.
>    -
>
>    Avoids the invalid FD_SET bookkeeping that was triggering crashes.
>
> Patch Summary
>
> In tcp.c, we changed the main loop from:
>
> FD_ZERO(&fds);
> FD_SET(fd, &fds);
> if (select(fd + 1, &fds, NULL, NULL, NULL) <= 0)
>     continue;
> if (FD_ISSET(fd, &fds))
>     handle_socket_event(...);
>
> to:
>
>                 /* Use poll() instead of select() */
>                 struct pollfd pfd;
>                 pfd.fd = fd;
>                 pfd.events = POLLOUT;
>
>                 retval = poll(&pfd, 1, timeout * 1000);
>
>                 if (retval > 0) {
>                     int so_error = 0;
>                     socklen_t len = sizeof(so_error);
>                     if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &so_error, &len) 
> < 0 || so_error != 0) {
>                         guac_error = GUAC_STATUS_REFUSED;
>                         guac_error_message = "Socket connect failed.";
>                         close(fd);
>                         continue;
>                     }
>                 }
>             }
>
> Result
>
> After this change, we were able to sustain well beyond 300 concurrent
> sessions without guacd crashing. The server remained stable under stress
> tests and in production workloads.
> ------------------------------
>
> We’d like to share this fix with the community and hear feedback:
>
>    -
>
>    Would it make sense to migrate all similar select() usage in
>    guacamole-server to poll() (or even epoll on Linux) for better
>    scalability?
>    -
>
>    Has anyone else observed similar stability issues with select() under
>    load?
>
>
Dilip,
Thanks, again, for this fix - we appreciate your involvement in the
community. As with other fixes that you've posted to the list, we'd really
appreciate if you'd do the following:
* Request a Jira account and create a Jira issue for the change.
* Fork the Github repository, create a branch for the change, and make the
changes, and submit a pull request so that we can review the changes in the
code, there, and merge the requests.

I think the issues you are finding and suggesting fixes for are quite
valuable, it would just be easier to review the potential changes in pull
requests in Github than to try to evaluate them in e-mails on the mailing
list. The mailing list is definitely the right place to go for general
usage discussion, etc., but for code changes like the above, I think Jira +
Github would be better.

Thanks!
-Nick

Reply via email to