Hello fellow Racketeers,
my spare-time out-of-curiosity venture into using HPR (High-Performance
Racket) for creating a software 3D rendering pipeline seems to be
pushing futures up against some rough edges.
The scenario is sort of "usual":
* 7 futures + 1 in RTT that form a binary tree
* GUI thread running
But this time, the futures perform not only data-heavy fixnum
operations, but flonum operations as well.
Something along the lines of 2560x1440 fixnums and the same number of
flonums is being handled in 8 threads effectively (give or take some
optimizations that slightly lower the 1440 height usually).
The code in question is relatively short - say 60 lines of code -
however it does not make much sense without the remaining 2k lines :)
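To give an idea of the shape only, here is a minimal sketch of such a
binary tree of futures. It is not the actual code: the names fx-buf,
fl-buf, render-rows! and render-tree! are hypothetical, and the per-pixel
work is just a placeholder that does one fixnum store and one flonum
store.

```racket
#lang racket/base
(require racket/future racket/fixnum racket/flonum)

;; Buffer sizes taken from the numbers mentioned above.
(define width 2560)
(define height 1440)

;; Hypothetical buffers: one fixnum and one flonum per pixel.
(define fx-buf (make-fxvector (* width height) 0))
(define fl-buf (make-flvector (* width height) 0.0))

;; Placeholder work for rows y0 (inclusive) to y1 (exclusive).
(define (render-rows! y0 y1)
  (for* ([y (in-range y0 y1)]
         [x (in-range width)])
    (define i (fx+ (fx* y width) x))
    (fxvector-set! fx-buf i (fxand i 255))
    (flvector-set! fl-buf i (fl/ (fx->fl x) (fx->fl width)))))

;; Split the row range in half `depth` times. At each split one half
;; goes to a new future and the other half stays in the current thread,
;; so depth 3 creates 1 + 2 + 4 = 7 futures while the calling thread
;; keeps one of the 8 leaf ranges for itself.
(define (render-tree! y0 y1 depth)
  (cond
    [(fx= depth 0) (render-rows! y0 y1)]
    [else
     (define mid (fxquotient (fx+ y0 y1) 2))
     (define f (future (lambda () (render-tree! y0 mid (fx- depth 1)))))
     (render-tree! mid y1 (fx- depth 1))
     (touch f)]))

(render-tree! 0 height 3)
```

Every future is touched before its parent returns, so the whole frame is
finished once the top-level call returns.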
If the operation runs without futures, entirely in the RTT, nothing bad
happens. But with futures, under heavy load and after a VERY varying
amount of time (seconds to hours), it completely freezes, with the
following symptoms:
* 1 CPU being used at 100% (top/htop shows)
* Does not handle socket operations (X11 WM message for closing the window)
* Does not respond to keyboard (or via kill) SIGINT
* Can only be forcibly stopped by SIGKILL (or similar) or forcefully
closing the window from WM which sort of gets handled probably in the
lower-level parts of GDK completely without Racket runtime intervention
(just prints Killed and the exit code is 137)
Based on these observations I can only conclude that it is the RTT that
gets stuck - but that is only the native thread perspective. From the
Racket thread perspective, it can be either the "main" application thread,
which is in (thread-wait) for the thread that performs the futures stuff,
or the GUI thread, which is created by parameterizing the eventspace
(that is just some trickery to allow me to send breaks when I
receive a window close event).
Apart from the obvious strace (after the freeze) and gdb (before/after the
freeze) debugging to find possible sources of this bug, is there even a
remote possibility of getting any clue as to how this can happen based on
the information gathered so far? My thoughts go along these lines:
* flonums are boxed - but for some operations they may be immediate
* apparently it is a busy-wait loop in RTT, otherwise 100% CPU usage is
impossible with this workload
* unsafe ops are always suspicious, but the problem shows up even
when I switch to the safe versions - it just takes longer to appear
* which means, the most probable cause is a race condition
And that is basically all I can tell right now.
Of course, any suggestions would be really welcome.
Cheers,
Dominik
P.S.: I am really curious, what will I find when I finally put
fsemaphores into the mix...