Hello fellow Racketeers,

My spare-time, out-of-curiosity venture into using HPR (High-Performance Racket) to build a software 3D rendering pipeline seems to be pushing futures to their rough edges.

The scenario is sort of "usual":

* 7 futures + 1 unit of work in the runtime thread (RTT), together forming a binary tree
* GUI thread running
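
For readers who want a concrete picture of that layout, here is a minimal sketch of a depth-3 binary tree of futures (7 futures plus the calling thread). It is not the actual pipeline code: process-slice!, run-tree! and the tiny vectors are hypothetical stand-ins, and the per-element work is a placeholder for the real fixnum/flonum kernels.

```racket
#lang racket/base
(require racket/future racket/fixnum racket/flonum)

;; Hypothetical leaf kernel: one fixnum and one flonum update per element.
(define (process-slice! fxs fls start end)
  (for ([i (in-range start end)])
    (fxvector-set! fxs i (fx+ (fxvector-ref fxs i) 1))
    (flvector-set! fls i (fl* (flvector-ref fls i) 0.5))))

;; Split [start, end) in half at each level: the upper half goes to a new
;; future, the lower half continues in the current thread.
;; depth 3 gives 8 leaves: 7 futures plus the thread that made the call.
(define (run-tree! fxs fls start end depth)
  (cond
    [(zero? depth) (process-slice! fxs fls start end)]
    [else
     (define mid (+ start (quotient (- end start) 2)))
     (define f
       (future (lambda () (run-tree! fxs fls mid end (sub1 depth)))))
     (run-tree! fxs fls start mid (sub1 depth))
     (touch f)]))

;; Tiny buffers for illustration; the real ones are around 2560x1440 elements.
(define fxs (make-fxvector 16 0))
(define fls (make-flvector 16 2.0))
(run-tree! fxs fls 0 16 3)
```

The calling thread always keeps the lower half for itself, so the last leaf it processes is the "+ 1" that runs outside any future.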

But this time, the futures perform not only data-heavy fixnum operations, but flonum operations as well.

Roughly 2560x1440 fixnums and the same number of flonums are processed across what are effectively 8 threads (give or take some optimizations that usually lower the 1440 height slightly).

The code in question is relatively short (say 60 lines), but it does not make much sense without the remaining 2k lines :)

If the operation runs in the RTT without futures, the problem never occurs. With futures, under heavy load and after a VERY variable amount of time (seconds to hours), the program freezes completely, with these symptoms:

* 1 CPU is used at 100% (as shown by top/htop)
* It does not handle socket operations (the X11 WM message for closing the window)
* It does not respond to SIGINT, whether sent from the keyboard or via kill
* It can only be stopped forcibly, either by SIGKILL (or similar) or by force-closing the window from the WM, which is probably handled in the lower-level parts of GDK without any involvement of the Racket runtime (it just prints Killed and the exit code is 137)

Based on these observations, I can only conclude that it is the RTT that gets stuck, but that is only the native-thread perspective. From the Racket-thread perspective, it could be either the "main" application thread, which sits in (thread-wait) on the thread that performs the futures work, or the GUI thread, which is created by parameterizing the eventspace (that is just some trickery that lets me send breaks when I receive a window-close event).
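
To make the Racket-thread arrangement concrete, here is a minimal sketch of the waiting/break pattern, assuming a plain worker thread. The GUI is left out on purpose: on-window-close is a hypothetical stand-in for the real window-close callback, and the loop body is a placeholder for the futures work.

```racket
#lang racket/base

;; Worker thread: stands in for the thread that drives the futures.
;; A break ends its loop cleanly.
(define worker
  (thread
   (lambda ()
     (with-handlers ([exn:break? (lambda (e) (void))])
       (let loop ()
         (sleep 0.01) ; placeholder for rendering one frame
         (loop))))))

;; Hypothetical window-close callback: it only sends a break to the worker.
(define (on-window-close)
  (break-thread worker))

(sleep 0.05)         ; let the worker run for a few iterations
(on-window-close)    ; simulate the close event arriving
(thread-wait worker) ; the "main" thread blocks here until the worker ends
```

In this sketch the main thread does nothing but block in thread-wait, so if the break is never delivered or never handled, it blocks forever.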

Apart from the obvious strace (after the freeze) and gdb (before/after the freeze) debugging to find possible sources of this bug, is there even a remote possibility of getting any clue as to how this can happen, based on the information gathered so far? My thoughts go along these lines:

* flonums are boxed - but for some operations they may be immediate
* apparently it is a busy-wait loop in the RTT, otherwise 100% CPU usage is impossible with this workload
* unsafe ops are always suspicious, but the problem shows up even when I switch to the safe versions - it just takes longer
* which means the most probable cause is a race condition

And that is basically all I can tell right now.

Of course, any suggestions would be really welcome.

Cheers,
Dominik

P.S.: I am really curious what I will find when I finally put fsemaphores into the mix...



