Hi Guile Users! For my project of implementing a decision tree algorithm, which I ported to Guile, I am trying ways of parallelizing the algorithm. Since I did not make use of mutation or global state in the algorithm, it should be fairly simple to parallelize: "Simply split execution into 2 parts at every split of the tree and let those run on different cores." – That's at least the theory. I am using Guile 2.2.6, installed via Guix package manager.
However, I think there might be a bug in Guile's implementation of the parallel forms. There is still a small chance, that somehow the bug is in my code and I overlooked something, that would only matter in a concurrent or parallel scenario. The commit id is: https://notabug.org/ZelphirKaltstahl/guile-ml/src/d61a9535508264dd065760b20caa25a820ed00be I wrote a more detailed bug report in my todo file: https://notabug.org/ZelphirKaltstahl/guile-ml/src/d61a9535508264dd065760b20caa25a820ed00be/todo.org (view the non-rendered version, notabug does not render org files correctly) Here is why I am saying that there might be a bug in the parallel forms somewhere: * When I run the algorithm without concurrency or parallelism, no error happens. Especially all procedures receive correct types of arguments and run without problem. * As stated, I took care to not use mutation anywhere. I am not aware of any mutation in my code. I think almost, if not all, participating procedures have referential transparency. * The bug only happens sometimes and only when using parallel forms. (I have yet to try fibers and futures.) * The project has passing tests, which include fitting the decision tree. Although they do not use the same data set, the structure of the data is the same. No "wrong type" errors happen. The tests do not yet cover every single procedure, but the most important once, I think. * Once the parallel forms are involved, I get the following errors seemingly at random, perhaps approximately every 10th run of the very same script, which makes use of the algorithm. I seemingly get them more often when running on my core 2 duo machine than on my Lenovo Thinkpad X200, which has 2 cores. I can sometimes run the algorithm 5-10 times without error and then the next time I run it, an error will happen: --------8<--------8<-------- scheme@(guile-user)> (define tree (fit #:train-data (shuffle-dataset banking-dataset #:seed 12345) #:feature-column-indices (list 0 1 2 3) #:label-column-index 4 #:max-depth 5 #:min-data-points 12 #:min-data-points-ratio 0.02 #:min-impurity-split (expt 10 -7) #:stop-at-no-impurity-improvement #t)) recursive split on depth: 1 will check for stop conditions now checking for stop condition: max-depth-reached? checking for stop condition: insufficient-data-points-for-split? checking for stop condition: insufficient-data-points-ratio-for-split? checking for stop condition: all-same-label? INFO: CONTINUING SPLITT: still got 1372 data points ice-9/threads.scm:289:22: In procedure loop: In procedure <: Wrong type argument in position 1: 0.39142673091293256 Entering a new prompt. Type `,bt' for a backtrace or `,q' to continue. -------->8-------->8-------- This never happens when not running concurrently and also does not happen on every run. Also I think floats are OK for `<` to handle. I am using my own procedures with correct arguments. Somehow Guile gets confused. Here is another one: --------8<--------8<-------- scheme@(guile-user)> (define tree (fit #:train-data (shuffle-dataset banking-dataset #:seed 12345) #:feature-column-indices (list 0 1 2 3) #:label-column-index 4 #:max-depth 5 #:min-data-points 12 #:min-data-points-ratio 0.02 #:min-impurity-split (expt 10 -7) #:stop-at-no-impurity-improvement #t)) recursive split on depth: 1 will check for stop conditions now checking for stop condition: max-depth-reached? checking for stop condition: insufficient-data-points-for-split? checking for stop condition: insufficient-data-points-ratio-for-split? checking for stop condition: all-same-label? INFO: CONTINUING SPLITT: still got 1372 data points ice-9/threads.scm:289:22: In procedure loop: In procedure <: Wrong type argument in position 1: 0.39142673091293256 Entering a new prompt. Type `,bt' for a backtrace or `,q' to continue. scheme@(guile-user) [1]> ,bt In current input: 161:2 3 (_) In decision-tree.scm: 218:18 2 (recursive-split (#(-2.7028 1.6327 0.83598 -0.091393 1) #(-1.3931 1.5664 7.5382 0.78403 0) #(-5.4414 7.2363 0.10938 -7.5642 1) #(-2.7908 -5.7133 5.953 0.45946 1) #(-3.518 2.8763 0.1548 # 1) # …) …) 110:13 1 (get-best-split _ _ _ #:split-quality-proc _) In ice-9/threads.scm: 289:22 0 (loop _) scheme@(guile-user) [1]> -------->8-------->8-------- It even claims that I cannot use `<` with a floating point number, which is weird, I think: In procedure <: Wrong type argument in position 1: 0.39142673091293256 And here another one: --------8<--------8<-------- [03:05:29]:[~/development/Guile/guile-ml]: guile --debug -L . run.scm recursive split on depth: 1 will check for stop conditions now checking for stop condition: max-depth-reached? checking for stop condition: insufficient-data-points-for-split? checking for stop condition: insufficient-data-points-ratio-for-split? checking for stop condition: all-same-label? INFO: CONTINUING SPLITT: still got 1372 data points calculating best split for feature columns (0 1 2 3) inside get-best-split / let calculating best split for feature column 0 in get-best-split-for-column with column 0 calculating best split for feature column 1 calculating best split for feature column 2 in get-best-split-for-column with column 1 in get-best-split-for-column with column 1 in get-best-split-for-column with column 2 calculating best split for feature column 3 in get-best-split-for-column with column 3 Backtrace: 13 (apply-smob/1 #<catch-closure 1cbb940>) In ice-9/boot-9.scm: 705:2 12 (call-with-prompt _ _ #<procedure default-prompt-handler (k proc)>) In ice-9/eval.scm: 619:8 11 (_ (#(-0.2361 9.3221 2.1307 -4.3793 0) #(-2.4115 -9.1359 9.3444 -0.65259 1) #(# # # …) …)) In ice-9/boot-9.scm: 2312:4 10 (save-module-excursion _) 3832:12 9 (_) In run.scm: 95:2 8 (_) In decision-tree.scm: 236:18 7 (recursive-split (#(-2.7028 1.6327 0.83598 -0.091393 1) #(-1.3931 1.5664 7.5382 # 0) # …) …) 124:13 6 (get-best-split _ _ _ #:split-quality-proc _) In ice-9/threads.scm: 288:21 5 (loop _) In decision-tree.scm: 128:30 4 (_ _) 86:34 3 (get-best-split-for-column (#(-2.7028 1.6327 0.83598 -0.091393 1) #(-1.3931 1.5664 …) …) …) 31:8 2 (split-data _ _ _) In unknown file: 1 (partition #<procedure 1c67b40 at decision-tree.scm:31:27 (data-point)> (#(-2.7028 # …) …)) In decision-tree.scm: 32:29 0 (_ _) decision-tree.scm:32:29: In procedure <: Wrong type: #(-0.72068 -6.7583 5.8408 0.62369 1) -------->8-------->8-------- The log shows at least a place in my code, where supposedly the wrong argument gets in. But there should not be such a wrong argument, as the structure of a data set for the algorithm is always a list of vectors as data points. After adding more debug logging another one: --------8<--------8<-------- will compare the following values: 0.4565034338822441 and 0.5763181258315033 wwill compare the following values: 0.4812939489026074 and 0.7189958135901167 before recurring in iter-col-vals in select-min-cost-split will compare the following values: 0.48037557849768603 and 0.4984472198525817 before recurring in iter-col-vals in select-min-cost-split will compare the following values: 0.4938631012588292 and 0.9781115994890177 before recurring in iter-col-vals in select-min-cost-split will compare the following values: before recurring in iter-col-vals0.48037557849768603 and and e recurring in iter-col-vals0.480375578497686030.6404180444583852 in select-min-cost-split will compare the following values: 0.4565034338822441 and 0.5299176721713759 before recurring in iter-col-vals in select-min-cost-split will compare the following values: (Segmentation fault (core dumped) -------->8-------->8-------- That one is even a segmentation fault, which I thought could never happen from my algorithm, as I am only using "normal" Guile stuff + libraries, no low level manipulations of bits or fiddling with OS threads or anything like that. It should come somewhere from the underlying stuff of parallel forms. Or perhaps from too excessive logging using (display …). I also got one in combination with the usage of `<` and the segmentation fault: --------8<--------8<-------- got back result for column Backtrace: 8 (apply-smob/1 #<catch-closure 130b940>) In ice-9/boot-9.scm: 705:2 7 (call-with-prompt _ _ #<procedure default-prompt-handler (k proc)>) In ice-9/eval.scm: 619:8 6 (_ #(#(#<directory (guile-user) 1397140>))) In ice-9/boot-9.scm: 2312:4 5 (save-module-excursion _) 3832:12 4 (_) In run.scm: 95:2 3 (_) In decision-tree.scm: 239:18 2 (recursive-split (#(-2.7028 1.6327 0.83598 -0.091393 1) #(-1.3931 1.5664 7.5382 # 0) # …) …) 127:13 1 (get-best-split _ _ _ #:split-quality-proc _) In ice-9/threads.scm: 289:22 0 (loop _) ice-9/threads.scm:289:22: In procedure loop: In procedure <: Wrong type argument in position 1: Segmentation fault (core dumped) -------->8-------->8-------- None of this has ever happened when running single core, without concurrency. It is also strange, that it does not seem to happen deterministically at a specific time, like always at the beginning or first split, but can happen at any time during fitting the tree to the training data. I still have to try using fibers instead of parallel forms. I don't really know how to best debug this, besides adding debug logs and seeing how far it comes before getting the errors. But that is not reliable, because multiple threads try to display logs and the outputs could be mixed. I already have test cases and they all pass. Any advice on how to debug such an issue is welcome. I've looked at this for hours, but so far could not find anything standing out, possibly causing this bug in my code. Regards, Zelphir