Bug in Guile 2.2.6 parallel forms implementation?

Zelphir Kaltstahl Tue, 14 Jan 2020 13:56:21 -0800

Hi Guile Users!

For my project of implementing a decision tree algorithm, which I ported
to Guile, I am trying ways of parallelizing the algorithm. Since I did
not make use of mutation or global state in the algorithm, it should be
fairly simple to parallelize: "Simply split execution into 2 parts at
every split of the tree and let those run on different cores." – That's
at least the theory. I am using Guile 2.2.6, installed via Guix package
manager.


However, I think there might be a bug in Guile's implementation of the
parallel forms. There is still a small chance, that somehow the bug is
in my code and I overlooked something, that would only matter in a
concurrent or parallel scenario.

The commit id is:
https://notabug.org/ZelphirKaltstahl/guile-ml/src/d61a9535508264dd065760b20caa25a820ed00be

I wrote a more detailed bug report in my todo file:
https://notabug.org/ZelphirKaltstahl/guile-ml/src/d61a9535508264dd065760b20caa25a820ed00be/todo.org
(view the non-rendered version, notabug does not render org files correctly)

Here is why I am saying that there might be a bug in the parallel forms
somewhere:

* When I run the algorithm without concurrency or parallelism, no error
happens. Especially all procedures receive correct types of arguments
and run without problem.
* As stated, I took care to not use mutation anywhere. I am not aware of
any mutation in my code. I think almost, if not all, participating
procedures have referential transparency.
* The bug only happens sometimes and only when using parallel forms. (I
have yet to try fibers and futures.)
* The project has passing tests, which include fitting the decision
tree. Although they do not use the same data set, the structure of the
data is the same. No "wrong type" errors happen. The tests do not yet
cover every single procedure, but the most important once, I think.
* Once the parallel forms are involved, I get the following errors
seemingly at random, perhaps approximately every 10th run of the very
same script, which makes use of the algorithm. I seemingly get them more
often when running on my core 2 duo machine than on my Lenovo Thinkpad
X200, which has 2 cores. I can sometimes run the algorithm 5-10 times
without error and then the next time I run it, an error will happen:

--------8<--------8<--------
scheme@(guile-user)> (define tree
  (fit #:train-data (shuffle-dataset banking-dataset #:seed 12345)
       #:feature-column-indices (list 0 1 2 3)
       #:label-column-index 4
       #:max-depth 5
       #:min-data-points 12
       #:min-data-points-ratio 0.02
       #:min-impurity-split (expt 10 -7)
       #:stop-at-no-impurity-improvement #t))
recursive split on depth: 1
will check for stop conditions now
checking for stop condition: max-depth-reached?
checking for stop condition: insufficient-data-points-for-split?
checking for stop condition: insufficient-data-points-ratio-for-split?
checking for stop condition: all-same-label?
INFO: CONTINUING SPLITT: still got 1372 data points
ice-9/threads.scm:289:22: In procedure loop:
In procedure <: Wrong type argument in position 1: 0.39142673091293256

Entering a new prompt.  Type `,bt' for a backtrace or `,q' to continue.
-------->8-------->8--------

This never happens when not running concurrently and also does not
happen on every run. Also I think floats are OK for `<` to handle. I am
using my own procedures with correct arguments. Somehow Guile gets confused.

Here is another one:

--------8<--------8<--------
scheme@(guile-user)> (define tree
  (fit #:train-data (shuffle-dataset banking-dataset #:seed 12345)
       #:feature-column-indices (list 0 1 2 3)
       #:label-column-index 4
       #:max-depth 5
       #:min-data-points 12
       #:min-data-points-ratio 0.02
       #:min-impurity-split (expt 10 -7)
       #:stop-at-no-impurity-improvement #t))
recursive split on depth: 1
will check for stop conditions now
checking for stop condition: max-depth-reached?
checking for stop condition: insufficient-data-points-for-split?
checking for stop condition: insufficient-data-points-ratio-for-split?
checking for stop condition: all-same-label?
INFO: CONTINUING SPLITT: still got 1372 data points
ice-9/threads.scm:289:22: In procedure loop:
In procedure <: Wrong type argument in position 1: 0.39142673091293256

Entering a new prompt.  Type `,bt' for a backtrace or `,q' to continue.
scheme@(guile-user) [1]> ,bt
In current input:
    161:2  3 (_)
In decision-tree.scm:
   218:18  2 (recursive-split (#(-2.7028 1.6327 0.83598 -0.091393 1) #(-1.3931 
1.5664 7.5382 0.78403 0) #(-5.4414 7.2363 0.10938 -7.5642 1) #(-2.7908 -5.7133 
5.953 0.45946 1) #(-3.518 2.8763 0.1548 # 1) # …) …)
   110:13  1 (get-best-split _ _ _ #:split-quality-proc _)
In ice-9/threads.scm:
   289:22  0 (loop _)
scheme@(guile-user) [1]>
-------->8-------->8--------

It even claims that I cannot use `<` with a floating point number, which
is weird, I think:

In procedure <: Wrong type argument in position 1: 0.39142673091293256

And here another one:

--------8<--------8<--------
[03:05:29]:[~/development/Guile/guile-ml]: guile --debug -L . run.scm
recursive split on depth: 1
will check for stop conditions now
checking for stop condition: max-depth-reached?
checking for stop condition: insufficient-data-points-for-split?
checking for stop condition: insufficient-data-points-ratio-for-split?
checking for stop condition: all-same-label?
INFO: CONTINUING SPLITT: still got 1372 data points
calculating best split for feature columns (0 1 2 3)
inside get-best-split / let
calculating best split for feature column 0
in get-best-split-for-column with column 0
calculating best split for feature column 1
calculating best split for feature column 2
in get-best-split-for-column with column 1
in get-best-split-for-column with column 1
in get-best-split-for-column with column 2
calculating best split for feature column 3
in get-best-split-for-column with column 3
Backtrace:
          13 (apply-smob/1 #<catch-closure 1cbb940>)
In ice-9/boot-9.scm:
    705:2 12 (call-with-prompt _ _ #<procedure default-prompt-handler (k proc)>)
In ice-9/eval.scm:
    619:8 11 (_ (#(-0.2361 9.3221 2.1307 -4.3793 0) #(-2.4115 -9.1359 9.3444 
-0.65259 1) #(# # # …) …))
In ice-9/boot-9.scm:
   2312:4 10 (save-module-excursion _)
  3832:12  9 (_)
In run.scm:
     95:2  8 (_)
In decision-tree.scm:
   236:18  7 (recursive-split (#(-2.7028 1.6327 0.83598 -0.091393 1) #(-1.3931 
1.5664 7.5382 # 0) # …) …)
   124:13  6 (get-best-split _ _ _ #:split-quality-proc _)
In ice-9/threads.scm:
   288:21  5 (loop _)
In decision-tree.scm:
   128:30  4 (_ _)
    86:34  3 (get-best-split-for-column (#(-2.7028 1.6327 0.83598 -0.091393 1) 
#(-1.3931 1.5664 …) …) …)
     31:8  2 (split-data _ _ _)
In unknown file:
           1 (partition #<procedure 1c67b40 at decision-tree.scm:31:27 
(data-point)> (#(-2.7028 # …) …))
In decision-tree.scm:
    32:29  0 (_ _)

decision-tree.scm:32:29: In procedure <: Wrong type: #(-0.72068 -6.7583 5.8408 
0.62369 1)
-------->8-------->8--------

The log shows at least a place in my code, where supposedly the wrong
argument gets in. But there should not be such a wrong argument, as the
structure of a data set for the algorithm is always a list of vectors as
data points.

After adding more debug logging another one:

--------8<--------8<--------
will compare the following values: 0.4565034338822441 and 0.5763181258315033
wwill compare the following values: 0.4812939489026074 and 0.7189958135901167
before recurring in iter-col-vals
in select-min-cost-split
will compare the following values: 0.48037557849768603 and 0.4984472198525817
before recurring in iter-col-vals
in select-min-cost-split
will compare the following values: 0.4938631012588292 and 0.9781115994890177
before recurring in iter-col-vals
in select-min-cost-split
will compare the following values: before recurring in 
iter-col-vals0.48037557849768603 and  and e recurring in 
iter-col-vals0.480375578497686030.6404180444583852
in select-min-cost-split
will compare the following values: 0.4565034338822441 and 0.5299176721713759
before recurring in iter-col-vals
in select-min-cost-split
will compare the following values: (Segmentation fault (core dumped)
-------->8-------->8--------

That one is even a segmentation fault, which I thought could never
happen from my algorithm, as I am only using "normal" Guile stuff +
libraries, no low level manipulations of bits or fiddling with OS
threads or anything like that. It should come somewhere from the
underlying stuff of parallel forms. Or perhaps from too excessive
logging using (display …).

I also got one in combination with the usage of `<` and the segmentation
fault:

--------8<--------8<--------
got back result for column
Backtrace:
           8 (apply-smob/1 #<catch-closure 130b940>)
In ice-9/boot-9.scm:
    705:2  7 (call-with-prompt _ _ #<procedure default-prompt-handler (k proc)>)
In ice-9/eval.scm:
    619:8  6 (_ #(#(#<directory (guile-user) 1397140>)))
In ice-9/boot-9.scm:
   2312:4  5 (save-module-excursion _)
  3832:12  4 (_)
In run.scm:
     95:2  3 (_)
In decision-tree.scm:
   239:18  2 (recursive-split (#(-2.7028 1.6327 0.83598 -0.091393 1) #(-1.3931 
1.5664 7.5382 # 0) # …) …)
   127:13  1 (get-best-split _ _ _ #:split-quality-proc _)
In ice-9/threads.scm:
   289:22  0 (loop _)

ice-9/threads.scm:289:22: In procedure loop:
In procedure <: Wrong type argument in position 1: Segmentation fault (core 
dumped)
-------->8-------->8--------

None of this has ever happened when running single core, without
concurrency. It is also strange, that it does not seem to happen
deterministically at a specific time, like always at the beginning or
first split, but can happen at any time during fitting the tree to the
training data.

I still have to try using fibers instead of parallel forms.

I don't really know how to best debug this, besides adding debug logs
and seeing how far it comes before getting the errors. But that is not
reliable, because multiple threads try to display logs and the outputs
could be mixed. I already have test cases and they all pass.

Any advice on how to debug such an issue is welcome. I've looked at this
for hours, but so far could not find anything standing out, possibly
causing this bug in my code.

Regards,
Zelphir

Bug in Guile 2.2.6 parallel forms implementation?

Reply via email to