Hi,

I am working on a replication system that also runs in the browser
(1). I have come a long way with core.async and a pub-sub
architecture, but in the beginning I avoided error handling and simply
used a cascading close on the pub-sub architecture when errors (e.g.
disconnects) occurred. Lately I introduced <? and go-try as described
here (2) and extended that to go-loops with dedicated error channels.
But this never felt right, as error handling in a distributed system
should not be an afterthought. Erlang has a very successful and sound
error handling concept, as Rich also pointed out (3).
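
For readers who have not seen the pattern from (2), here is a minimal
Clojure sketch of <? and go-try, assuming only core.async itself. The
helper name throw-err and the exact macro bodies are a reconstruction
of the idea, not the code of any particular library.

```clojure
(require '[clojure.core.async :as async])

(defn throw-err
  "Rethrows x if it is an exception, otherwise returns it unchanged."
  [x]
  (if (instance? Throwable x)
    (throw x)
    x))

(defmacro go-try
  "Like go, but an exception thrown in the body becomes the value
  delivered on the result channel instead of being lost."
  [& body]
  `(async/go
     (try
       ~@body
       (catch Throwable t#
         t#))))

(defmacro <?
  "Like <!, but rethrows exceptions taken from the channel.
  Must be used inside a go block."
  [ch]
  `(throw-err (async/<! ~ch)))

;; Usage: the inner error bubbles up through <? into the outer block,
;; where an ordinary try/catch can handle it.
(def result
  (async/go
    (try
      (<? (go-try (throw (ex-info "boom" {}))))
      (catch Exception e
        (.getMessage e)))))
```

The point of the design is that errors travel over the same channel as
ordinary values, so the consumer decides where to handle them.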

The idea is basically that uncaught errors will always happen at some
point, and when they do they simply propagate and trigger the restart
of whole subsystems. Erlang processes, which are roughly an extended
version of go-routines, have ids and are explicitly linked to each
other so that they receive exit messages when one of them fails.

To retain the robustness of Erlang, I have reduced the concept to the
following two requirements:


1) All errors (exceptions) need to be caught and propagated.
2) Resources need to be freed reliably.


Since we operate with channels in core.async and not with named
processes as in Erlang, the supervision context needs to be passed to
routines by some other means, unless go-routines were to be globally
registered as in Erlang. The most natural way is to pass it as a
parameter to the routines, but this can be verbose. To reduce the
verbosity one could use bindings, but these are thread-local and
difficult to reason about, which might break 1).

For requirement 2) the Erlang VM can preemptively terminate processes.
Since this is not possible in core.async, I have decided to inject
"abort" exceptions on every blocking channel op: <?, >?, alt?, ...
This is not perfect, but it will eventually throw an exception in every
go-routine and free all resources, satisfying 2). The supervisor then
needs to track all running go-routines under its supervision and only
restart the subsystem once all of them have finished. Using the
default try-catch-finally exception mechanism inside go-routines,
combined with bubbling through <?, makes the error handling fairly
intuitive and close to Java/JavaScript semantics.
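
The abort injection can be sketched roughly as follows. This is a
simplified illustration, not the code from (4): the macro name
<?-abortable, the explicit abort channel argument and the :aborted
marker are assumptions made up for this example, and the supervisor is
reduced to a single channel that is closed to signal an abort.

```clojure
(require '[clojure.core.async :as async])

(defn throw-if-err
  "Rethrows x if it is an exception, otherwise returns it unchanged."
  [x]
  (if (instance? Throwable x)
    (throw x)
    x))

(defmacro <?-abortable
  "Takes from ch inside a go block, but throws an abort exception as
  soon as the (hypothetical) abort channel is closed."
  [abort ch]
  `(let [abort# ~abort
         ch# ~ch
         ;; :priority true makes a closed abort channel win over ch.
         [v# port#] (async/alts! [abort# ch#] :priority true)]
     (if (= port# abort#)
       (throw (ex-info "Aborted by supervisor." {:type :aborted}))
       (throw-if-err v#))))

;; Usage: the routine would wait ten seconds, but closing the abort
;; channel makes its pending take throw, so catch/finally can run.
(defn run-example []
  (let [abort (async/chan)
        result (async/go
                 (try
                   (<?-abortable abort (async/timeout 10000))
                   :finished
                   (catch Exception e
                     (:type (ex-data e)))))]
    (async/close! abort)
    (async/<!! result)))
```

Because the abort is only observed at channel operations, a routine
that never parks or blocks on a channel will not notice it, which is
the "not perfect" part mentioned above.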

Example (4):

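    (let [;; Supervised routine: catches whatever <? throws (e.g. an
          ;; injected abort) and then cleans up slowly.
          test-fn (fn [super]
                    (go-super super
                              (try
                                (<? (timeout 500))
                                (catch Exception e
                                  (println "Caught:" (.getMessage e)))
                                (finally
                                  (<! (timeout 500))
                                  (println "Cleaned up slowly.")))))
          start-fn (fn [super]
                     (go-super super
                               ;; Nobody takes from this channel, so the
                               ;; exception stays stale in it.
                               (go-try (throw (ex-info "stale" {})))
                               (test-fn super)
                               (<? (timeout 300))
                               ;; Uncaught error in the subsystem.
                               (throw (ex-info "foo" {}))))]
      (<?? (restarting-supervisor start-fn :retries 1 :stale-timeout 100)))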

Since one can accidentally leave exceptions stale in some channel
without taking them, we also need to track these and act after some
timeout to satisfy 1).
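
One way to picture the tracking of stale exceptions is a small
registry. The following is only a sketch under assumptions: the atom
pending-errors and the functions register-error!, take-error! and
stale-errors are hypothetical names for this illustration and do not
describe how (4) implements it.

```clojure
(require '[clojure.core.async :as async])

;; Hypothetical registry: exception -> time in ms at which it was caught.
(def pending-errors (atom {}))

(defn register-error!
  "Remembers a caught exception until somebody takes it."
  [e]
  (swap! pending-errors assoc e (System/currentTimeMillis))
  e)

(defn take-error!
  "Used on the consuming side: deregisters and rethrows exceptions,
  passes all other values through."
  [x]
  (when (instance? Throwable x)
    (swap! pending-errors dissoc x)
    (throw x))
  x)

(defn stale-errors
  "Returns the exceptions that have been pending longer than timeout-ms."
  [timeout-ms]
  (let [now (System/currentTimeMillis)]
    (for [[e t] @pending-errors
          :when (> (- now t) timeout-ms)]
      e)))

;; Usage: an exception is caught and registered, but nobody ever takes
;; it from the channel, so it shows up as stale after the timeout.
(def forgotten
  (async/go
    (try
      (throw (ex-info "stale" {}))
      (catch Throwable t
        (register-error! t)))))
```

A supervisor could poll stale-errors periodically and treat any result
as a failure of the subsystem.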

I am really wondering what other people have thought about error
handling with core.async so far and whether this is the first real
attempt to satisfy these requirements. Since error-handling should be
standardized to compose over library boundaries, I think something
like this should move into core.async eventually. What do you think?

Christian


(1) https://github.com/replikativ/replikativ
(2) http://swannodette.github.io/2013/08/31/asynchronous-error-handling/
(3) http://www.erlang.org/download/armstrong_thesis_2003.pdf
(4) https://github.com/whilo/full.monty/blob/3ba439c6971c8255d34b2d39f1b7619b3b016236/full.async/src/full/async.clj#L451

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
