Coordinated, asynchronous modification of state?

Garth Sheldon-Coulson Wed, 21 Apr 2010 08:37:36 -0700

Hi All,

This e-mail is the product of some musing about refs, agents, and atoms. I
would like to offer some thoughts on a possible fourth reference type that
allows coordinated, asynchronous modification of state.


I would greatly appreciate comments from the community. Maybe some of these
ideas have already been discussed on the list or incorporated into
development, in which case I'm sorry for repeating what is already
known---and please point me in the right direction so I can read more. I
hope the ideas are not too far off base, but please let me know if you think
they are.

Let me start at the beginning. In Clojure, we have standard reference types
for:

* uncoordinated, synchronous state change,
* coordinated, synchronous state change,
* uncoordinated, asynchronous state change.

Most Clojure users are familiar with the following diagram expressing the
reference type taxonomy. (The ASCII diagrams in this e-mail are best viewed
in a fixed-width font.)

              | Synchronous  | Asynchronous
--------------+--------------+-----------------
Uncoordinated | Atoms        | Agents
Coordinated   | Refs         |

It occurred to me recently that there is no built-in reference type for the
final quadrant: coordinated, asynchronous change. There is no reference
type, in other words, allowing one to write a "doasync" block similar to a
dosync block:

(doasync   ;short for do asynchronously
  (send a inc)
  (send b inc))

causing a and b to change together atomically in the background. I thought
to myself that it would be nice to have this ability in some applications,
especially simulation, artificial life, and agent-based modeling. It allows
one thread to effect coordinated change in other threads while getting on
with its own work.

I pondered more and realized that, although there is no built-in reference
type for this purpose, one can make a first stab at the same effect using a
combination of futures and refs:

(future
  (dosync
    (alter a inc)
    (alter b inc)))

The future wraps the atomic transaction and causes it to take place in a
background thread.

This solution satisfied me for a day or two. Upon some more thought,
however, I realized that combining refs with futures is a recipe for
livelock disaster. Here is why. In a massively concurrent program with heavy
asynchronous state change, it is unlikely that any one ref will stay the
same for very long. The kind of coordinated change offered by a ref
transaction is unlikely to be successful in this kind of constantly-changing
environment, because the values of the refs are constantly being changed out
from under the transaction attempts. Consider the following illustration, in
which I try to modify a ref at the same time that it is being changed
continually from another thread:

;;;;;;;;;;;; begin example ;;;;;;;;;;;;
(def a (ref 1))

; the code for start-frequent-changes is at
; http://gist.github.com/373143
(start-frequent-changes a)

; this code will never print its message
(let [done? (agent nil)]
  (future
    (dosync
      (alter a (fn negate [val] (Thread/sleep 300) (- val)))
        ; negate is a long-running function of its argument
      (send done? (fn [_] (println "Done!"))))))
;;;;;;;;;;;; end example ;;;;;;;;;;;;

The principle illustrated by this example is the following: when one or more
other threads are changing a ref, a computationally expensive transaction on
that ref is likely to encounter livelock. The more expensive the transaction
and the more frequent the changes, the less likely it is that the
transaction will ever go through. It will retry over and over, consuming
resources and potentially retrying forever.

Rich introduced ensure in part to try to mitigate this problem. Livelock can
be avoided in the example above by placing a simple "(ensure a)" call at the
beginning of the dosync block. Ensure ensures that the transaction holds
"write permissions" on the ref during the transaction, so that other threads
cannot modify it.

However, the combination of futures, refs, and ensure is not a terribly good
solution to coordinated asynchronicity, either, because it has a problem of
its own: The order in which futures are executed is not deterministic. Race
conditions will occur, and this is a big problem when dealing with state.

To illustrate this, suppose one has three refs, a, b, and c. (In a real
scenario, instead of just three refs one might have three groups of 20 refs,
but three refs is enough to illustrate.) One wants to complete three
asynchronous transactions on these refs in the following order: first,
modify a, then modify both a and b, then modify both a and c.

The problem is this: There is no way, using futures, refs, and ensure, to
guarantee that the three asynchronous transactions will happen in the order
one intends. Running the following code will usually print 3, but sometimes
it will print 2, because there is a race condition affecting the second and
third futures.

;;;;;;;;;;;; begin example ;;;;;;;;;;;;
(let [a (ref 1)
      b (ref 2)
      c (ref 3)]
  (future (dosync (ensure a) (alter a #(do (Thread/sleep 200) %))))
  (future (dosync (ensure a) (alter a (fn [_] @b)) (alter b identity)))
  (future (dosync (ensure a) (alter a (fn [_] @c)) (alter c identity)))
  (Thread/sleep 300)
  (println @a))
;;;;;;;;;;;; end example ;;;;;;;;;;;;

This race condition should not be surprising, because the whole point of a
futures is to evaluate its body quickly and asynchronously, without any
guarantees as to timing. However, this kind of flexibility is not amenable
to modifications of state. One usually wants modifications of state to
happen in order, because such operations are usually not commutable.

The solution to the race-condition problem already exists in the language,
of course, in the semantics of agents. When using agents, one can rely on
the fact that the functions sent by send and send-off will always be
evaluated in the order they were sent. If one rewrites the last example
using agents, one sees there is never the potential for a race condition.
The following code always prints 3:

;;;;;;;;;;;; begin example ;;;;;;;;;;;;
(let [a (agent 1)
      b (agent 2)
      c (agent 3)]
  (send a #(do (Thread/sleep 200) %))
  (do (send a (fn [_] @b)) (send b identity))
  (do (send a (fn [_] @c)) (send c identity))
  (Thread/sleep 300)
  (println @a))
;;;;;;;;;;;; end example ;;;;;;;;;;;;

The reason this code is safe, as everyone on this list already knows, is
that agents have a built-in queue. Unlike refs, which rely on a CAS+retry
model, agents guarantee that modifications will be queued and take place in
the order they were requested.

All of these observations led me to try to expand my mental model of
Clojure's reference types. Recall the diagram I started with:

              | Synchronous  | Asynchronous
--------------+--------------+-----------------
Uncoordinated | Atoms        | Agents
Coordinated   | Refs         |

This diagram is overly simplistic. In reality, the diagram should have three
dimensions, not two. The three dimensions are:

* synchronous vs. asynchronous
* uncoordinated vs. coordinated
* CAS+retry vs. queued    <--- this dimension is important

Redrawing the diagram in three dimensions, we now have eight cells instead
of four. We can fill in six of the cells by combining the standard reference
types with futures and await:

Uncoordinated:
          | Synchronous    | Asynchronous
----------+----------------+-----------------
CAS+Retry | Atoms          | Futures + atoms
Queued    | Agents + await | Agents

Coordinated:
          | Synchronous    | Asynchronous
----------+----------------+-----------------
CAS+Retry | Refs           | Futures + refs
Queued    | ?              | ?

Looking at this diagram, I realized that as powerful as the existing
reference types are, it is impossible to fill in the ?'s easily using them
alone.

There is no reason this has to be so. I decided to try to prototype a new
reference type to fill in the ?'s. For now, let me call this new reference
type a serf (as in a worker you can order around).

The semantics of serfs are as follows. This is tentative and very much
subject to suggestions and criticism. Just as a dosync block wraps a ref
transaction, a "send-dosync" block wraps a serf transaction. Within a
send-dosync block, serfs are treated just like refs. You can alter and
commute them, just as you would a ref. There are two differences between
serfs and refs. First, the transaction as a whole is executed
asynchronously, not synchronously. Second, each serf has a queue, just like
agents do. Before a given transaction is executed, the transaction must
reach the front of the queue of every serf involved in the transaction. In
other words, the transaction will be held until every serf involved is free.
Accordingly, the transaction will be executed once and only once; it will
never retry, because it will never contend with another transaction. Because
a serf's queue is blocked while waiting for other serfs to get ready, serfs
offer less concurrency than refs+ensure, which offer less concurrency still
than plain refs. However, serfs allow for deterministic order of execution
in the context of asynchronous, coordinated transactions, which is something
neither refs nor agents can provide. For many applications, this fact may be
enough to outweigh the loss of some concurrency.

Here is a simple example. Remember, within each send-dosync block, the serfs
are treated just like refs. (Indeed, by macro magic they *are* refs within
that scope.)

;;;;;;;;;;;; begin example ;;;;;;;;;;;;
(do
  (def foo (serf 0))
  (println "Before send-dosync: " (serf-deref foo))
  (send-dosync [foo] (alter foo inc))
  (Thread/sleep 100)
  (println "Some time after send-dosync: " (serf-deref foo)))
;;;;;;;;;;;; end example ;;;;;;;;;;;;

;;;;;;;;;;;; begin example ;;;;;;;;;;;;
(let [a (serf 1)]
  (send-dosync [a]
    (println "Type of a: " (type a))))  ; prints "clojure.lang.Ref"
;;;;;;;;;;;; end example ;;;;;;;;;;;;

And here is our old example, the one for which refs+ensure resulted in
race-condition behavior:

;;;;;;;;;;;; begin example ;;;;;;;;;;;;
(let [a (serf 1)
      b (serf 2)
      c (serf 3)]
  (send-dosync [a]   (alter a #(do (Thread/sleep 200) %)))
  (send-dosync [a b] (alter a (fn [_] @b)) (alter b identity))
  (send-dosync [a c] (alter a (fn [_] @c)) (alter c identity))
  (Thread/sleep 300)
  (println (serf-deref a)))
;;;;;;;;;;;; end example ;;;;;;;;;;;;

This will always print 3, never 2, because serfs offer deterministic order
of execution.

The prototype code for serf, serf-deref, and send-dosync is at
http://gist.github.com/373416  The syntax is far from pretty. Ideally,
send-dosync would not require a first vector argument at all; it should be
able to extract the relevant serfs from the calls to alter and commute
inside it, just like dosync does. For now, please overlook the shoddy syntax
and code.

I would really appreciate any and all comments on the idea of this reference
type and on the semantics.

As you'll quickly see if you look at the code, serfs are internally just a
ref with an agent standing sentinel. send-dosync is a simple promise-based
function that makes sure all serfs are ready before starting the dosync
transaction on the underlying refs. There's no special STM magic involved.
The semantics of serfs are built up using the semantics of refs, agents, and
promises. Nonetheless, I think the whole may be qualitatively different than
the sum of the parts, because serfs could be said to fill out the "periodic
table" of Clojure reference types:

Uncoordinated:
          | Synchronous    | Asynchronous
----------+----------------+-----------------
CAS+Retry | Atoms          | Futures + atoms
Queued    | Agents + await | Agents

Coordinated:
          | Synchronous    | Asynchronous
----------+----------------+-----------------
CAS+Retry | Refs           | Futures + refs
Queued    | Serfs + await  | Serfs

(In theory, serfs may be able to support CAS+retry semantics and queued
semantics on the same piece of state. However, I'll leave this thought for
another day because I've gone on long enough already.)

If you have read this far, thank you very much.

I look forward to hearing the community's reactions and thoughts.

Sincerely,
Garth

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Coordinated, asynchronous modification of state?

Reply via email to