Hi all,

I would really appreciate some comments and criticism on my cells
implementation, which I've uploaded to this group's files as
dataflow.clj . My goal was to fully exploit the available concurrency,
but not do any unnecessary cell evaluations.


My solution is lazy rather than push-based; if you change a ref that
cells depend on, nothing happens to the cells until you actually ask
for their values. If you call

    (evaluate [cell1 cell2 cell3 ...])

first cell1, cell2, cell3, ... and all their ancestors have their
values set to {:updating}, and the root cells of the computation (all
ancestors of cell1 etc. whose parents are non-cell identities like
refs) are identified. Then, messages are sent to the root cells
telling them to recompute their values.

Each time a cell p is able to compute its value, it sends a message to
each of its children c that needs to update. The message extends c's
value, which is a map, with the key and value {p (p's value)}. If c
has received reports from all its cell parents, it retrieves values of
its non-cell parents, computes its value, and reports to its children.
When cell1, cell2, cell3 ... are all able to compute, the computation
is over. You can optionally call

    (sync-evaluate [cell1 cell2 cell3 ...])

which blocks until the computation finishes using a Java countdown
latch.


This approach (I think) guarantees that each cell recomputes at most
once per computation. This helps certain models perform much better
than with push-based cells. For example, think about m generations of
n cells each, where every cell in a generation is a parent of every
cell in the next generation. Say all the root cells depend on a single
ref. If you change the value of the ref, all nm cells need to update.
If you do it with pushes, you first have to update all the root cells,
for n updates. Then each root cell pushes to all of its children, for
n^2 updates. You end up doing waaay more than nm updates.

You could update the generations sequentially to get back to good
performance... but I feel like my approach will incur less overhead
than finding all of the generations.


The example in the testing section of dataflow.clj is less dramatic.
The dependency graph, with -> pointing from parents to child, is:

a b -> c
a c -> d
a -> e
c e -> f

The roots are a and b, and each update function takes 1s.

With my implementation, it runs in 3s with 6 updates:
- a and b update together
- c and e update together
- d and f update together

whereas if you were using a push-based scheme I think it would take 4s
with 9 updates. It would take 4s because:
- a and b update together
- c has to update twice in sequence, once in response to a and once in
response to b
- d and f have to update together in response to c's final update.


My implementation runs the test, but:
- It's pretty ugly,
- I'd eventually like it to be fast & am sure it's nowhere near its
potential,
- I feel like I'm not taking full advantage of the language.

After the basic idea gets some scrutiny, I'd like to include a system
for handling exceptions. I'll also make the cells maintain caches
keyed by their non-cell ancestors. This will save a lot of
coordination overhead in addition to actual computation because if you
evaluate a cell twice in a row it won't have to bother its cell
ancestors at all the second time.

I haven't given much thought to circular dependencies, because I don't
currently need them.


Thanks very much!

Anand
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To post to this group, send email to clojure@googlegroups.com
To unsubscribe from this group, send email to 
clojure+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/clojure?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply via email to