Hi all,

I have too much code for a "minimal working example."

Every time I run a distributed places program, I get different results.
Sadly, it's complex, and I'm confident there are multiple places I could be
missing something. This is all running on a 256-core machine, and my
distributed places are all ssh connections to localhost. Each worker
spawned gets a unique port. I get non-deterministic results regardless of
whether I am running 2 or 64 or 128 places.

1. I have a queen bee process that reads in IDs from a database table.
2. Workers request, via distributed place channel, an ID.
3. The queen (which has a separate router for each worker, each running in
its own thread) sends an ID. This is dequeued from a semaphore-protected
queue.
4. The workers do queries based on that ID, some arithmetic calculations,
and then send a serializable struct back to the queen (received by the
router assigned to that worker). All comms, actually, use serializable
structs, and I match on those structs at both ends.
5. The queen uses a channel (local/internal) to send that struct to a
thread managing a queue.
6. Because there are multiple writers to that internal channel, a semaphore
is used within the processing thread to protect the queue.
7. A final thread picks elements off the queue (protected by the same
semaphore), and those structs are written into a database. There is a
separate DB for storing the results. There are no contention issues between
the table from which data is read during runtime and the DB where data is
stored at the end.

It's... not the simplest structure. That said, the comms between
queen/worker are deterministic; from ID fetch to final result, there is no
non-determinism. It is a bit of back-and-forth, but unrolled, it is a
pipeline pattern. So, between queen and worker, there should be no
impediments to communication, and no non-deterministic/branching
communications.

If I understand channels in Racket correctly, I can have multiple writers
to a single channel end, so although there is non-determinism in the
routers writing to the queuing thread, they should all be serviced. The
queue itself is, therefore, doubly-protected, once by the comms structure,
and once by the semaphore: it should not be possible for two threads to
be in the queue at the same time, either to insert or to dequeue. The
database tables are either for reading or writing: never the two shall
meet.
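
A minimal sketch of the internal leg described above: several router
threads writing to one channel, one thread draining it into a
semaphore-protected queue. All names are hypothetical and local threads
stand in for the routers; this is an illustration under those assumptions,
not the actual program. It relies on channels created with make-channel
being synchronous, so channel-put does not return until a reader has taken
the value. Place channels are asynchronous, so this says nothing about the
queen/worker leg.

```racket
#lang racket
(require data/queue)

;; Hypothetical names throughout; local threads stand in for the routers.
(define results-ch (make-channel))   ; many routers write, one thread reads
(define q (make-queue))              ; shared result queue
(define q-lock (make-semaphore 1))   ; binary semaphore guarding q

;; Touch the queue only while holding the semaphore.
(define (safe-enqueue! v)
  (call-with-semaphore q-lock (lambda () (enqueue! q v))))
(define (safe-dequeue!)
  (call-with-semaphore
   q-lock
   (lambda () (if (queue-empty? q) #f (dequeue! q)))))

;; Queue-manager thread: drains the channel into the queue until 'done.
(define manager
  (thread (lambda ()
            (let loop ()
              (define v (channel-get results-ch))
              (unless (eq? v 'done)
                (safe-enqueue! v)
                (loop))))))

;; Four "router" threads, each writing 100 results to the same channel.
(define routers
  (for/list ([r (in-range 4)])
    (thread (lambda ()
              (for ([i (in-range 100)])
                (channel-put results-ch (list r i)))))))

(for-each thread-wait routers)
(channel-put results-ch 'done)
(thread-wait manager)

;; Drain the queue and count what actually arrived.
(define received
  (let loop ([acc '()])
    (define v (safe-dequeue!))
    (if v (loop (cons v acc)) acc)))
(printf "received ~a of 400 results\n" (length received))
```

If a count like this ever comes up short in the real program, the loss is
on the internal channel/queue side rather than in the distributed layer.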

1. If a worker chokes on something, will I ever know? I don't believe I
have a logging/error interface set up for the workers. (Besides... if they
die, how do they communicate that back over a channel before dying?) If
they threw an exception, would I hear about it while I'm watching the queen
run?

2. Is it possible for a distributed channel to drop data silently? Should I
add logging that lets me trace all of my comms from Q->W and back again?

3. Should I have a set of logs (for queen and all workers) that let me
reconstruct all comms/computations? I can do this, but I was hoping someone
would say "X possibly fails in Y way...", and that would provide insight
before I do all of that instrumentation work.
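
For the first question, one option is to have the worker catch its own
exceptions and send a failure message back instead of dying quietly. The
sketch below assumes hypothetical prefab structs named result and failure
and a stand-in compute function; a local thread and channel take the place
of a worker place and its channel, so it shows only the pattern, not the
distributed-places API.

```racket
#lang racket

;; Hypothetical message types; prefab structs can be sent over place channels.
(struct result (id value) #:prefab)
(struct failure (id message) #:prefab)

;; Hypothetical stand-in for the per-ID work; fails on id 3 to show the error path.
(define (compute id)
  (if (= id 3)
      (error 'compute "bad row for id ~a" id)
      (* id id)))

;; Worker body: any exn:fail becomes a failure message instead of a silent death.
(define (process-id id)
  (with-handlers ([exn:fail? (lambda (e) (failure id (exn-message e)))])
    (result id (compute id))))

;; A local thread and channel stand in for a worker place and its channel.
(define reply-ch (make-channel))
(define worker
  (thread (lambda ()
            (for ([id (in-list '(1 2 3 4))])
              (channel-put reply-ch (process-id id))))))

;; Queen side: collect one reply per ID, then match on the reply type.
(define replies
  (for/list ([i (in-range 4)])
    (channel-get reply-ch)))

(for ([r (in-list replies)])
  (match r
    [(result id v) (printf "ok   id=~a value=~a\n" id v)]
    [(failure id msg) (printf "FAIL id=~a: ~a\n" id msg)]))
```

This only covers exceptions raised inside the handler's extent; a worker
process that is killed outright cannot send anything, which is a separate
case to detect from the queen's side.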

Ultimately, I think the heart of my questions has to do with the way
workers might silently die (if they can), and whether it is possible that
distributed channels can silently drop data without my queen process
knowing. (The latter would seem like an awfully big thing, and so I
sincerely doubt that's the issue.) Those are two big holes in my mental map
of how distributed places work, and I've spent enough time cleaning
up/refactoring/simplifying things that I thought I would ask the community
at this point.
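
Both concerns (a worker dying silently, a message going missing) look the
same from the queen's side: an ID was handed out and no reply ever came
back. A minimal sketch of bookkeeping that would expose either case,
assuming hypothetical helper names, an arbitrary 0.5-second timeout, and a
local thread standing in for a worker that never answers for one ID:

```racket
#lang racket

;; Hypothetical queen-side bookkeeping: which IDs are still in flight?
(define outstanding (mutable-set))
(define (note-sent! id) (set-add! outstanding id))
(define (note-received! id) (set-remove! outstanding id))

;; Local stand-in for a worker that never replies for id 7.
(define reply-ch (make-channel))
(define ids '(5 6 7 8))
(define worker
  (thread (lambda ()
            (for ([id (in-list ids)])
              (unless (= id 7)
                (channel-put reply-ch id))))))

(for-each note-sent! ids)

;; Collect replies until none arrives within the (arbitrary) timeout.
(let loop ()
  (define id (sync/timeout 0.5 reply-ch))
  (when id
    (note-received! id)
    (loop)))

;; Whatever is left was sent but never answered.
(define missing (sort (set->list outstanding) <))
(printf "IDs sent but never answered: ~a\n" missing)
```

Comparing the leftover set against the rows that reached the results DB
would at least show whether the run-to-run differences come from missing
IDs or from different values for the same IDs.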

I'm bothered because the DB work and calculations are all so deterministic.
I've only parallelized this because of the volume of data that ultimately
needs to be processed...

Cheers,
Matt
