On 1/21/2019 8:56 PM, Matt Jadud wrote:
> Hi all,
>
> I have too much code for a "minimal working example."
>
> Every time I run a distributed places program, I get different
> results. Sadly, it's complex, and I'm confident there are multiple
> places I could be missing something. This is all running on a 256-core
> machine, and my distributed places are all ssh connections to
> localhost. Each worker spawned gets a unique port. I get
> non-deterministic results regardless of whether I am running 2 or 64
> or 128 places.
>
> 1. I have a queen bee process that reads in IDs from a database table.
> 2. Workers request, via distributed place channel, an ID.
> 3. The queen (which has a separate router for each worker running in
>    separate threads) sends an ID. This is dequeued from a
>    semaphore-protected queue.
> 4. The workers do queries based on that ID, some arithmetic
>    calculations, and then send a serializable struct back to the queen
>    (received by the router assigned to that worker). All comms, actually,
>    use serializable structs, and I match on those structs at both ends.
> 5. The queen uses a channel (local/internal) to send that struct to a
>    thread managing a queue.
> 6. Because there are multiple writers to that internal channel, a
>    semaphore is used within the processing thread to protect the queue.
> 7. A final thread picks elements off the queue (protected by the same
>    semaphore), and those structs are written into a database. There is a
>    separate DB for storing the results. There are no contention issues
>    between the table from which data is read during runtime and the DB
>    where data is stored at the end.
>
> It's... not the simplest structure.

No kidding?

My initial wild guess is that the problem is in the queen. Going by the
description, the architecture is far more complicated than the job
requires, with plenty of opportunities for something to get lost.

Why is the coordinator (queen) so intimately involved? Why not have the
workers process an ID end to end [read the source table, calculate,
update the result table]? All the coordinator should do is load balance
and ensure that all the IDs actually get processed.
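
To make that concrete, here is a minimal sketch of a coordinator that does nothing but hand out IDs and check them off. It is an illustration under assumptions, not your code: local threads and channels stand in for distributed places, and `process-id`, `worker`, and `run-coordinator` are made-up names. Because a single thread owns all of the bookkeeping, the sketch needs no semaphore.

```racket
#lang racket
;; Sketch only: local threads and channels stand in for distributed places.
;; `process-id` is a hypothetical placeholder for the real end-to-end work
;; (read the source table, calculate, write the result table).
(define (process-id id)
  (* id id))

;; Worker: ask for an ID, process it completely, report it done, repeat.
(define (worker inbox)
  (define reply (make-channel))
  (let loop ()
    (channel-put inbox (list 'ready reply))
    (define id (channel-get reply))
    (unless (eq? id 'stop)
      (process-id id)
      (channel-put inbox (list 'done id))
      (loop))))

;; Coordinator: hands out IDs on request and tracks which are in flight.
;; Returns the IDs that were assigned but never reported done
;; (an empty list if every ID was processed).
(define (run-coordinator ids n-workers)
  (define inbox (make-channel))
  (for ([i (in-range n-workers)])
    (thread (λ () (worker inbox))))
  (let loop ([pending ids] [in-flight (set)] [active n-workers])
    (cond
      [(zero? active) (set->list in-flight)]
      [else
       (match (channel-get inbox)
         [(list 'ready reply)
          (cond
            [(null? pending)
             ;; Nothing left to hand out: tell this worker to stop.
             (channel-put reply 'stop)
             (loop pending in-flight (sub1 active))]
            [else
             (channel-put reply (car pending))
             (loop (cdr pending) (set-add in-flight (car pending)) active)])]
         [(list 'done id)
          (loop pending (set-remove in-flight id) active)])])))

;; Example: five IDs shared among three workers.
(define unfinished (run-coordinator '(1 2 3 4 5) 3))
```

If every worker finishes its assignments, `unfinished` is the empty list; anything left in it is an ID that was handed out and never completed.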

> That said, the comms between queen/worker are deterministic; from ID
> fetch to final result, there is no non-determinism. It is a bit of
> back-and-forth, but unrolled, it is a pipeline pattern. So, between
> queen and worker, there should be no impediments to communication, and
> no non-deterministic/branching communications.
>
> If I understand channels in Racket correctly, I can have multiple
> writers to a single channel end, so although there is non-determinism
> in the routers writing to the queuing thread, they should all be
> serviced. The queue itself is, therefore, doubly-protected, once by
> the comms structure, and once by the semaphore: it should not be
> possible for two threads to ever be in the queue, either to insert or
> dequeue. The database tables are either for reading or writing: never
> the two shall meet.

Is the *source* table ever being updated outside of your program?
And what is the DBMS?

> 1. If a worker chokes on something, will I ever know? I don't believe
>    I have a logging/error interface set up for the workers. (Besides...
>    if they die, how do they communicate that back over a channel before
>    dying?) If they threw an exception, would I hear about it while I'm
>    watching the queen run?
>
> 2. Is it possible for a distributed channel to drop data silently?
>    Should I add logging that lets me trace all of my comms from Q->W and
>    back again?

I'm not aware that DP channels can drop messages; they are built on top
of TCP for reliability. The whole idea is that places may be widely
separated and communicating over the Internet.

If a remote place crashes, the only way to know is that it doesn't
respond. Instrumenting both sides would be best, but it should be
simple enough for the queen to check that there is a 1:1 correspondence
between work assigned and results returned.

Of course, that doesn't give any assurance that the results are correct.
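
A rough sketch of that bookkeeping, under assumptions: a plain channel stands in for the place channels, `result` is a hypothetical struct (yours will differ), and `collect-results` is a made-up name. It waits for one result per assigned ID and treats a quiet period longer than `timeout` seconds as "somebody stopped responding".

```racket
#lang racket
;; Sketch only: a plain channel stands in for the place channels.
;; `result` is a hypothetical struct; the real program has its own.
(struct result (id value) #:transparent)

;; Wait for one result per assigned ID, giving up if nothing arrives
;; within `timeout` seconds. Returns two values: a hash of the results
;; received (keyed by ID) and the sorted list of IDs that never came back.
;; Sorting with `<` assumes the IDs are numbers.
(define (collect-results assigned-ids results-ch timeout)
  (let loop ([outstanding (list->set assigned-ids)] [received (hash)])
    (define msg
      (and (not (set-empty? outstanding))
           (sync/timeout timeout results-ch)))
    (match msg
      [(result id value)
       #:when (set-member? outstanding id)
       (loop (set-remove outstanding id) (hash-set received id value))]
      [(result id _)
       ;; Duplicate or never-assigned ID: note it instead of accepting it.
       (log-warning "unexpected result for id ~a" id)
       (loop outstanding received)]
      [#f (values received (sort (set->list outstanding) <))])))

;; Example: three IDs assigned, but the result for ID 2 never arrives.
(define ch (make-channel))
(void (thread (λ ()
                (channel-put ch (result 1 'a))
                (channel-put ch (result 3 'c)))))
(define-values (received missing) (collect-results '(1 2 3) ch 0.5))
```

Here `missing` ends up holding only ID 2. In the real program the missing IDs are the ones worth chasing through the logs.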

> 3. Should I have a set of logs (for queen and all workers) that let me
>    reconstruct all comms/computations? I can do this, but I was hoping
>    someone would say "X possibly fails in Y way...", and that would
>    provide insight before I do all of that instrumentation work.
>
> Ultimately, I think the heart of my questions has to do with the way
> workers might silently die (if they can), and whether it is possible
> that distributed channels can silently drop data without my queen
> process knowing. (The latter would seem like an awfully big thing, and
> so I sincerely doubt that's the issue.) Those are two big holes in my
> mental map of how distributed places work, and I've spent enough time
> cleaning up/refactoring/simplifying things that I thought I would ask
> the community at this point.

There are myriad ways a distributed program can screw up. I can't claim
that the Racket place infrastructure is bullet-proof, but it certainly
is well tested. Any bugs are far more likely to be in your code. :-)
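
On the question of hearing about a worker that throws: one option is for the worker to catch its own exceptions and send them back as an ordinary message. A minimal sketch, assuming hypothetical names `worker-failure` and `run-job`, with a plain channel standing in for the place channel back to the queen (across a real place channel the struct would have to be serializable, like the ones you already send):

```racket
#lang racket
;; Sketch only: `worker-failure` and `run-job` are hypothetical names, and
;; a plain channel stands in for the place channel back to the queen.
(struct worker-failure (id message) #:transparent)

;; Run `do-work` on `id` and send back either its result or, if it raises
;; an exn:fail exception, a failure report carrying the error message.
(define (run-job do-work id out-ch)
  (channel-put
   out-ch
   (with-handlers ([exn:fail? (λ (e) (worker-failure id (exn-message e)))])
     (do-work id))))

;; Example: a job that divides by zero reports a failure for ID 7
;; instead of dying silently.
(define ch (make-channel))
(void (thread (λ () (run-job (λ (id) (/ id 0)) 7 ch))))
(define report (channel-get ch))
```

This only covers exceptions raised inside the job. A worker that is killed outright still shows up only as a missing reply.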
George