On 1/21/2019 8:56 PM, Matt Jadud wrote:
Hi all,

I have too much code for a "minimal working example."

Every time I run a distributed places program, I get different results. Sadly, it's complex, and I'm confident there are multiple places I could be missing something. This is all running on a 256-core machine, and my distributed places are all ssh connections to localhost. Each worker spawned gets a unique port. I get non-deterministic results regardless of whether I am running 2 or 64 or 128 places.

1. I have a queen bee process that reads in IDs from a database table.
2. Workers request, via distributed place channel, an ID.
3. The queen (which has a separate router for each worker running in separate threads) sends an ID. This is dequeued from a semaphore-protected queue.
4. The workers do queries based on that ID, some arithmetic calculations, and then send a serializable struct back to the queen (received by the router assigned to that worker). All comms, actually, use serializable structs, and I match on those structs at both ends.
5. The queen uses a channel (local/internal) to send that struct to a thread managing a queue.
6. Because there are multiple writers to that internal channel, a semaphore is used within the processing thread to protect the queue.
7. A final thread picks elements off the queue (protected by the same semaphore), and those structs are written into a database. There is a separate DB for storing the results. There are no contention issues between the table from which data is read during runtime and the DB where data is stored at the end.

It's... not the simplest structure.

No kidding?

My initial wild guess is that the problem is in the queen.  Going by the description it's way overly complicated, with plenty of opportunities for something to get lost.

More broadly, the whole architecture seems more complicated than the job requires. Why is the coordinator (queen) so intimately involved?  Why not have the workers process an ID end to end [read the source table, calculate, update the result table]?  All the coordinator should do is load balance and ensure that all the IDs actually get processed.
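
A minimal sketch of that split, using plain Racket threads and channels as stand-ins for distributed places (an assumption made only to keep the example self-contained). The names run-pool and process-id! are hypothetical; process-id! represents the worker's whole job for one ID. The coordinator does nothing except hand out IDs and record which ones finished.

```racket
#lang racket/base

;; Hypothetical coordinator: hands out IDs on request and records the
;; IDs that workers report as finished. process-id! stands in for the
;; worker's end-to-end job (read source rows, calculate, write result).
(define (run-pool ids n-workers process-id!)
  (define requests (make-channel)) ; workers send a reply channel here
  (define done (make-channel))     ; workers report finished IDs here
  ;; Each worker: ask for an ID, process it, report it, repeat.
  ;; An answer of #f means there is no work left.
  (for ([_ (in-range n-workers)])
    (thread
     (lambda ()
       (let loop ()
         (define reply (make-channel))
         (channel-put requests reply)
         (define id (channel-get reply))
         (when id
           (process-id! id)
           (channel-put done id)
           (loop))))))
  ;; Coordinator loop: stops once every worker has been told #f.
  (let loop ([todo ids] [live n-workers] [finished '()])
    (if (zero? live)
        (reverse finished)
        (sync
         (handle-evt requests
                     (lambda (reply)
                       (cond
                         [(null? todo)
                          (channel-put reply #f)
                          (loop todo (sub1 live) finished)]
                         [else
                          (channel-put reply (car todo))
                          (loop (cdr todo) live finished)])))
         (handle-evt done
                     (lambda (id)
                       (loop todo live (cons id finished))))))))

;; Usage: three workers, five IDs, and a do-nothing job.
(run-pool '(1 2 3 4 5) 3 void)
```

The returned list is in completion order, so it can differ between runs even though the set of IDs is the same. Note that this sketch shares the weakness under discussion: if process-id! raises an exception, that worker thread dies and the coordinator waits forever.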


That said, the comms between queen/worker are deterministic; from ID fetch to final result, there is no non-determinism. It is a bit of back-and-forth, but unrolled, it is a pipeline pattern. So, between queen and worker, there should be no impediments to communication, and no non-deterministic/branching communications.

If I understand channels in Racket correctly, I can have multiple writers to a single channel end, so although there is non-determinism in the routers writing to the queuing thread, they should all be serviced. The queue itself is, therefore, doubly-protected, once by the comms structure, and once by the semaphore: it should not be possible for two threads to ever be in the queue, either to insert or dequeue. The database tables are either for reading or writing: never the two shall meet.

Is the *source* table ever being updated outside of your program?
And what is the DBMS?


1. If a worker chokes on something, will I ever know? I don't believe I have a logging/error interface set up for the workers. (Besides... if they die, how do they communicate that back over a channel before dying?) If they threw an exception, would I hear about it while I'm watching the queen run?
2. Is it possible for a distributed channel to drop data silently? Should I add logging that lets me trace all of my comms from Q->W and back again?

I'm not aware that DP channels can drop messages - they are built on top of TCP for reliability.  The whole idea is that places may be widely separated and communicating over the Internet.

If a remote place crashes, the only way to know is that it doesn't respond.  Instrumenting both sides would be best, but it should be simple enough for the queen to check that there is 1:1 correspondence between work assigned and results returned.
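
A minimal sketch of that 1:1 check, assuming IDs are numbers and that all three functions are called from a single thread in the queen (for example, the thread that already owns the queue). The names outstanding, note-assigned!, note-result!, and unanswered-ids are hypothetical.

```racket
#lang racket/base
(require racket/set)

;; Every ID handed to a worker is recorded here and removed again
;; when its result comes back.
(define outstanding (mutable-set))

(define (note-assigned! id)
  (set-add! outstanding id))

;; A result for an ID that is not outstanding means either a
;; duplicate result or a result for work that was never assigned.
(define (note-result! id)
  (unless (set-member? outstanding id)
    (error 'note-result! "unassigned or duplicate id: ~a" id))
  (set-remove! outstanding id))

;; After the run: anything still listed was assigned but never answered.
(define (unanswered-ids)
  (sort (set->list outstanding) <))

;; Usage: three IDs go out, only one comes back.
(for ([id '(1 2 3)]) (note-assigned! id))
(note-result! 2)
(unanswered-ids) ; IDs 1 and 3 are still outstanding
```

If unanswered-ids is non-empty at the end of a run, the missing IDs say which worker conversations to look at first.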

Of course, that doesn't give any assurance that the results are correct.
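
On the worker side, one possible way to keep an exception from killing the worker silently is to catch it and send back a failure message in place of the result. This is only a sketch: the result and failure structs and the safe-process function are hypothetical, and compute stands in for the real query-and-arithmetic step. Prefab structs are used here because they can be sent over place channels; the serializable structs already in use would serve the same purpose.

```racket
#lang racket/base

;; Hypothetical message types for the worker's reply.
(struct result (id value) #:prefab)
(struct failure (id message) #:prefab)

;; Run compute on id. Any exn:fail raised inside it is turned into a
;; failure message, so the queen always receives some reply for the ID.
(define (safe-process id compute)
  (with-handlers ([exn:fail?
                   (lambda (e) (failure id (exn-message e)))])
    (result id (compute id))))

;; Usage: the first call yields a result, the second a failure.
(safe-process 4 (lambda (id) (* id 2)))
(safe-process 7 (lambda (id) (/ id 0)))
```

This covers only exceptions raised inside compute. A worker that is killed outright, or that hangs, still shows up only as a missing reply.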


3. Should I have a set of logs (for queen and all workers) that let me reconstruct all comms/computations? I can do this, but I was hoping someone would say "X possibly fails in Y way...", and that would provide insight before I do all of that instrumentation work.

Ultimately, I think the heart of my questions has to do with the way workers might silently die (if they can), and whether it is possible that distributed channels can silently drop data without my queen process knowing. (The latter would seem like an awfully big thing, and so I sincerely doubt that's the issue.) Those are two big holes in my mental map of how distributed places work, and I've spent enough time cleaning up/refactoring/simplifying things that I thought I would ask the community at this point.

There are myriad ways a distributed program can screw up.  I can't claim that the Racket place infrastructure is bullet-proof, but it certainly is well tested.  Any bugs are far more likely to be in your code. :-)


George
