Re: [racket-users] Distributed places question

George Neuner Tue, 22 Jan 2019 01:39:27 -0800


On 1/21/2019 8:56 PM, Matt Jadud wrote:

Hi all,

I have too much code for a "minimal working example."
Every time I run a distributed places program, I get different results. Sadly, it's complex, and I'm confident there are multiple places I could be missing something. This is all running on a 256-core machine, and my distributed places are all ssh connections to localhost. Each worker spawned gets a unique port. I get non-deterministic results regardless of whether I am running 2 or 64 or 128 places.
1. I have a queen bee process that reads in IDs from a database table.
2. Workers request, via distributed place channel, an ID.
3. The queen (which has a separate router for each worker running in separate threads) sends an ID. This is dequeued from a semaphore-protected queue.
4. The workers do queries based on that ID, some arithmetic calculations, and then send a serializable struct back to the queen (received by the router assigned to that worker). All comms, actually, use serializable structs, and I match on those structs at both ends.
5. The queen uses a channel (local/internal) to send that struct to a thread managing a queue.
6. Because there are multiple writers to that internal channel, a semaphore is used within the processing thread to protect the queue.
7. A final thread picks elements off the queue (protected by the same semaphore), and those structs are written into a database. There is a separate DB for storing the results. There are no contention issues between the table from which data is read during runtime and the DB where data is stored at the end.
It's... not the simplest structure.


No kidding?

My initial wild guess is that the problem is in the queen. Going by the description it's way overly complicated, with plenty of opportunities for something to get lost.

Your whole architecture seems overly complicated for what it does. Why is the coordinator (queen) so intimately involved? Why not have the workers process an ID end to end [read the source table, calculate, update the result table]? All the coordinator should do is load balance and ensure that all the IDs actually get processed.
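To make that suggestion concrete, here is a rough, language-neutral sketch in Python rather than Racket. It uses a local thread pool as a stand-in for distributed places, and `process_id` and `run_coordinator` are hypothetical names, not anything from your program or from the Racket libraries. The only point is the shape: each worker handles one ID from start to finish, and the coordinator does nothing but hand out IDs and record which ones finished or failed.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_id(record_id):
    """Hypothetical worker: handle one ID end to end.

    Stand-in for: query the source table, do the arithmetic,
    and write the row to the result table.
    """
    return record_id * 2


def run_coordinator(ids, max_workers=4):
    """Hand out IDs to workers and record which finished and which failed."""
    done, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted job back to the ID it was given.
        futures = {pool.submit(process_id, i): i for i in ids}
        for fut in as_completed(futures):
            record_id = futures[fut]
            try:
                done[record_id] = fut.result()
            except Exception as exc:
                # A worker failure is recorded instead of being lost.
                failed[record_id] = exc
    return done, failed
```

With this shape the coordinator never touches result data, so there is no internal queue or semaphore to get wrong; every submitted ID ends up in either `done` or `failed`.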

That said, the comms between queen/worker are deterministic; from ID fetch to final result, there is no non-determinism. It is a bit of back-and-forth, but unrolled, it is a pipeline pattern. So, between queen and worker, there should be no impediments to communication, and no non-deterministic/branching communications.
If I understand channels in Racket correctly, I can have multiple writers to a single channel end, so although there is non-determinism in the routers writing to the queuing thread, they should all be serviced. The queue itself is, therefore, doubly-protected, once by the comms structure, and once by the semaphore: it should not be possible for two threads to ever be in the queue, either to insert or dequeue. The database tables are either for reading or writing: never the two shall meet.


Is the *source* table ever being updated outside of your program?
And what is the DBMS?

1. If a worker chokes on something, will I ever know? I don't believe I have a logging/error interface set up for the workers. (Besides... if they die, how do they communicate that back over a channel before dying?) If they threw an exception, would I hear about it while I'm watching the queen run?
2. Is it possible for a distributed channel to drop data silently? Should I add logging that lets me trace all of my comms from Q->W and back again?

I'm not aware that DP channels can drop messages - they are built on top of TCP for reliability. The whole idea is that places may be widely separated and communicating over the Internet.

If a remote place crashes, the only way to know is that it doesn't respond. Instrumenting both sides would be best, but it should be simple enough for the queen to check that there is 1:1 correspondence between work assigned and results returned.


Of course, that doesn't give any assurance that the results are correct.
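The 1:1 check could look something like the sketch below. It is written in Python purely for illustration, and `reconcile` and its arguments are made-up names: I'm assuming the queen can produce a list of the IDs it handed out and a list of the IDs that came back with results.

```python
def reconcile(assigned_ids, returned_ids):
    """Compare IDs handed out against IDs that came back.

    Returns three sorted lists:
      missing    - assigned but never returned (lost or crashed worker)
      unexpected - returned but never assigned
      duplicates - returned more than once
    """
    assigned = set(assigned_ids)
    returned_list = list(returned_ids)
    returned = set(returned_list)

    missing = sorted(assigned - returned)
    unexpected = sorted(returned - assigned)

    # Track IDs seen so far to spot results that arrive more than once.
    seen, dupes = set(), set()
    for record_id in returned_list:
        if record_id in seen:
            dupes.add(record_id)
        seen.add(record_id)

    return missing, unexpected, sorted(dupes)
```

If all three lists come back empty at the end of a run, every assigned ID produced exactly one result. Anything in `missing` points at a worker that died or a message that went astray somewhere in the queen.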

3. Should I have a set of logs (for queen and all workers) that let me reconstruct all comms/computations? I can do this, but I was hoping someone would say "X possibly fails in Y way...", and that would provide insight before I do all of that instrumentation work.
Ultimately, I think the heart of my questions has to do with the way workers might silently die (if they can), and whether it is possible that distributed channels can silently drop data without my queen process knowing. (The latter would seem like an awfully big thing, and so I sincerely doubt that's the issue.) Those are two big holes in my mental map of how distributed places work, and I've spent enough time cleaning up/refactoring/simplifying things that I thought I would ask the community at this point.

There are myriad ways a distributed program can screw up. I can't claim that the Racket place infrastructure is bullet-proof, but it certainly is well tested. Any bugs are far more likely to be in your code. :-)



George
