Hi all, I have too much code for a "minimal working example."
Every time I run this distributed places program, I get different results. Sadly, it's complex, and I'm confident there are multiple places where I could be missing something. This is all running on a 256-core machine, and my distributed places are all ssh connections to localhost; each spawned worker gets a unique port. I get non-deterministic results regardless of whether I run 2, 64, or 128 places.

Here's the structure:

1. A queen bee process reads IDs from a database table.
2. Workers request an ID from the queen over a distributed place channel.
3. The queen (which runs a separate router thread for each worker) sends back an ID, dequeued from a semaphore-protected queue.
4. Each worker runs queries based on that ID, does some arithmetic, and sends a serializable struct back to the queen (received by the router assigned to that worker). In fact, all comms use serializable structs, and I match on those structs at both ends.
5. The queen uses an internal (local) channel to send that struct to a thread managing a queue.
6. Because there are multiple writers to that internal channel, a semaphore is used within the processing thread to protect the queue.
7. A final thread picks elements off the queue (protected by the same semaphore) and writes those structs into a database. The results go to a separate DB, so there are no contention issues between the table data is read from at runtime and the DB where results are stored at the end.

It's... not the simplest structure. That said, the comms between queen and worker are deterministic; from ID fetch to final result, there is no non-determinism. There is some back-and-forth, but unrolled it is a pipeline pattern. So, between queen and worker, there should be no impediments to communication and no non-deterministic/branching communications. If I understand channels in Racket correctly, I can have multiple writers to a single channel end, so although there is non-determinism in the order the routers write to the queuing thread, they should all be serviced. The queue itself is therefore doubly protected, once by the comms structure and once by the semaphore: it should never be possible for two threads to be in the queue at the same time, whether to enqueue or dequeue. Each database table is either for reading or for writing: never the two shall meet.

My questions:

1. If a worker chokes on something, will I ever know? I don't believe I have a logging/error interface set up for the workers. (Besides, if they die, how would they communicate that back over a channel before dying?) If a worker threw an exception, would I hear about it while watching the queen run? (There's a sketch below of what I'm considering here.)
2. Is it possible for a distributed channel to drop data silently? Should I add logging that lets me trace all of my comms from queen to worker and back again?
3. Should I have a set of logs (for the queen and all workers) that lets me reconstruct all comms/computations? (Also sketched below.) I can do this, but I was hoping someone would say "X possibly fails in Y way...", which would provide insight before I do all of that instrumentation work.

Ultimately, I think the heart of my questions has to do with whether workers can die silently (if they can), and whether it is possible for distributed channels to silently drop data without my queen process knowing. (The latter would seem like an awfully big thing, so I sincerely doubt that's the issue.)
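To make question 1 concrete, here is a minimal sketch of how I'm thinking of wrapping the worker loop so that a failure comes back to the queen as data instead of (possibly) killing the place silently. The struct names, worker-main, run-queries-and-math, and the #f "done" signal are all placeholders for whatever my real code does, not the actual code:

    #lang racket
    ;; worker.rkt -- started remotely with a distributed place, so the
    ;; start procedure gets the place channel back to the queen.
    (require racket/place)

    ;; Stand-ins for my real serializable message structs (prefab so
    ;; they can cross place channels).
    (struct id-request ()            #:prefab)
    (struct work-result (id payload) #:prefab)
    (struct work-error  (id message) #:prefab)

    (provide worker-main)

    ;; Placeholder for the real per-ID queries + arithmetic.
    (define (run-queries-and-math id)
      (* id id))

    (define (worker-main ch)
      (let loop ()
        (place-channel-put ch (id-request))   ; ask the queen for the next ID
        (define id (place-channel-get ch))    ; #f stands in for my real end-of-work signal
        (when id
          (place-channel-put
           ch
           (with-handlers ([exn:fail?         ; any exception becomes a message
                            (lambda (e) (work-error id (exn-message e)))])
             (work-result id (run-queries-and-math id))))
          (loop))))

On the queen side, each router would then match on work-error as well as work-result, so at worst I'd see the error text instead of a worker just going quiet.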
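And for question 3, the instrumentation I have in mind is just Racket's built-in logger, with one log line per message in each direction, wrapped around the channel operations in both the queen's routers and the worker loops. The topic name and the `who` label are mine, not anything from the library:

    #lang racket
    (require racket/place)

    ;; define-logger gives me log-comm-debug, log-comm-info, etc.
    (define-logger comm)

    ;; Thin wrappers around every channel operation. `who` is a label
    ;; I'd pass in myself, e.g. 'queen or 'worker-17.
    (define (traced-put ch msg who)
      (log-comm-debug "~a put: ~s" who msg)
      (place-channel-put ch msg))

    (define (traced-get ch who)
      (define msg (place-channel-get ch))
      (log-comm-debug "~a got: ~s" who msg)
      msg)

Running the queen with PLTSTDERR="debug@comm" should then show every message on stderr, and I could diff the queen's and the workers' views of the conversation. I'm assuming here that the remote nodes' stderr actually gets relayed back through the distributed-places machinery; if it doesn't, I'd point each worker's logger at its own file instead.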
Those are two big holes in my mental map of how distributed places work, and I've spent enough time cleaning up/refactoring/simplifying things that I thought I would ask the community at this point. I'm bothered because the DB work and calculations are all so deterministic; I've only parallelized this because of the volume of data that ultimately needs to be processed...

Cheers,
Matt