FWIW, I have seen real problems in production where long-running worker goroutines stopped working. We looked into it and found that certain rare requests were panicking without releasing a mutex, which prevented the long-running goroutines from ever acquiring that mutex again.
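In outline, the failure looked something like this. This is only a minimal sketch with made-up names, not the real code:

    package main

    import (
        "log"
        "net/http"
        "sync"
        "time"
    )

    var mu sync.Mutex

    // handler locks a shared mutex and, on a rare input, panics before
    // unlocking it. net/http recovers the panic and keeps the process
    // alive, but the mutex is never released.
    func handler(w http.ResponseWriter, r *http.Request) {
        mu.Lock()
        // No defer mu.Unlock(), so anything that panics below leaks the lock.
        if r.URL.Query().Get("rare") != "" {
            panic("rare request condition")
        }
        mu.Unlock()
        w.Write([]byte("ok\n"))
    }

    // worker is the long-running goroutine; once a handler has leaked
    // the lock, it blocks forever on Lock and silently stops working.
    func worker() {
        for {
            mu.Lock()
            log.Println("worker did its periodic job")
            mu.Unlock()
            time.Sleep(time.Second)
        }
    }

    func main() {
        go worker()
        http.HandleFunc("/", handler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

The one-line fix is to defer mu.Unlock() immediately after the Lock; the hard part was that nothing in the recovered panic pointed us at the leaked lock.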
This took ages to work out - made worse because I'd forgotten that the stdlib recovers from panics in HTTP handlers by default... This is the kind of subtle problem that makes me think that recovering from panics as a way of making the system more reliable can actually lead to nastier problems further down the line.

On 26 April 2017 at 10:38, 'Axel Wagner' via golang-nuts <golang-nuts@googlegroups.com> wrote:
> On Wed, Apr 26, 2017 at 10:55 AM, Peter Herth <he...@peter-herth.de> wrote:
>> On Wed, Apr 26, 2017 at 3:07 AM, Dave Cheney <d...@cheney.net> wrote:
>>> On Wednesday, 26 April 2017 10:57:58 UTC+10, Chris G wrote:
>>>> I think those are all excellent things to do. They do not preclude the
>>>> use of recovering from a panic to assist (emphasis on assist - it is
>>>> certainly no silver bullet) in achieving fault tolerance.
>>>>
>>>> Assuming a web service that needs to be highly available, crashing the
>>>> entire process due to one misbehaved goroutine is irresponsible. There
>>>> can be thousands of other active requests in flight that could fail
>>>> gracefully as well, or succeed at their task.
>>>>
>>>> In this scenario, I believe a well-behaved program should
>>>>
>>>> clearly log all information about the fault
>>>
>>> panic does that
>>
>> No, panic certainly does not do that. It prints the stack trace. A
>> proper logger could add additional information about the program state
>> at the point of the panic, which is not visible from the stack trace.
>> It also might at least be reasonable to perform an auto-save before
>> quitting.
>>
>>> Same; relying on a malfunctioning program to report its failure is
>>> like asking a sick human to perform their own surgery.
>>
>> What makes you think that a panic implies that the whole program is
>> malfunctioning?
>
> But that is not the claim. The claim is that if you discover a condition
> which can uniquely be attributed to a code bug, you should always err on
> the side of safety and prefer bailing out to continuing with a known-bad
> program. It's not "as I see this bug, I know the rest of the program is
> broken too", it's "as I see this bug, I can not pretend that it can't be".
>
>> A panic should certainly be taken seriously, and the computation in
>> which it happened should be aborted. But if you think of a functional
>> programming style
>
> If you are thinking of that, then you are not thinking about Go. Go has
> shared state and mutable data. One of the major arguments here is that
> there is a level of state isolation which, from all we know, is very
> good: the process. If the process dies, all locks are released, file
> descriptors closed and memory freed, so it gives a known-good restarting
> point. In the presence of mutable state, potential data races and code
> bugs, that is the correct layer of isolation to fall back to. I am also
> aware that it is not a perfect layer; you might already have corrupted
> on-disk state or abused a protocol to corrupt some state on the network.
> Those also need to be defended against, but process isolation still
> gives a good tradeoff between efficiency, convenience and safety.
>
> FWIW, I don't believe there is any convincing to be done here on either
> side. There are no technical arguments anymore; it is just that one set
> of people holds one belief and another set of people holds another belief.
> Both certainly do so based on technical arguments, but in the end, they
> are simply weighing them differently.
>
> I mean, I definitely agree that it would be great for a program to never
> crash. Or to have only panics which definitely can't be recovered from.
> Or to have all state isolated and safely expungeable. I agree that the
> process being up for a larger timeslice is valuable and that other
> requests shouldn't fail because one of them misbehaved.
>
> I also assume you agree that errors should be noticed, caught and fixed.
> I assume you agree that crashing a binary will make the bug more
> noticeable, and that crashing would allow you to recover from a safer
> and better-known state. And that being able to recover from any crash
> swiftly, and architecting a service so that processes dying doesn't take
> it down, is valuable, and that bugs shouldn't make it to production.
>
> The facts are straight; this is just a question of opinion and different
> experiences, and I don't see any way out of it other than saying "agree
> to disagree; if you don't think you can tolerate panics, you just can't
> use my stuff, and I won't use yours if I consider it to hide failures or
> be unergonomic".
>
> This argument becomes much more difficult when I'm having it with my
> coworkers, as it does depend on how the service is run, which needs to
> be decided by the team; in regards to this thread, at least we all have
> the luxury that we can agree to disagree and move on :)
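P.S. For concreteness, the "recover, log, and fail only the one request" approach being debated above looks roughly like the sketch below. The names are mine and it is only an outline:

    package main

    import (
        "log"
        "net/http"
        "runtime/debug"
    )

    // withRecover wraps a handler so that a panicking request is logged
    // with a stack trace and answered with a 500, while other in-flight
    // requests keep being served. Note that it does nothing about state
    // the panicking code may have leaked (held locks, half-written
    // data), which is exactly the failure mode in my story at the top
    // of this message.
    func withRecover(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            defer func() {
                if err := recover(); err != nil {
                    log.Printf("panic serving %s %s: %v\n%s",
                        r.Method, r.URL.Path, err, debug.Stack())
                    http.Error(w, "internal server error", http.StatusInternalServerError)
                }
            }()
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/boom", func(w http.ResponseWriter, r *http.Request) {
            panic("simulated bug")
        })
        log.Fatal(http.ListenAndServe(":8080", withRecover(mux)))
    }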