I just caught up with this topic, and since there's no single obvious message to reply to, I'll just reply here.

I agree with Anthony that we had better get the error reporting protocol approximately right the first time, and that rushing it could lead to tears later on. Mind, I said "approximately right", not "perfect".

However, there is more than one way to screw this up. One is certainly to fall short of client requirements in a way that is costly to fix (think incompatible protocol revision). Another is to overshoot them in a way that is costly to maintain. A third is to spend too much time on figuring out the perfect solution.

I believe our true problem is that we're still confused about, and/or disagreeing on, client requirements, and this has led to a design that tries to cover all the conceivable bases, and feels overengineered to me.

There are only so many complexity credits you can spend in a program, both globally and individually. I'm very, very wary of making error reporting more complex than absolutely, desperately necessary. Reporting errors should remain as easy as we can make it, for reasons that have already been mentioned by me and others:

* The more cumbersome it is to report an error, the less it is done, and the more vaguely it is done. If you have to edit more than the error site to report an error accurately, then chances skyrocket that it won't be reported, or that it'll be reported inaccurately. And not because coders are stupid or lazy, but because they make sensible use of their very limited complexity credits: if you can either get the job done with lousy error messages, or not get it done at all, guess what the smart choice is.

* It's highly desirable for errors to do double duty as documentation. Code like this (random pick) doesn't need a comment:

      if (qemu_opt_get(opts, "vlan")) {
          qemu_error("The 'vlan' parameter is not valid with -netdev\n");
          return -1;
      }

  Bury the human-readable description in some far-away error list, and we lose that. Even wrapping it in enough error object construction boilerplate can easily lose it.

So, before we accept the cost of highly structured error objects, I'd like to see a convincing argument for their need. And actual client developers like Dan are in a much better position to make such an argument than server developers (like me) speculating about client needs.

If I understand Dan correctly, machine-readable error code + human-readable description is just fine, as long as the error code is reasonably specific and the description is precise and complete. Have we heard anything else from client developers?

I'm not a client developer, but let me make a few propositions on client needs anyway:

* Clients are just as complexity-bound as the server. They prefer their errors as simple as possible, but no simpler.

* The crucial question for the client isn't "what exactly went wrong". It is "how should I handle this error". Answering that question should be easy (say, check the error code). Figuring out what went wrong should still be possible for a human operator of the client.

* Clients don't want to be tightly coupled to the server.

* No matter how smart and up-to-date the client is, there will always be errors it doesn't know. And it needs to answer the "how to handle" question whether it knows the error code or not! That's why protocols like HTTP have simple rules to classify error codes. Likewise, it needs to be able to give a human operator enough information to figure out what went wrong whether it knows the error or not. How do you expect clients to format a structured error object for an error they don't know into human-readable text? Isn't it much easier and more robust to cut out the formatting middle-man and send the text along with the error?

There's a general rule of programming that I've found quite hard to learn and quite painful to disobey: always try the stupidest solution that could possibly work first.
Based on what I've learned about client requirements so far, I figure that solution is "easily classified error code + human-readable description".

What if we go with that now, and later realize that it was *too* stupid, i.e. we need more structure after all? I believe that will be disastrous only if we need an *incompatible* protocol revision to accommodate it. Can't we simply add a "data" member to the error object then?
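
To make the proposal a bit more concrete, here is a rough sketch of what "easily classified error code + human-readable description" could look like from the client's side. Everything in it is hypothetical and only for illustration: the names (struct proto_error, classify, format_error), the numeric codes, and the HTTP-like rule of classifying by the hundreds digit are my assumptions, not an existing or proposed API. The "data" member stands in for the optional extension mentioned above; a client that doesn't know about it simply ignores it.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical error object: code + description, plus an optional
 * "data" member that could be added later without breaking clients. */
struct proto_error {
    int code;           /* machine-readable; class is code / 100 (assumed rule) */
    const char *desc;   /* human-readable, precise and complete */
    const char *data;   /* optional extra structure; NULL when absent */
};

/* What the client decides to do; it never needs to know the exact code. */
enum action {
    ACTION_FIX_REQUEST, /* 4xx: the request was wrong, don't resend as-is */
    ACTION_RETRY,       /* 5xx: server-side trouble, retrying may help */
    ACTION_GIVE_UP      /* anything else: unknown class, be conservative */
};

/* Answer "how should I handle this error" from the code's class alone,
 * so it works for codes this client has never heard of. */
static enum action classify(const struct proto_error *err)
{
    assert(err != NULL);
    switch (err->code / 100) {
    case 4:
        return ACTION_FIX_REQUEST;
    case 5:
        return ACTION_RETRY;
    default:
        return ACTION_GIVE_UP;
    }
}

/* Give the human operator something useful without any per-error
 * formatting logic: just pass the server's text through.
 * Returns what snprintf() returns. */
static int format_error(const struct proto_error *err, char *buf, size_t len)
{
    assert(err != NULL && buf != NULL);
    return snprintf(buf, len, "error %d: %s", err->code, err->desc);
}
```

A client that receives, say, code 499 with a description it has never seen before would still classify it as "fix the request" and could show the description to its operator verbatim, which is the whole point: neither step depends on the client knowing that particular error.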