On 01/02/2014 10:46 PM, Herndon, John Luke wrote:
On 1/2/14, 11:36 AM, "Gordon Sim" <g...@redhat.com> wrote:
On 12/20/2013 09:26 PM, Herndon, John Luke wrote:
On Dec 20, 2013, at 12:13 PM, Gordon Sim <g...@redhat.com> wrote:
On 12/20/2013 05:27 PM, Herndon, John Luke wrote:
Other protocols may support bulk consumption. My one concern with
this approach is error handling. Currently the executors treat
each notification individually. So let's say the broker hands over
100 messages at a time. When the client is done processing the
messages, the broker needs to know whether message 25 had an error
or not. We would somehow need to communicate back to the broker
which messages failed. I think this may take some refactoring of
the executors/dispatchers. What do you think?
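Just to make the concern concrete, here is a rough sketch (all names
are hypothetical; this is not the current executor API) of a batch
dispatch reporting a disposition per message rather than one result
for the whole batch:

    # Hypothetical sketch only -- none of these names exist in
    # oslo.messaging today.
    ACK, REQUEUE = 'ack', 'requeue'

    def dispatch_batch(messages, handler):
        dispositions = []
        for msg in messages:
            try:
                handler(msg)
                dispositions.append((msg, ACK))
            except Exception:
                # e.g. message 25 failed: the broker needs to know
                dispositions.append((msg, REQUEUE))
        return dispositions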
[...]
(2) What would you want the broker to do with the failed messages?
What sort of things might fail? Is it related to the message content
itself? Or are the failures suspected to be of a transient nature?
There will be situations where the message can't be parsed, and those
messages can't just be thrown away. My current thought is that
ceilometer could provide some sort of mechanism for sending invalid
messages to an external data store (like a file, or a different topic
on the amqp server) where a living, breathing human can look at them
and try to parse out any meaningful information.
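Something along these lines is what I have in mind, as a sketch only
(the error topic name and the parse_notification()/store() helpers
are all made up):

    # Sketch of dead-lettering unparseable notifications; the topic
    # name and the parse_notification()/store() helpers are
    # hypothetical.
    import json
    import logging

    LOG = logging.getLogger(__name__)
    ERROR_TOPIC = 'ceilometer.notifications.error'

    def process(body, publish):
        try:
            event = parse_notification(body)  # hypothetical parser
        except ValueError:
            # Invalid message: publish it somewhere a human can find
            # it, then ack so it doesn't cycle back to us.
            publish(ERROR_TOPIC, json.dumps({'raw': body}))
            LOG.warning('dead-lettered an unparseable notification')
        else:
            store(event)  # hypothetical db write
        # either way the message is acked from oslo's point of view
        return 'ack'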
Right, in those cases simply requeueing probably is not the right thing
and you really want it dead-lettered in some way. I guess the first
question is whether that is part of the notification system's function,
or if it is done by the application itself (e.g. by storing it or
republishing it). If it is the latter you may not need any explicit
negative acknowledgement.
Exactly, I'm thinking this is something we'd build into ceilometer and not
oslo, since ceilometer is where the event parsing knowledge lives. From an
oslo point of view, the message would be 'acked'.
Other errors might be "database not available", in which case
re-queueing the message is probably the right way to go.
That does mean, however, that the backlog of messages starts to grow on
the broker, so some scheme for dealing with this if the database outage
goes on for a while is probably important. It also means that the
messages will keep being retried, without any 'backoff' while waiting
for the database to be restored, which could increase the load.
This is a problem we already have :(
Agreed, it is a property of reliable (i.e. acknowledged) transfer from
the broker, rather than of batching. And of course, some degree of
buffering here is exactly what message queues are supposed to provide.
The point is simply to provide some way of configuring things so that
this can be bounded, or prevented from taking down the entire broker.
(And perhaps some way of alerting the unfortunate someone!)
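For example, with RabbitMQ the backlog can be bounded when the queue
is declared (the queue name and limit below are arbitrary, and note
that x-max-length discards the oldest messages once the limit is hit,
so it is a blunt instrument):

    # Bounding the backlog on the broker side, using RabbitMQ's
    # x-max-length queue argument via pika. Name and limit are
    # arbitrary examples.
    import pika

    conn = pika.BlockingConnection(
        pika.ConnectionParameters('localhost'))
    channel = conn.channel()
    channel.queue_declare(queue='notifications.info',
                          durable=True,
                          arguments={'x-max-length': 100000})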
https://github.com/openstack/ceilometer/blob/master/ceilometer/notification.py#L156-L158
Since notifications cannot be lost, overflow needs to be detected and the
messages need to be saved. I'm thinking the database being down is a rare
occurrence, worthy of waking someone up in the middle of the night. One
possible solution: flip the collector into an emergency mode and save
notifications to disk until the issue is resolved. Once the db is up and
running, the collector inserts all of these saved messages (as one big
batch!). Thoughts?
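A sketch of that emergency mode (the file location, format, and the
db.record_events() call are all assumptions):

    # Spool notifications to local disk while the db is down, then
    # replay them in one big batch. Paths, format and the db API are
    # assumptions.
    import json
    import os

    SPOOL = '/var/lib/ceilometer/spool.jsonl'

    def spool(notification):
        with open(SPOOL, 'a') as f:
            f.write(json.dumps(notification) + '\n')

    def replay(db):
        if not os.path.exists(SPOOL):
            return
        with open(SPOOL) as f:
            batch = [json.loads(line) for line in f]
        db.record_events(batch)  # one big insert (assumed db API)
        os.remove(SPOOL)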
I'm not sure I understand what you are saying about retrying without a
backoff. Can you explain?
I mean that if the messages are explicitly requeued and the original
subscription is still active, they will be immediately redelivered and
will thus keep cycling from broker to client, back to broker, back to
client, and so on, until the database is available again.
Pulling messages off continually like this without actually being able
to dequeue them may reduce the broker's effectiveness at e.g. paging
out, and in any event involves some unnecessary load on top of the
expanding queue.
It might be better, just as an example, to abort the connection to the
broker (implicitly requeueing all unacked messages), and only reconnect
when the database becomes available (and that can be tried after 1
second, then 2, then 4, etc., up to some maximum retry interval).
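In rough pseudo-Python (connect() and db_is_up() are stand-ins for
the real operations):

    # Sketch of reconnect-with-backoff; connect() and db_is_up() are
    # stand-ins.
    import time

    def reconnect_when_db_ready(connect, db_is_up, max_interval=60):
        interval = 1
        while not db_is_up():
            time.sleep(interval)
            interval = min(interval * 2, max_interval)  # 1, 2, 4, ...
        return connect()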
Another alternative would be to keep the connection to the broker open,
but by neither requeueing nor acking ensure that once the prefetch
limit has been reached, no further messages will be delivered. Then
locally, on the client, retry the processing of the prefetched messages
until the database is back again.
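Roughly, and again only a sketch (DatabaseDown, store() and msg.ack()
are stand-ins):

    # Retry locally against the prefetched messages; the broker stops
    # delivering once the prefetch window is full, because we neither
    # ack nor requeue. All names here are stand-ins.
    import time

    class DatabaseDown(Exception):
        pass

    def retry_locally(pending, store, interval=5):
        # pending: the prefetched, unacked messages
        while pending:
            msg = pending[0]
            try:
                store(msg)  # stand-in db write
                msg.ack()   # stand-in per-message ack
                pending.pop(0)
            except DatabaseDown:
                time.sleep(interval)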
The basic point I'm trying to make is that it seems to me there is
little value in simply handing the messages back to the broker for
immediate redelivery back to the client. It delays the retry certainly,
but at unnecessary expense.
More generally I wonder whether an explicit negative acknowledgement is
actually needed in the notify API at all. If it isn't, that may simplify
things for batching.