I think I'd call that Option #2. You don't actually need to reprocess each of those messages for its content; you just need to make sure you fix whatever the bug was. (And yes, you clearly need to test the code change, but the test doesn't have to use exactly the data from the originally-failing message.)
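A minimal sketch of Option #2. It assumes a failed message can be reduced to an ID, a map of string properties and a text body; a real JMS `TextMessage` would need its properties copied out first. The point is the serialization choice. Base64-encoding each field keeps newlines, spaces, `:` and `,` in the data from breaking a one-line log format, so a similar test message can be rebuilt later. `FailedMessage`, `FailedMessageLog` and the `FAILED_MESSAGE` line format are hypothetical names, not part of ActiveMQ or JMS. Records require Java 16+.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical snapshot of a failed message (not a JMS or ActiveMQ type):
// its ID, string properties, and text body.
record FailedMessage(String messageId, Map<String, String> properties, String body) {}

// Hypothetical helper that writes a failed message as one log line and parses it back.
final class FailedMessageLog {
    private static final String PREFIX = "FAILED_MESSAGE ";
    private static final String NO_PROPERTIES = "-"; // '-' never appears in standard Base64

    private static String enc(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    private static String dec(String s) {
        return new String(Base64.getDecoder().decode(s), StandardCharsets.UTF_8);
    }

    // Format: PREFIX <id> <key:value,key:value or -> <body>, every value Base64-encoded,
    // so spaces, newlines, ':' and ',' inside the data cannot break the line.
    static String toLogLine(FailedMessage m) {
        StringBuilder props = new StringBuilder();
        for (Map.Entry<String, String> e : m.properties().entrySet()) {
            if (props.length() > 0) props.append(',');
            props.append(enc(e.getKey())).append(':').append(enc(e.getValue()));
        }
        String propField = props.length() == 0 ? NO_PROPERTIES : props.toString();
        return PREFIX + enc(m.messageId()) + ' ' + propField + ' ' + enc(m.body());
    }

    // Parses a line produced by toLogLine back into an equivalent FailedMessage.
    static FailedMessage fromLogLine(String line) {
        if (!line.startsWith(PREFIX)) {
            throw new IllegalArgumentException("not a failed-message line");
        }
        String[] fields = line.substring(PREFIX.length()).split(" ", -1);
        if (fields.length != 3) {
            throw new IllegalArgumentException("expected 3 fields, got " + fields.length);
        }
        Map<String, String> props = new LinkedHashMap<>();
        if (!fields[1].equals(NO_PROPERTIES)) {
            for (String pair : fields[1].split(",", -1)) {
                String[] kv = pair.split(":", -1);
                props.put(dec(kv[0]), dec(kv[1]));
            }
        }
        return new FailedMessage(dec(fields[0]), props, dec(fields[2]));
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("orderId", "42");
        FailedMessage original = new FailedMessage("ID:example-1", props, "line one\nline two = x");
        String line = toLogLine(original);
        System.out.println(line);
        System.out.println(fromLogLine(line).equals(original)); // prints true
    }
}
```

In the consumer's failure path you would write `toLogLine(...)` to the debug log and let the original message go. When it's time to test the fix, find the line in the log and feed `fromLogLine(...)` into whatever builds your test message.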
Given the long lead time I'd expect here (a message fails, a person investigates, a developer writes and tests code, and eventually the code gets deployed), it could be weeks before you'd be ready to reprocess one of these messages if you kept them around for that purpose. You don't want that. I'd just log the full body of the message (either to your usual log file or to a separate debugging log file), let the message itself disappear, and then figure out how to build one with similar content when it's time to test. You don't need it written to a database as long as you can rebuild a test message from the log (which just means choosing carefully how to serialize the message to the log), and using a log instead of a database eliminates a number of potential failure cases.

Tim

On Sun, Mar 1, 2015 at 6:36 PM, Rural Hunter <ruralhun...@gmail.com> wrote:

> Option 3 is the best fit for my use case. It's a system under development, so if a message fails I need someone to fix the code or another system, and then I need to reprocess those failed messages to verify the fix. I considered storing those messages in a database earlier, but the database-storing step could itself be one of the original failure causes, so it's still not a reliable option. For now, if there is no better choice, I think I will process those messages in a much more timely manner.
>
> 2015-03-02 0:33 GMT+08:00 Tim Bain <tb...@alumni.duke.edu>:
>
> > So what do you plan to do with those failed messages? I see three possibilities, but have no idea which of them is the one (or ones) that apply to your situation.
> >
> > 1. Are you going to reprocess them as-is because whatever failed is assumed to be some transient failure that won't happen the second time you try it? If so, why not just do that automatically a few minutes later so you can delete the message when you're done? Why wait days?
> > 2. Are you going to ignore the message, but debug what it was about that message that caused the failure so you can fix your code and it doesn't error out the next time? If so, log whatever you need (possibly the full message text, if that's what you need) to a file that you can go back to whenever you get around to it, and delete the message immediately.
> > 3. Are you going to reprocess the message, but only after a person manually fixes something that caused the failure the first time around? This is the hardest one, where you actually care about the message itself but you can't do anything about it until a person (who's only working certain hours on certain days) does something first. Here you really need to put the message into a database (and ActiveMQ is not a database) to be held until it's ready to be acted upon (e.g. by creating a message equivalent to the one that failed and resending it). You could argue that storing your original messages in a queue backed by a separate KahaDB meets this stored-in-a-database requirement, but databases allow random access (so you could reprocess any message first, then any other one second, in whatever order someone fixes whatever was wrong with each one) and queues do not, so I don't think it's a very good solution. Better to store the relevant content of the message in a real database, fix the problem, and then generate a new message from it that you inject into the real queue for processing as usual.
> >
> > Your goal in all cases should be for messages to get processed quickly (whether that means seconds, minutes, or maybe hours), whether that means "acted upon as intended", "discarded to the DLQ because I don't care about them if they fail", or "stored somewhere to be dealt with later because I actually care about the ones that fail".
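To make the random-access argument in option 3 concrete, here is a sketch under stated assumptions. An in-memory map stands in for a real database table, and a `Consumer<String>` stands in for a JMS producer that resends to the real queue. `PendingFailureStore` and its methods are hypothetical. Unlike a queue, entries can be fixed and resent in any order.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for a database table of failed messages awaiting a manual fix.
// A real system would use an actual database; the point here is keyed (random) access.
final class PendingFailureStore {
    // One parked failure: the content needed to rebuild the message, plus a "fixed" flag.
    private static final class Pending {
        final String body;
        boolean fixed;

        Pending(String body) {
            this.body = body;
        }
    }

    private final Map<String, Pending> rows = new LinkedHashMap<>();

    // Called when processing fails, so the original message can be acknowledged right away.
    void park(String messageId, String body) {
        rows.put(messageId, new Pending(body));
    }

    // Called once a person has fixed whatever caused this particular message to fail.
    boolean markFixed(String messageId) {
        Pending p = rows.get(messageId);
        if (p == null) return false;
        p.fixed = true;
        return true;
    }

    // Sends every fixed entry (regardless of where it sits among unfixed ones) and removes it.
    int resendFixed(Consumer<String> sendToQueue) {
        int sent = 0;
        Iterator<Pending> it = rows.values().iterator();
        while (it.hasNext()) {
            Pending p = it.next();
            if (p.fixed) {
                sendToQueue.accept(p.body);
                it.remove();
                sent++;
            }
        }
        return sent;
    }

    int pendingCount() {
        return rows.size();
    }

    public static void main(String[] args) {
        PendingFailureStore store = new PendingFailureStore();
        store.park("m1", "order 1");
        store.park("m2", "order 2");
        store.park("m3", "order 3");
        store.markFixed("m3"); // the third failure happens to be fixed first
        List<String> sent = new ArrayList<>();
        store.resendFixed(sent::add); // stand-in for a producer sending to the real queue
        System.out.println(sent + " sent, " + store.pendingCount() + " still pending");
        // prints [order 3] sent, 2 still pending
    }
}
```

In a real deployment the table would live in an actual database so it survives restarts, and nothing would need to sit on the broker while waiting for the fix.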
> >
> > Tim
> >
> > On Sun, Mar 1, 2015 at 8:23 AM, Rural Hunter <ruralhun...@gmail.com> wrote:
> >
> > > I know the best approach is to consume the messages from this queue promptly. This is the only queue whose failed messages I care about, so I could process them promptly. But what about other queues, where I don't care much about the failed messages and they go to the real dead-letter queues in the same way?
> > >
> > > 2015-03-01 0:37 GMT+08:00 Tim Bain <tb...@alumni.duke.edu>:
> > >
> > > > KahaDB can only delete a data file when every message in it can be deleted. If a single message is still needed, the whole file must be kept, and KahaDB doesn't perform any sort of compaction. And if the last message in a file that must be kept (because of some other message) has an ack in the next file, that next file must be kept too. This can theoretically repeat forever if the messages happen to be laid out just right in the data files, so a single old unprocessed message can theoretically prevent KahaDB from deleting any of its data files. There was a recently-fixed bug where the file with the ack was being improperly deleted, resulting in redelivery of the acked messages on broker restart; see https://issues.apache.org/jira/browse/AMQ-5542, which is fixed in 5.11.1 and 5.12.0. So the version you're running won't recognize the chain of files (if any) that need to be kept; with that fix, I'd expect you to hit your limit even faster.
> > > >
> > > > So your DLQ-ish messages are in fact keeping alive any data files in which they exist. If they all came in as a batch, that would be just one file, but since they're spread out over time, that's probably a decent number of files.
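The retention rule described above can be modeled in a few lines. This is a deliberately simplified model, not KahaDB's actual garbage-collection code. Each data file is reduced to lists of enqueued and acked message IDs. Rule 1 keeps any file holding an unacked message. Rule 2 is toggled by the hypothetical `followAckChain` flag and roughly approximates the AMQ-5542 behaviour: it also keeps any file holding an ack for a message stored in a kept file. `JournalRetentionModel` and `DataFile` are illustrative names. Records require Java 16+.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simplified, hypothetical model of the retention rules described above; NOT KahaDB's real code.
final class JournalRetentionModel {
    // One journal data file, reduced to the message IDs enqueued and acked in it.
    record DataFile(int id, List<String> enqueued, List<String> acked) {}

    // Returns the IDs of data files that cannot be deleted, in ascending order.
    static Set<Integer> filesToKeep(List<DataFile> files, boolean followAckChain) {
        Map<String, Integer> enqueuedIn = new HashMap<>(); // message ID -> file holding it
        Map<String, Integer> ackedIn = new HashMap<>();    // message ID -> file holding its ack
        for (DataFile f : files) {
            for (String m : f.enqueued()) enqueuedIn.put(m, f.id());
            for (String m : f.acked()) ackedIn.put(m, f.id());
        }
        Set<Integer> keep = new TreeSet<>();
        // Rule 1: a single unacked message keeps its whole file (no compaction).
        for (Map.Entry<String, Integer> e : enqueuedIn.entrySet()) {
            if (!ackedIn.containsKey(e.getKey())) keep.add(e.getValue());
        }
        if (!followAckChain) return keep;
        // Rule 2: acked messages in a kept file need their acks kept as well, otherwise
        // they would be redelivered after a restart; repeat until nothing new is added.
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, Integer> e : enqueuedIn.entrySet()) {
                Integer ackFile = ackedIn.get(e.getKey());
                if (ackFile != null && keep.contains(e.getValue()) && keep.add(ackFile)) {
                    changed = true;
                }
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<DataFile> journal = List.of(
                new DataFile(1, List.of("m1", "failed-1"), List.of()),
                new DataFile(2, List.of("m2"), List.of("m1")),
                new DataFile(3, List.of("m3"), List.of("m2")),
                new DataFile(4, List.of("m4"), List.of("m3", "m4")),
                new DataFile(5, List.of("m5"), List.of("m5")),
                new DataFile(6, List.of("failed-2"), List.of()));
        System.out.println(filesToKeep(journal, false)); // prints [1, 6]
        System.out.println(filesToKeep(journal, true));  // prints [1, 2, 3, 4, 6]
    }
}
```

With the chain rule off, only files 1 and 6 survive and the other numbers drop out of the sequence, much like the gaps in a db-xxxx.log listing. With it on, two failed messages pin five of the six files. Because a kept file is kept whole, the space pinned is roughly the number of kept files times the data-file size, however small the failed messages themselves are.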
> > > >
> > > > So you could do as Tim Robbins suggested and make a separate KahaDB store for these long-lived messages; that would solve this problem, but it's ultimately a workaround. Shrinking the size of each data file would help right now, but once you upgrade to 5.11.1 or 5.12.0 it could no longer guarantee that you wouldn't have to keep all the files, so I'd focus on other options.
> > > >
> > > > So the real question is, why are you keeping your DLQ-like messages for 5 days? (This is probably the point where Art will chime in with "ActiveMQ is not a database.") You should be doing something with those messages quickly, not keeping them around for ages. If the messages get consumed immediately, the KahaDB files won't stick around long, and your problem is solved. So figure out how to change your application logic so you don't rely on messages staying on the broker for days; anything else is just a workaround for this flaw in your application logic.
> > > >
> > > > Tim
> > > >
> > > > One more question: will the same thing happen if I switch to LevelDB?
> > > >
> > > > 2015-02-28 22:53 GMT+08:00 Rural Hunter <ruralhun...@gmail.com>:
> > > >
> > > > > I'm sorry, I made a mistake. My storage is KahaDB. We switched from LevelDB to KahaDB a while ago and I forgot about that.
> > > > > Thanks for the links. Now I understand what happened!
> > > > >
> > > > > 2015-02-28 19:03 GMT+08:00 Tim Robbins <tim.robb...@outlook.com>:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Two suggestions for you:
> > > > > >
> > > > > > 1. Try decreasing the logSize parameter for LevelDB. You'll have a greater number of smaller log files, and a greater chance of each log file being garbage-collected.
> > > > > > 2. With KahaDB, it's possible to configure multiple KahaDB stores and to put your dead-letter-type messages into a different store from everything else to reduce overhead:
> > > > > > http://blog.garytully.com/2011/11/activemq-multiple-kahadb-instances.html
> > > > > >
> > > > > > Unfortunately it doesn't appear that this applies to LevelDB yet!
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Tim
> > > > > >
> > > > > > > On 28 Feb 2015, at 7:27 pm, Rural Hunter <ruralhun...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > ActiveMQ version 5.10.2, storage: LevelDB.
> > > > > > >
> > > > > > > I have a queue that serves a similar function to a dead-letter queue. My application processes messages from another queue, and if processing fails, it puts the message into this queue. The messages are persistent and average several KB in size. My application processes many messages, but the failed-message count is very small, fewer than 100 a day. I noticed that after the application had been running for several days, my ActiveMQ storage became almost full. I configured the storage at 30 GB. I checked the normal queues and topics and there is no queue with a large message count. Most of them are empty and some have only a few messages. Only the failure-message queue I mentioned above has a few hundred messages (about 500), accumulated over several days.
> > > > > > >
> > > > > > > I have no idea what is taking so much storage.
> > > > > > > I checked the storage files and found many db-xxxx.log files with timestamps spread across almost the entire period. They are not consecutive, though; some of the db-xxxx.log files are missing. So the file list looks like this:
> > > > > > > db-1000.log
> > > > > > > db-1001.log
> > > > > > > db-1003.log
> > > > > > > db-1004.log
> > > > > > > db-1005.log
> > > > > > > db-1008.log
> > > > > > > ...
> > > > > > > I suspected the failed messages were in those db-xxxx.log files, so I tried clearing the failed-message queue. Right after that, the old db-xxxx.log files disappeared and storage usage went back to 2%. So it seems clear that the roughly 500 failed messages were taking up around 30 GB of storage. But how can that be? Those messages are very small, and their total size should be no more than a few MB.