Hello all,

Recently we found that all of our samples were being lost. After some investigation, we found the following:

1. We use the Prometheus agent to send all data to a Prometheus server via remote write.
2. The agent's sample-sending code is in storage\remote\queue_manager.go, in the function sendWriteRequestWithBackoff().
3. Inside that function, if attempt() (the function that makes the request to the Prometheus server) returns a recoverable error, the request is retried.
4. When is a recoverable error returned? One scenario is when the Prometheus server responds with a 5xx error.
5. I think not every 5xx error is recoverable, and there is no other way to exit the for loop in sendWriteRequestWithBackoff(). The agent kept retrying, but every time it received a 5xx from the server, so we lost all samples for hours until we restarted the agent.
So my question is: why is a 5xx from the Prometheus server considered recoverable? I also believe there should be a way to exit the loop, for example a maximum number of retries. As it stands, agent mode does not seem mature enough for production use.