On 2016-12-30 12:38:02 +0800 (+0800), Tom Fifield wrote: > This problem (legitimate users having all posts flatly rejected as > spam) is still happening. Any progress?
We're basically ready to retry the previous upgrade now that some issues have been identified/corrected by Marton. Change https://review.openstack.org/408657 to trigger it is waiting on https://review.openstack.org/416072 to grant him a shell account on the production server for improved reactivity if it fails again, so that he can collaborate with us more directly troubleshooting before we have to make the call to fix or revert. > I've been doing what I can without access to the server, but my > latest attempt - completely deleting an account so it could be > re-created by the affected user - was thwarted by a 500 error. Did > that appear in the logs? Given that I don't know the exact time you tried nor your IP address, and am unsure what that failure would look like in the logs aside from coming with a 500 error code this Apache access log entry 18 minutes before your E-mail's timestamp stands out: [30/Dec/2016:04:20:40 +0000] "GET /admin/askbot/post/25842/ HTTP/1.1" 500 917 "https://ask.openstack.org/admin/auth/user/2253/delete/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0" I don't see anything related around that timeframe in the Apache error log files, the Askbot application log, the Solr Jetty log... though there's this in dmesg which is suspiciously close (timestamps in dmesg aren't entirely reliable, so this could have happened 5 minutes earlier): [Fri Dec 30 04:25:25 2016] apache2 invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0 And indeed, there's a spike in swap utilization around that time which, given the five-minute granularity could certainly support the notion that a runaway process ate all available virtual memory on the system: <URL: http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=2546&rra_id=2&view_type=&graph_start=1483029900&graph_end=1483156284&graph_height=120&graph_width=500&title_font_size=10 > The RAM usage graph suggests we were doing okay with a target utilization of ~50% so something eating an additional 6GiB of memory in a matter of a few minutes would definitely count as anomalous: <URL: http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=2544&rra_id=2&view_type=&graph_start=1483029900&graph_end=1483156284&graph_height=120&graph_width=500&title_font_size=10 > Was the memory spike caused by that deletion? Or was the deletion error coincidental and caused by a memory event which just happened to be going on at the same time? I have insufficient knowledge of the system to be able to say either way. It's possible there are additional logs I don't know to look at which could tell us. -- Jeremy Stanley _______________________________________________ OpenStack-Infra mailing list OpenStack-Infra@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra