Hi Folks, In a 2-tier CDN production environment using ATS, we recently had a series of issues that I want to reach out to the community for help.
The symptoms usually shows up on edge caches as: 1) elevated CPU - noticeable, but not 'through the roof' 2) unusually high ingress bw, often order of magnitudes higher than egress. i.e. edge caches is pulling a lot of data from mids but not sending much to end clients. We often saw 3-4Gbps ingress vs. 500Mbps egress. The edge cache uses separate network interfaces - the ingress traffic is mainly from upstream interface. 3) huge number of TIME_WAIT sockets on edge cache, 70-80K, all of them destined to mid. This number stays at that level until we kill/restart ATS process. It looks like the edge is open/close sockets to mid, as indicated by the debug log. 4) debug log shows SM 702933367 stuck in the loop repeating the following, over and over and over. The regular transaction log, however, does not have an entry for the url this SM is handling. So it feels like the the stuck loop runs in the ‘background’, GET the same url, without caching or sending to client. ... [Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN [Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open connection to x.x.x.x 172.17.173.103:80 [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, NET_EVENT_OPEN] [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_http_server_open, NET_EVENT_OPEN] [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0 [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE] [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE] [Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE] [Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE] [Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_READ_READY] [Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY] [Sep 22 03:54:23.227] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60 [Sep 22 03:54:23.231] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE] [Sep 22 03:54:23.232] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE] [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HttpTransact::HandleResponse] [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response received [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HandleResponse] response_received_time: 1600746863 [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [is_response_valid] No errors in response [Sep 22 03:54:23.234] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response valid [Sep 22 03:54:23.235] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] [initialize_state_variables_from_response]Server is keep-alive. [Sep 22 03:54:23.236] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN [Sep 22 03:54:23.240] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open connection to x.x.x.x 172.17.173.103:80 [Sep 22 03:54:23.241] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, NET_EVENT_OPEN] [Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_http_server_open, NET_EVENT_OPEN] [Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0 [Sep 22 03:54:23.243] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE] [Sep 22 03:54:23.244] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE] [Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE] [Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE] [Sep 22 03:54:23.285] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_READ_READY] [Sep 22 03:54:23.286] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY] [Sep 22 03:54:23.288] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60 [Sep 22 03:54:23.289] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE] [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE] [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HttpTransact::HandleResponse] [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response received [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HandleResponse] response_received_time: 1600746863 [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [is_response_valid] No errors in response [Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response valid [Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] [initialize_state_variables_from_response]Server is keep-alive. … 5) the transaction log on the mid confirms there were request storms for the same url from a the edge, but all resulted in crc=ERR_CLIENT_ABORT, cfsc=INTR. Our initial investigation confirmed the this traffic storm situation usually follows a repeated series of parent proxy (mid tier cache) loss and recovery events - recovery is usually 10 seconds after initial mark down, and the loss/recover repeats 3 or 4 times in about a minute. The ATS is 7.1.5 based, process running on bare metal server with Centos 7.7. The only way to recover today is to restart ATS process. Obviously this is service impacting. Was this problem reported before? Any verdict on root cause? Debugging tip or pointers to resolution? Thanks, -Hongfei