Hi Folks,

In a 2-tier CDN production environment using ATS, we recently had a series of 
issues that I want to reach out to the community for help. 

The symptoms usually shows up  on edge caches as:
        1) elevated CPU - noticeable, but not 'through the roof'
        2) unusually high ingress bw, often order of magnitudes higher than 
egress. i.e. edge caches is pulling a lot of data from mids but not sending 
much to end clients. We often saw 3-4Gbps ingress vs. 500Mbps egress. The edge 
cache uses separate network interfaces - the ingress traffic is mainly from 
upstream interface. 
        3) huge number of TIME_WAIT sockets on edge cache, 70-80K, all of them 
destined to mid. This number stays at that level until we kill/restart ATS 
process. It looks like the edge is open/close sockets to mid, as indicated by 
the debug log. 
        4) debug log shows SM 702933367 stuck in the loop repeating the 
following, over and over and over. The regular transaction log, however, does 
not have an entry for the url this SM is handling.  So it feels like the the 
stuck loop runs in the ‘background’, GET the same url, without caching or 
sending to client. 
...
[Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State 
Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
[Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open 
connection to x.x.x.x 172.17.173.103:80
[Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[HttpSM::main_handler, NET_EVENT_OPEN]
[Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[&HttpSM::state_http_server_open, NET_EVENT_OPEN]
[Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling 
plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
[Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[&HttpSM::state_api_callback, HTTP_API_CONTINUE]
[Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[&HttpSM::state_api_callout, HTTP_API_CONTINUE]
[Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
[Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
[Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[HttpSM::main_handler, VC_EVENT_READ_READY]
[Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
[Sep 22 03:54:23.227] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling 
plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
[Sep 22 03:54:23.231] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[&HttpSM::state_api_callback, HTTP_API_CONTINUE]
[Sep 22 03:54:23.232] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[&HttpSM::state_api_callout, HTTP_API_CONTINUE]
[Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] 
[HttpTransact::HandleResponse]
[Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] 
[HttpTransact::HandleResponse] Response received
[Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] 
[HandleResponse] response_received_time: 1600746863
[Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] 
[is_response_valid] No errors in response
[Sep 22 03:54:23.234] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] 
[HttpTransact::HandleResponse] Response valid
[Sep 22 03:54:23.235] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] 
[initialize_state_variables_from_response]Server is keep-alive.
[Sep 22 03:54:23.236] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State 
Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
[Sep 22 03:54:23.240] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open 
connection to x.x.x.x 172.17.173.103:80
[Sep 22 03:54:23.241] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[HttpSM::main_handler, NET_EVENT_OPEN]
[Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[&HttpSM::state_http_server_open, NET_EVENT_OPEN]
[Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling 
plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
[Sep 22 03:54:23.243] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[&HttpSM::state_api_callback, HTTP_API_CONTINUE]
[Sep 22 03:54:23.244] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[&HttpSM::state_api_callout, HTTP_API_CONTINUE]
[Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
[Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
[Sep 22 03:54:23.285] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[HttpSM::main_handler, VC_EVENT_READ_READY]
[Sep 22 03:54:23.286] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
[Sep 22 03:54:23.288] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling 
plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
[Sep 22 03:54:23.289] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[&HttpSM::state_api_callback, HTTP_API_CONTINUE]
[Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] 
[&HttpSM::state_api_callout, HTTP_API_CONTINUE]
[Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] 
[HttpTransact::HandleResponse]
[Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] 
[HttpTransact::HandleResponse] Response received
[Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] 
[HandleResponse] response_received_time: 1600746863
[Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] 
[is_response_valid] No errors in response
[Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] 
[HttpTransact::HandleResponse] Response valid
[Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] 
[initialize_state_variables_from_response]Server is keep-alive.
…
        5) the transaction log on the mid confirms there were request storms 
for the same url from a the edge, but all resulted in crc=ERR_CLIENT_ABORT, 
cfsc=INTR. 


Our initial investigation confirmed the this traffic storm situation usually 
follows a repeated series of parent proxy (mid tier cache) loss and recovery 
events - recovery is usually 10 seconds after initial mark down, and the 
loss/recover repeats 3 or 4 times in about a minute. 

The ATS is 7.1.5 based, process running on bare metal server with Centos 7.7. 
The only way to recover today is to restart ATS process. Obviously this is 
service impacting.

Was this problem reported before? Any verdict on root cause? Debugging tip or 
pointers to resolution?

Thanks,
-Hongfei 

Reply via email to