Thanks Evan. In our case, the edge's settings are:

  origin_max_connections: 0
  net.connection_throttle: 500000
  parent_proxy.total_connect_attempts: 6
  parent_proxy.per_parent_connect_attempts: 1
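For reference, this is how I have been double-checking the effective values on a running edge box. A minimal sketch, assuming the traffic_ctl binary from the ATS 7.x install is on the PATH; 'config match' takes a name pattern, so the full proxy.config.* record names don't have to be spelled out:

    # Print every configuration record whose name matches the pattern,
    # together with its current value on the running traffic_server.
    traffic_ctl config match origin_max_connections
    traffic_ctl config match throttle
    traffic_ctl config match parent_proxy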
Do you think we are subject to the looping condition this commit addresses?

-Hongfei

> On Sep 30, 2020, at 2:37 PM, Evan Zelkowitz <e...@apache.org> wrote:
>
> This sounds similar to issues we have had in the past. I think this
> was fixed in 7.x with this commit:
> https://github.com/apache/trafficserver/commit/1da4fbcd5fd32ff86888ed401e0c297b22e00c52
> There have been a few other commits past that one for other tight-loop
> scenarios when dealing with the connection limit in ATS as well, so I'd
> recommend moving to at least 7.1.8.
>
> On Wed, Sep 30, 2020 at 12:07 PM Hongfei Zhang <hongfei...@gmail.com> wrote:
>>
>> Hi Folks,
>>
>> In a 2-tier CDN production environment using ATS, we recently had a series
>> of issues for which I want to reach out to the community for help.
>>
>> The symptoms usually show up on edge caches as:
>>
>> 1) Elevated CPU - noticeable, but not 'through the roof'.
>>
>> 2) Unusually high ingress bandwidth, often orders of magnitude higher than
>> egress, i.e. the edge caches are pulling a lot of data from mids but not
>> sending much to end clients. We often saw 3-4 Gbps ingress vs. 500 Mbps
>> egress. The edge cache uses separate network interfaces - the ingress
>> traffic is mainly from the upstream interface.
>>
>> 3) A huge number of TIME_WAIT sockets on the edge cache, 70-80K, all of
>> them destined to the mid. This number stays at that level until we
>> kill/restart the ATS process. It looks like the edge is opening/closing
>> sockets to the mid, as indicated by the debug log. (A quick shell check
>> for this is sketched at the bottom of this message.)
>>
>> 4) The debug log shows SM 702933367 stuck in a loop, repeating the
>> following over and over. The regular transaction log, however, does not
>> have an entry for the URL this SM is handling. So it feels like the stuck
>> loop runs in the 'background', GETting the same URL, without caching or
>> sending anything to a client.
>>
>> ...
>> [Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
>> [Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open connection to x.x.x.x 172.17.173.103:80
>> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, NET_EVENT_OPEN]
>> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_http_server_open, NET_EVENT_OPEN]
>> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
>> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
>> [Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
>> [Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_READ_READY]
>> [Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
>> [Sep 22 03:54:23.227] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
>> [Sep 22 03:54:23.231] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.232] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HttpTransact::HandleResponse]
>> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response received
>> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HandleResponse] response_received_time: 1600746863
>> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [is_response_valid] No errors in response
>> [Sep 22 03:54:23.234] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response valid
>> [Sep 22 03:54:23.235] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] [initialize_state_variables_from_response]Server is keep-alive.
>> [Sep 22 03:54:23.236] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
>> [Sep 22 03:54:23.240] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open connection to x.x.x.x 172.17.173.103:80
>> [Sep 22 03:54:23.241] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, NET_EVENT_OPEN]
>> [Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_http_server_open, NET_EVENT_OPEN]
>> [Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
>> [Sep 22 03:54:23.243] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.244] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
>> [Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
>> [Sep 22 03:54:23.285] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_READ_READY]
>> [Sep 22 03:54:23.286] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
>> [Sep 22 03:54:23.288] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
>> [Sep 22 03:54:23.289] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
>> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HttpTransact::HandleResponse]
>> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response received
>> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HandleResponse] response_received_time: 1600746863
>> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [is_response_valid] No errors in response
>> [Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response valid
>> [Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] [initialize_state_variables_from_response]Server is keep-alive.
>> ...
>>
>> 5) The transaction log on the mid confirms there were request storms for
>> the same URL from the edge, but all of them resulted in
>> crc=ERR_CLIENT_ABORT, cfsc=INTR.
>>
>> Our initial investigation confirmed that this traffic storm situation
>> usually follows a repeated series of parent proxy (mid-tier cache) loss and
>> recovery events - recovery is usually 10 seconds after the initial
>> mark-down, and the loss/recovery cycle repeats 3 or 4 times in about a
>> minute.
>>
>> The ATS is 7.1.5 based, with the process running on a bare-metal server
>> with CentOS 7.7. The only way to recover today is to restart the ATS
>> process. Obviously this is service impacting.
>>
>> Was this problem reported before? Any verdict on the root cause? Any
>> debugging tips or pointers to a resolution?
>>
>> Thanks,
>> -Hongfei
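Regarding the TIME_WAIT build-up in symptom 3 above, this is the kind of shell check we have been running on the edge to watch the count. A minimal sketch, with 172.17.173.103 standing in for the mid address from the debug log excerpt; substitute your own parent proxy IPs:

    # Count TIME-WAIT sockets whose peer is the mid-tier cache; tail drops
    # the ss header line so wc counts only sockets.
    ss -tan state time-wait 'dst 172.17.173.103' | tail -n +2 | wc -l

    # Same idea with netstat, if a recent ss is not available
    # ($5 is the foreign address, $6 is the socket state):
    netstat -ant | awk '$6 == "TIME_WAIT" && $5 ~ /^172\.17\.173\.103:/' | wc -l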