Yes, the first thing that helped us with connection issues was setting origin_max_connections to a reasonable value, either per cache or per remap depending on traffic, but we would always keep a per-cache upper limit; a rough example of both is below. However, in a 2-tier setup, if you only do this at one tier you just back the connections up onto the other tier, so it has to be done at both tiers. Even with that you can still end up with connections hanging around in CLOSE_WAIT, which is where the patch comes in, because it addresses the tight spinning loop you were seeing. I've also put a couple of quick checks at the very bottom, under your quoted message.
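
Roughly what those two settings look like - the numbers and hostnames here are just placeholders, not a recommendation, so tune the limits to your own traffic:

  # records.config: global default, caps concurrent connections per origin/parent (0 = unlimited)
  CONFIG proxy.config.http.origin_max_connections INT 2000

  # remap.config: a tighter cap for one rule, set through the conf_remap plugin
  map http://cdn.example.com/ http://origin.example.com/ @plugin=conf_remap.so @pparam=proxy.config.http.origin_max_connections=200

The same records.config ceiling (with its own value) goes on the mids as well, so the edge doesn't just push the pileup upstream.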
On Wed, Sep 30, 2020 at 1:43 PM Hongfei Zhang <hongfei...@gmail.com> wrote:
>
> Thanks Evan. In our case, the edge’s settings are:
> origin_max_connections: 0
> net.connection_throttle: 500000
> parent_proxy.total_connect_attempts: 6
> parent_proxy.per_parent_connect_attempts: 1
>
> Do you think we are subject to the looping condition this commit addresses?
>
> -Hongfei
>
> > On Sep 30, 2020, at 2:37 PM, Evan Zelkowitz <e...@apache.org> wrote:
> >
> > This sounds similar to issues we have had in the past. I think this was fixed
> > in 7.x with this commit:
> > https://github.com/apache/trafficserver/commit/1da4fbcd5fd32ff86888ed401e0c297b22e00c52
> > There have been a few other commits past that one for other tight-loop
> > scenarios when dealing with the connection limit in ATS as well, so I'd
> > recommend moving to at least 7.1.8.
> >
> > On Wed, Sep 30, 2020 at 12:07 PM Hongfei Zhang <hongfei...@gmail.com> wrote:
> >>
> >> Hi Folks,
> >>
> >> In a 2-tier CDN production environment using ATS, we recently had a series
> >> of issues that I want to reach out to the community for help with.
> >>
> >> The symptoms usually show up on edge caches as:
> >> 1) Elevated CPU - noticeable, but not 'through the roof'.
> >> 2) Unusually high ingress bandwidth, often orders of magnitude higher than
> >> egress, i.e. the edge cache is pulling a lot of data from the mids but not
> >> sending much to end clients. We often saw 3-4 Gbps ingress vs. 500 Mbps
> >> egress. The edge cache uses separate network interfaces - the ingress
> >> traffic is mainly on the upstream interface.
> >> 3) A huge number of TIME_WAIT sockets on the edge cache, 70-80K, all of
> >> them destined to the mid. This number stays at that level until we
> >> kill/restart the ATS process. It looks like the edge is opening/closing
> >> sockets to the mid, as indicated by the debug log.
> >> 4) The debug log shows SM 702933367 stuck in a loop, repeating the
> >> following over and over. The regular transaction log, however, does not
> >> have an entry for the URL this SM is handling. So it feels like the stuck
> >> loop runs in the ‘background’, GETting the same URL, without caching
> >> anything or sending anything to a client.
> >> ...
> >> [Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
> >> [Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open connection to x.x.x.x 172.17.173.103:80
> >> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, NET_EVENT_OPEN]
> >> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_http_server_open, NET_EVENT_OPEN]
> >> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
> >> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
> >> [Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
> >> [Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_READ_READY]
> >> [Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
> >> [Sep 22 03:54:23.227] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
> >> [Sep 22 03:54:23.231] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.232] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HttpTransact::HandleResponse]
> >> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response received
> >> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HandleResponse] response_received_time: 1600746863
> >> [Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [is_response_valid] No errors in response
> >> [Sep 22 03:54:23.234] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response valid
> >> [Sep 22 03:54:23.235] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] [initialize_state_variables_from_response]Server is keep-alive.
> >> [Sep 22 03:54:23.236] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
> >> [Sep 22 03:54:23.240] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open connection to x.x.x.x 172.17.173.103:80
> >> [Sep 22 03:54:23.241] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, NET_EVENT_OPEN]
> >> [Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_http_server_open, NET_EVENT_OPEN]
> >> [Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
> >> [Sep 22 03:54:23.243] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.244] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
> >> [Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
> >> [Sep 22 03:54:23.285] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [HttpSM::main_handler, VC_EVENT_READ_READY]
> >> [Sep 22 03:54:23.286] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
> >> [Sep 22 03:54:23.288] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
> >> [Sep 22 03:54:23.289] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callback, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] [&HttpSM::state_api_callout, HTTP_API_CONTINUE]
> >> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HttpTransact::HandleResponse]
> >> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response received
> >> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [HandleResponse] response_received_time: 1600746863
> >> [Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367] [is_response_valid] No errors in response
> >> [Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367] [HttpTransact::HandleResponse] Response valid
> >> [Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367] [initialize_state_variables_from_response]Server is keep-alive.
> >> …
> >> 5) The transaction log on the mid confirms there were request storms for
> >> the same URL from the edge, but all of them resulted in
> >> crc=ERR_CLIENT_ABORT, cfsc=INTR.
> >>
> >> Our initial investigation confirmed that this traffic storm situation
> >> usually follows a repeated series of parent proxy (mid-tier cache) loss
> >> and recovery events - recovery usually comes 10 seconds after the initial
> >> mark-down, and the loss/recovery cycle repeats 3 or 4 times in about a minute.
> >>
> >> Our ATS is 7.1.5-based, with the process running on a bare-metal server
> >> with CentOS 7.7. The only way to recover today is to restart the ATS
> >> process, which is obviously service impacting.
> >>
> >> Has this problem been reported before? Any verdict on the root cause?
> >> Any debugging tips or pointers to a resolution?
> >>
> >> Thanks,
> >> -Hongfei
>
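
As mentioned at the top, here are a couple of quick checks that may help confirm whether you're in this looping state. This is only a sketch - the metric and record names below are the standard 7.x ones, but verify them against your build, and the mid IP is taken from your debug log:

  # count TIME_WAIT sockets from the edge toward the mid (your symptom 3)
  ss -tan state time-wait | grep -c '172.17.173.103:80'

  # outbound server connections as ATS currently counts them
  traffic_ctl metric get proxy.process.http.current_server_connections

  # the connection cap the running process actually has in effect
  traffic_ctl config get proxy.config.http.origin_max_connections

With origin_max_connections left at 0 there is no cap at all, so once the parent flaps and retries pile up, nothing stops the edge from churning connections to the mid - which is the situation the per-tier caps and the 7.1.8-era fixes are meant to contain.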