Thanks Fei. There weren’t SSL/TLS sessions in our environment, but I do feel some memory is being held by ‘dormant’ sessions. The total amount of memory held by the freelists (44 GB) was, however, surprisingly high. The majority of that (99%) is allocated through and held by ioBufAllocator. I am wondering if there is any way to limit the size of these freelists; I am also curious what caused ‘Allocated’ to keep going up, and why ‘In-Use’ did not drop to zero after user traffic stopped (and all of the keep-alive sessions timed out).
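As a side note, my working assumption on the slot sizes (from a quick read of the IOBuffer code, so please correct me if I have this wrong) is that each ioBufAllocator index hands out blocks of 128 bytes shifted left by the index, roughly:

    block_size(i) = 128 << i        (assumed, from DEFAULT_BUFFER_BASE_SIZE)
    ioBufAllocator[5]   ->   4 KB
    ioBufAllocator[9]   ->  64 KB
    ioBufAllocator[12]  -> 512 KB
    ioBufAllocator[14]  ->   2 MB

If that mapping is right, then with default_buffer_water_mark at 15,000,000 each upstream read buffer can hold on the order of 15 MB spread across blocks of whatever size index the buffer was created with, which might explain why the larger slots dominate our dump.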
I am also puzzled that the memory/RamCacheLRUEntry line shows only 5.2M, whereas traffic_top shows about 32 GB of RAM cache used. (The allocator settings I plan to experiment with are in a P.S. after the quoted message below.)

Thanks,
-Hongfei

> On Dec 17, 2020, at 3:32 PM, Fei Deng <duke8...@apache.org> wrote:
>
> Not saying this is the exact cause, but we've seen similar behavior previously.
> The reason for our issue was that the session cache size was set too big
> compared to the RAM size; since sessions stored in the cache are only removed
> when the cache is full, inserting new sessions caused it to trigger
> *removeOldestSession*. You might want to check your configuration related to
> this feature: *proxy.config.ssl.session_cache.size*.
>
> On Thu, Dec 17, 2020 at 1:52 PM Hongfei Zhang <hongfei...@gmail.com> wrote:
>
>> Hi Folks,
>>
>> Based on the information provided at
>> https://docs.trafficserver.apache.org/en/8.1.x/admin-guide/performance/index.en.html#memory-allocation
>> and with a fixed ram_cache.size setting (32 GB), we expected memory usage to
>> plateau after a couple of days of usage. That is not, however, what we saw in
>> multiple production environments. Memory usage increases steadily over time,
>> albeit at a slow pace once the system’s memory usage reaches 80-85% (there
>> aren’t many other processes running on the system), until the ATS process is
>> killed by the kernel (OOM kill) or by human intervention (server restart).
>> On a system with 192 GB of RAM (32 GB used for a RAM disk, and ATS configured
>> to use up to 32 GB of RAM cache), with streaming throughput peaking at
>> 10 Gbps, ATS has to be killed/restarted about every 2 weeks. At peak hours
>> there are about 5k-6k client connections and fewer than 1k upstream
>> connections (to mid-tier caches).
>>
>> We did some analysis on the freelist dump (kill -USR1 pid) output (an example
>> is attached) and found that the allocations in the ioBufAllocator[0-14] slots
>> appeared to be the main contributor to the total, and also the likely source
>> of the increase over time.
>>
>> In terms of configuration and plugin usage, in addition to setting ram_cache
>> to 32 GB, we also changed proxy.config.http.default_buffer_water_mark INT
>> 15000000 (from the default 64k) to allow an entire video segment to be
>> buffered on the upstream connection, avoiding a client-starvation issue when
>> the first client comes from a slow-draining link, and
>> proxy.config.cache.target_fragment_size INT 4096 to allow upstream chunked
>> responses to be written into cache storage promptly. There are no connection
>> limits (the number of connections appears to always be in the normal range).
>> The inactivity timeout values are fairly low (<120 secs). The only plugin we
>> use is header_rewrite.so. No HTTPS, no HTTP/2.
>>
>> I would appreciate it if someone could shed some light on how to track this
>> down further, and any practical tips for short-term mitigation. In particular:
>> 1. Inside HttpSM, which states require allocating/re-using ioBufs? Is there a
>> way to put a ceiling on each slot or on the total allocation?
>> 2. Is the ioBufAllocation ceiling a function of total connections, in which
>> case I should set a connection limit?
>> 3. memory/RamCacheLRUEntry shows 5.2M; how is this related to the actual
>> ram_cache usage reported by traffic_top (32 GB used)?
>> 4. At the time of the freelist dump, the ATS process size was 78 GB, the
>> freelist total showed about 44 GB, and 32 GB of ram_cache was in use (per
>> traffic_top).
>> Assuming these two numbers do not overlap, and knowing that the in-memory
>> (disk) directory entry cache takes at least 10 GB, these numbers more than
>> add up: 44+32+10 > 78. What am I missing?
>>
>> Thanks,
>> -Hongfei
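P.S. For anyone following along, the knobs I am planning to experiment with as a ceiling on the per-thread freelists are the allocator settings below, assuming they exist under our 8.1.x build (I still need to verify the exact names against the records.config documentation):

    CONFIG proxy.config.allocator.thread_freelist_size INT 512
    CONFIG proxy.config.allocator.thread_freelist_low_watermark INT 32

The other option on our list is rebuilding ATS against jemalloc so that freed buffers have a better chance of being returned to the OS, rather than sitting in the freelists indefinitely.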