Thanks Fei.  There weren't any SSL/TLS sessions in our environment, but I do
suspect some memory is being held by 'dormant' sessions. The total amount of
memory held by the freelists (44 GB) was, however, surprisingly high. The
majority of that (99%) is allocated through and held by ioBufAllocator. I am
wondering if there is any way to limit the size of these freelists. I am also
curious what causes 'Allocated' to keep going up, and why 'In-Use' does not
drop to zero after user traffic stops (and all of the keep-alive sessions
time out).
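
In case it is useful to anyone else, below is a rough sketch of the script I
used to total the per-slot numbers from the freelist dump. It assumes each
ioBufAllocator line in traffic.out looks roughly like
"<allocated> | <in-use> | <type size> | memory/ioBufAllocator[N]" with the
first two columns in bytes, so the parsing may need adjusting for your build:

    #!/usr/bin/env python3
    # Sketch: total the 'allocated' and 'in-use' columns of the
    # ioBufAllocator[*] lines from a freelist dump (kill -USR1 <pid>).
    # Assumption: the first two columns are byte counts; adjust the
    # regex if your dump format differs.
    # Usage: python3 this_script.py < traffic.out
    import re
    import sys

    PATTERN = re.compile(
        r"^\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*memory/ioBufAllocator\[(\d+)\]"
    )

    total_alloc = total_inuse = 0
    for line in sys.stdin:
        m = PATTERN.match(line)
        if not m:
            continue
        allocated, in_use, type_size, slot = (int(g) for g in m.groups())
        total_alloc += allocated
        total_inuse += in_use
        print(f"slot {slot:2d}: allocated={allocated / 2**30:7.2f} GB  "
              f"in-use={in_use / 2**30:7.2f} GB")

    print(f"total   : allocated={total_alloc / 2**30:7.2f} GB  "
          f"in-use={total_inuse / 2**30:7.2f} GB")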

I am also puzzled that the memory/RamCacheLRUEntry line shows only 5.2M,
while traffic_top shows about 32 GB of RAM cache in use.


Thanks,
-Hongfei

> On Dec 17, 2020, at 3:32 PM, Fei Deng <duke8...@apache.org> wrote:
> 
> Not saying this is the exact cause, but we've seen similar behavior
> previously. The cause of our issue was that the session cache size was set
> too large relative to the RAM size; sessions stored in the cache are only
> removed once the cache is full, when inserting a new session triggers
> *removeOldestSession*. You might want to check your configuration for this
> feature, *proxy.config.ssl.session_cache.size*.
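> 
> (For reference, the current value can be checked with traffic_ctl, for
> example:
> 
>   traffic_ctl config get proxy.config.ssl.session_cache.size
> 
> and compared against what your RAM budget allows.)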
> 
> On Thu, Dec 17, 2020 at 1:52 PM Hongfei Zhang <hongfei...@gmail.com> wrote:
> 
>> Hi Folks,
>> 
>> Based on the information at
>> https://docs.trafficserver.apache.org/en/8.1.x/admin-guide/performance/index.en.html#memory-allocation
>> and with a fixed ram_cache.size setting (32 GB), we expected memory usage
>> to plateau after a couple of days of use. That is not, however, what we
>> saw in multiple production environments. Memory usage increases steadily
>> over time, albeit at a slow pace once the system's memory usage reaches
>> 80-85% (there aren't many other processes running on the system), until
>> the ATS process is killed by the kernel (OOM kill) or by human
>> intervention (server restart). On a system with 192 GB of RAM (32 GB used
>> for a RAM disk, and ATS configured to use up to 32 GB of RAM cache),
>> peaking at a streaming throughput of 10 Gbps, ATS has to be
>> killed/restarted about every 2 weeks. At peak hours, there are about
>> 5k-6k client connections and fewer than 1k upstream connections (to
>> mid-tier caches).
>> 
>> We did some analysis of the freelist dump (kill -USR1 <pid>) output (an
>> example is attached) and found that the 'allocated' values in the
>> ioBufAllocator[0-14] slots appear to be the main contributor to the
>> total, and are also the likely source of the growth over time.
>> 
>> In terms of configuration and plugin usage, in addition to setting
>> ram_cache.size to 32 GB, we also changed
>> proxy.config.http.default_buffer_water_mark to INT 15000000 (from the
>> default of 64 KB) to allow an entire video segment to be buffered on the
>> upstream connection, which avoids a client-starvation issue when the
>> first client arrives over a slow-draining link, and
>> proxy.config.cache.target_fragment_size to INT 4096 so that upstream
>> chunked responses are written into cache storage promptly (see the
>> records.config sketch below). There are no connection limits (the number
>> of connections always appeared to be in the normal range). The inactivity
>> timeout values are fairly low (<120 secs). The only plugin we use is
>> header_rewrite.so. No HTTPS, no HTTP/2.
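>> 
>> For clarity, the relevant records.config entries are roughly the
>> following (32 GB expressed in bytes; values as described above):
>> 
>>   CONFIG proxy.config.cache.ram_cache.size INT 34359738368
>>   CONFIG proxy.config.http.default_buffer_water_mark INT 15000000
>>   CONFIG proxy.config.cache.target_fragment_size INT 4096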
>> 
>> I would appreciate it if someone could shed some light on how to track
>> this down further, along with any practical tips for short-term
>> mitigation. In particular:
>> 1. Inside HttpSM, which states allocate or re-use an ioBuf? Is there a
>> way to put a ceiling on each slot, or on the total allocation?
>> 2. Is the ioBufAllocator ceiling a function of the total number of
>> connections, in which case should I set a connection limit?
>> 3. memory/RamCacheLRUEntry shows 5.2M; how is this related to the actual
>> ram_cache usage reported by traffic_top (32 GB used)?
>> 4. At the time of the freelist dump, the ATS process size was 78 GB, the
>> freelist total was about 44 GB, and 32 GB of ram_cache was in use (per
>> traffic_top). Assuming those two numbers do not overlap, and given that
>> the in-memory (disk) directory entry cache takes at least another 10 GB,
>> these numbers do not add up: 44 + 32 + 10 = 86 GB > 78 GB. What am I
>> missing?
>> 
>> 
>> Thanks,
>> -Hongfei
>> 
