Last week a small group gathered to talk about various technical issues in Traffic Server, with a strong focus on "upstream selection", which is currently handled by various mechanisms such as parent selection and CARP. These are the notes from that meeting. As usual, nothing was actually decided, but work is starting on some of the discussed projects and that work will be presented independently once it is more mature. At some point I will put this on the Wiki, once I remember my login credentials.
Next Gen Next Hop Mini-Summit
=============================

Use Cases
---------

Use Case: An embedded CDN (where a specialized user agent does the access). When a user agent does a host resolution it gets the address of a traffic router. The traffic router accepts the initial user agent request and responds with a 302 to redirect it to an edge (layer 1) proxy. An additional feature would be the ability to use a "replication count" here and have the traffic router pass back an ordered list of proxies. Part of the reason for this is the use of a user hosted cache which cannot effectively be monitored; the initial redirect must therefore contain, at a minimum, a failover from that cache.

Use Case: A CDN with three layers.

Layer 1
   This does primarily layer 7 routing. Groups of machines share an IP address and anycast is used to pick one of the machines. The ingress peer then forwards the request to layer 2. Some caching is done in L1 but it's unimportant.

Layer 2
   This is the "fast tier" which is only SSD storage. It uses a cache promotion plugin to control writes to the SSDs.

Layer 3
   This is the "slow tier" using rotational storage. While slower, it is also much larger (~500TB per host).

Separately an "origin" exists which is two layers of caching over the "real" origins. The latter is currently S3 storage, which can be highly variable in response times. The pseudo-origin protects the other CDNs from this variability.

Overall this architecture enables a real hybrid CDN. Different end user CDNs can be deployed that draw content from the pseudo-origin (e.g. an in-house CDN vs. Akamai). Conversely, different storage origins can be used which are unified by the pseudo-origin.

General Topics
==============

Features
--------

RAM only cache
   Caching without any backing persistent storage.

Replace Cache Inspector with a command line API
   It was agreed that the web interface can be removed if a host local interface is provided. Leif wants this to be first a C based API which is a cleaned up / improved version of what is already present, then a command line tool (e.g. traffic_ctl or equivalent). If a web interface is needed, that should be separate (proprietary, or something like Traffic Control).

Memory management
   Replace the current custom allocators with jemalloc. Phil notes jemalloc has an API which provides a lot of control over how memory is allocated, which could replace the functionality Traffic Server currently gets from our custom memory systems.

Introspection
   Leif wants much better introspection and access to internal state. An example would be upstream health checks: such a plugin or external process should be able to "push" upstream status to Traffic Server.

   * Upstream status
   * Upstream selection
   * Cacheability

Process unification
   Phil wants to unify traffic_cop, traffic_manager, and traffic_server. Just have one process and depend on more mature OS hosted process management.

Remap/Parent/Override unification
   Phil says we need to do all of this at once. The client request is used to select a configuration object which specifies all of these at one time.

Issues
------

Sudheer's Junk
   The write fail action is not working quite right.

Serialization of non-cacheable content requests
   Same URL but distinct content, which is therefore marked as non-cacheable by the upstream. If UA2 has to wait for UA1 to come back but can't use the UA1 response, there can be very long delays if these get stacked.

Thundering herd and connection collapsing
   This should be handled by POC.

CLFUS broken
   Gets bad hit rates after a "while".

Management API should be bidirectional
   Enable the traffic_server process to pass information back to command line tools. Note that if the processes are unified then making the management API bidirectional becomes trivial.

   * Plugin responses.
   * Cache inspection API accessible from the command line.

Upstream health status
   How is this tracked? If there are active health checks, what is their scope? It is an open issue how the list is maintained and where the data lives, and even whether the upstream status is kept per upstream pod or globally (via HostDB). A hypothetical sketch of the "push" idea follows this list.
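As a concrete illustration of the introspection and upstream health items above, here is a purely hypothetical sketch (not an existing Traffic Server API) of the "push" idea: a health-check plugin or external process writes upstream status into a table that upstream selection later consults. The names ``UpstreamStatusTable``, ``push``, and ``get`` are illustrative assumptions.

.. code-block:: cpp

   // Purely hypothetical sketch -- not an existing Traffic Server API.
   // An external health checker (plugin or separate process) pushes
   // upstream status into a table that upstream selection can consult.
   #include <mutex>
   #include <string>
   #include <unordered_map>

   enum class UpstreamStatus { Up, Down, Degraded };

   class UpstreamStatusTable {
   public:
     // Called by the health-check side to "push" new status for an upstream.
     void push(const std::string &upstream, UpstreamStatus status) {
       std::lock_guard<std::mutex> lock(mutex_);
       status_[upstream] = status;
     }

     // Consulted by upstream selection before choosing a target.
     // Unknown upstreams are assumed to be up.
     UpstreamStatus get(const std::string &upstream) const {
       std::lock_guard<std::mutex> lock(mutex_);
       auto spot = status_.find(upstream);
       return spot == status_.end() ? UpstreamStatus::Up : spot->second;
     }

   private:
     mutable std::mutex mutex_;
     std::unordered_map<std::string, UpstreamStatus> status_;
   };

The open questions in the notes are exactly the parts this sketch glosses over: whether such a table is per upstream pod or global, and where the authoritative copy lives.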
General Next Hop Selection Consolidation Issues
------------------------------------------------

* If TheQuartz is moved into the core it may be desirable to also move some features of cache promotion into the core as well. The question is how much of a problem hot objects are in the general case and whether dealing with them (as cache promotion does) is necessary.
* Leif suggests that the parent selection configuration can be simplified even further by unifying "ring" and "pod", making the selection criteria a property rather than fundamental.
* Look at having a callback during upstream selection. It should have access to the selection object for the transaction and be able either to completely override the selection or to "run" the existing selection and then tweak it. A secondary point here is moving all upstream selection to plugins and providing default plugins that perform the current upstream selection functionality.

Tasks
-----

#. Self detection (is a node in an upstream selection configuration this host?)
#. Active health checks of some sort.

   * Standard plugin?
   * Need an API for the plugin to access the upstream node status.

#. API for upstream selection.

   * Want to minimize HttpSM interaction. HttpSM should only know about "select next upstream target".
   * A full API will require the ability to examine / manipulate upstream selection objects. This may push the overall target out.
   * IP generator interface to HttpSM?

#. Lua config.

Dreams
------

Remap / upstream / Lua consolidation
   Start by updating parent.config to upstream.config. This starts as a Lua configuration and contains the skeleton of being able to specify a full configuration override. In the long term this replaces remap.config and parent.config, and potentially all of the other configs in the really long term. This also implies that we have a better namespace setup so that any configuration value is globally unambiguous.

   A key question is how the lookup is done to find the correct element in upstream.config based on the user agent request. An element would have a member which describes the matching criteria for that element, selecting defined properties of the request and the values against which to check them. Leif of course wants to be able to specify Lua scripts to "do stuff" instead of having explicit overrides. Leif says we all need to write example configurations in this style and iteratively converge on an acceptable format.

   Later conversations turned on the idea of having only an "upstream selection" set of API callbacks, in the same manner as the current remap plugins work. In this case the current functionality would be placed in a plugin distributed with the source code. All of the parent / upstream selection logic in the core would be removed and put into the plugin. This would provide the basics along with the ability to extend the plugin or replace it entirely.

   One function that needs to remain in the core, because every plugin will use it, is the self identification mechanism. One possibility is to have the FQDN resolution return a "self" marker. Alternatively the plugin can return self and the core can treat that as a "failure" and immediately ask for the next upstream target.
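On the self detection point, here is a minimal sketch (not taken from Traffic Server) of how a host could decide whether a configured upstream FQDN refers to itself, by comparing the name's resolved addresses against the local interface addresses. The function name ``is_self`` is an illustrative assumption.

.. code-block:: cpp

   // A minimal sketch, not taken from Traffic Server: decide whether a
   // configured upstream FQDN refers to the host we are running on by
   // comparing its resolved addresses against the local interface addresses.
   #include <cstring>
   #include <string>
   #include <ifaddrs.h>
   #include <netdb.h>
   #include <netinet/in.h>
   #include <sys/socket.h>

   static bool same_address(const sockaddr *a, const sockaddr *b) {
     if (a->sa_family != b->sa_family) {
       return false;
     }
     if (a->sa_family == AF_INET) {
       auto *a4 = reinterpret_cast<const sockaddr_in *>(a);
       auto *b4 = reinterpret_cast<const sockaddr_in *>(b);
       return a4->sin_addr.s_addr == b4->sin_addr.s_addr;
     }
     if (a->sa_family == AF_INET6) {
       auto *a6 = reinterpret_cast<const sockaddr_in6 *>(a);
       auto *b6 = reinterpret_cast<const sockaddr_in6 *>(b);
       return std::memcmp(&a6->sin6_addr, &b6->sin6_addr, sizeof(in6_addr)) == 0;
     }
     return false;
   }

   // Returns true if the FQDN resolves to an address bound on this host.
   bool is_self(const std::string &fqdn) {
     addrinfo hints{};
     hints.ai_family   = AF_UNSPEC;
     hints.ai_socktype = SOCK_STREAM;
     addrinfo *resolved = nullptr;
     if (getaddrinfo(fqdn.c_str(), nullptr, &hints, &resolved) != 0) {
       return false;
     }
     ifaddrs *interfaces = nullptr;
     bool self = false;
     if (getifaddrs(&interfaces) == 0) {
       for (addrinfo *ai = resolved; ai && !self; ai = ai->ai_next) {
         for (ifaddrs *ifa = interfaces; ifa; ifa = ifa->ifa_next) {
           if (ifa->ifa_addr && same_address(ai->ai_addr, ifa->ifa_addr)) {
             self = true;
             break;
           }
         }
       }
       freeifaddrs(interfaces);
     }
     freeaddrinfo(resolved);
     return self;
   }

Whether the core exposes this as a "self" marker from FQDN resolution or lets the plugin return "self" and treats it as a failure is the design choice left open above.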
There was a long discussion on Lua data modeling. The general consensus is to go with the generic tree approach to reading Lua configs. The Lua configs create a single global tree that is the configuration. This is then copied into C++ memory space as a (roughly) TSConfig data structure (the current TSConfig parsing is replaced with reading from the Lua data). Core subsystems walk this C++ structure and generate their specialized internal state. Leif's analogy is that Lua is the front end syntax and TSConfig is the RTL / backend data structure generated by a compiler, which is then executed by the appropriate subsystem. This isolates the configuration file loading and Lua invocation to a single piece of code rather than scattering it and requiring every subsystem to interact with Lua. It also means the RTL phase can be run very early during process start, as it only copies generic data items and does no interpretation of that data. This avoids various nasty race conditions. A downside is that this is contrary to the current Lua config model, so those configs will need to be changed. (A minimal sketch of such a generic tree copy appears after the threads discussion below.)

Bryan and Alan note that, for them, the only key features required for the configuration are that it be nested and have references (so a common piece of data can be specified once and used many times inside the configuration). Leif wants to be able to load arbitrary configuration values from arbitrary files in arbitrary order. This is much easier with a global generic tree style, as new values can be put anywhere in the tree.

JEMALLOC
--------

One key issue is the non-fungibility of allocated memory. Once a byte of memory is allocated as an object of size N it can never be re-used for an object of a different size. This can lead to a situation where, due to a change in user agent behavior, memory becomes "trapped" in objects of the wrong size and ATS fails despite the presence of sufficient unused memory.

Using jemalloc can conflict with ``DONT_DUMP`` settings, which are needed for deployments with large physical memory. jemalloc provides hooks that would enable

* Setting ``DONT_DUMP`` for specific memory allocations (e.g. the IO buffers)
* NUMA affinity
* Huge pages

Phil says "just do it and stop your whining".

Threads
-------

Phil and Leif got into a long argument about threads. As I understand it, the consensus is

Event / non-blocking threads
   These are for either event based processing or for computations that are non-blocking but are in the direct line of data flow in a transaction. If a transaction can't proceed until the computation is done, there is no benefit to putting it on another thread.

Blocking threads
   For blocking or long lasting computations. These would run at a lower thread priority than the event threads.

In current terminology these correspond to ``ET_NET`` and ``ET_TASK`` threads. More specialized threads would be shifted to use the blocking threads (e.g. config reload). It would be good to move to more of a grouped dispatch model where a set of task threads is associated with a specific NUMA node and those threads share a dispatch queue.

Phil has a scheme where each AIO processor has an event file descriptor (``event_fd``). When a continuation schedules I/O it tells the event loop to wait on the ``event_fd`` with itself as the associated continuation. The AIO processor, upon completion of the I/O operations, just signals the ``event_fd``, which then wakes up the continuation on its original thread without event scheduling.
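A simplified sketch of that ``event_fd`` scheme, assuming Linux ``eventfd`` and ``epoll``. This is not Traffic Server code, just the shape of "the AIO thread signals a descriptor the event loop is already watching, and the continuation resumes on its own thread".

.. code-block:: cpp

   // A simplified sketch of the idea, not Traffic Server code: the AIO
   // thread signals completion by writing to an eventfd that the event
   // loop is already polling, so the waiting continuation is resumed on
   // its original thread without going through event scheduling.
   #include <cstdint>
   #include <functional>
   #include <thread>
   #include <sys/epoll.h>
   #include <sys/eventfd.h>
   #include <unistd.h>

   int main() {
     int efd = eventfd(0, 0); // completion signal for one AIO request
     std::function<void()> continuation = [] {
       // ... resume the transaction that was waiting on this I/O ...
     };

     // Event loop side: register the eventfd and wait for it.
     int ep = epoll_create1(0);
     epoll_event ev{};
     ev.events  = EPOLLIN;
     ev.data.fd = efd;
     epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev);

     // AIO side: perform the (blocking) disk I/O, then signal the eventfd.
     std::thread aio([efd] {
       // ... pread()/pwrite() against the cache device would go here ...
       uint64_t one = 1;
       (void)write(efd, &one, sizeof(one)); // wake the event loop
     });

     epoll_event out{};
     if (epoll_wait(ep, &out, 1, -1) == 1 && out.data.fd == efd) {
       uint64_t count = 0;
       (void)read(efd, &count, sizeof(count)); // clear the signal
       continuation();                         // runs on the event loop thread
     }

     aio.join();
     close(ep);
     close(efd);
     return 0;
   }

The appeal of this design is that the completion path costs one ``write()`` on the AIO side and no cross-thread event allocation on the event loop side.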
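Returning to the Lua data modeling consensus above, here is a minimal sketch of the generic tree copy, assuming a configuration file that assigns its table to a global named ``config``. The ``ConfigNode`` type and ``load_config`` function are illustrative assumptions, not the existing TSConfig code.

.. code-block:: cpp

   // A minimal sketch of the "generic tree" copy, not the existing TSConfig
   // code: load a Lua file that assigns its configuration to a global table
   // named `config`, then copy that table into a generic C++ tree which
   // subsystems can walk without ever touching the Lua state.
   extern "C" {
   #include <lauxlib.h>
   #include <lua.h>
   #include <lualib.h>
   }
   #include <map>
   #include <memory>
   #include <string>

   struct ConfigNode {
     enum class Type { Nil, Boolean, Number, String, Table } type = Type::Nil;
     bool boolean = false;
     double number = 0;
     std::string string;
     std::map<std::string, std::unique_ptr<ConfigNode>> children; // table fields
   };

   // Recursively copy the Lua value at the given stack index into a ConfigNode.
   static std::unique_ptr<ConfigNode> copy_value(lua_State *L, int index) {
     auto node = std::make_unique<ConfigNode>();
     switch (lua_type(L, index)) {
     case LUA_TBOOLEAN:
       node->type    = ConfigNode::Type::Boolean;
       node->boolean = lua_toboolean(L, index);
       break;
     case LUA_TNUMBER:
       node->type   = ConfigNode::Type::Number;
       node->number = lua_tonumber(L, index);
       break;
     case LUA_TSTRING:
       node->type   = ConfigNode::Type::String;
       node->string = lua_tostring(L, index);
       break;
     case LUA_TTABLE:
       node->type = ConfigNode::Type::Table;
       lua_pushnil(L); // first key
       // The table sits one slot deeper while a key is on the stack.
       while (lua_next(L, index < 0 ? index - 1 : index) != 0) {
         // key at -2, value at -1; stringify numeric keys for the generic tree
         std::string key = (lua_type(L, -2) == LUA_TNUMBER)
                             ? std::to_string(lua_tointeger(L, -2))
                             : lua_tostring(L, -2);
         node->children[key] = copy_value(L, -1);
         lua_pop(L, 1); // pop the value, keep the key for the next lua_next()
       }
       break;
     default:
       break; // functions, userdata, etc. are not configuration data
     }
     return node;
   }

   std::unique_ptr<ConfigNode> load_config(const char *path) {
     lua_State *L = luaL_newstate();
     luaL_openlibs(L);
     std::unique_ptr<ConfigNode> root;
     if (luaL_dofile(L, path) == 0) {
       lua_getglobal(L, "config"); // the single global configuration table
       root = copy_value(L, -1);
       lua_pop(L, 1);
     }
     lua_close(L); // the Lua state is gone; only the generic tree remains
     return root;
   }

The point of the consensus is visible in the last line: once the copy is done the Lua state can be discarded, so subsystems never interact with Lua and the copy can run very early in process start.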
Visions
-------

One goal that came up again like a bad dinner is the idea of being able to run two cooperating ATS instances. The purpose is to do a smooth(er) transition between configurations of ATS that require a restart. This involves some close coordination between the processes.

* ``SO_REUSEPORT`` would need to be implemented to have both processes share accepted connections (a minimal sketch follows this list).
* Many data structures would need to be moved to shared memory. This is not as bad as it sounds because any real deployment of ATS doesn't tolerate paging or swapping, so all memory is de facto wired already.
* Sharing the cache is probably the most challenging issue. One approach that could simplify the implementation is restricting writing to only one ATS process. This would in turn require

  * A coordination protocol between the instances to handle transfer of control, so cache writes do not have to wait for process start.
  * Potentially transferring writes from the old process to the starting process.
  * Locks in shared memory? This would involve sharing the OpenDirEntry objects in addition to the actual stripe directories.
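For the first bullet, a minimal sketch (not Traffic Server code) of a ``SO_REUSEPORT`` listener: two cooperating processes that both set the option can bind the same port, and the kernel distributes new connections between them, which is the prerequisite for handing traffic over during a restart. The port number is an arbitrary example.

.. code-block:: cpp

   // A minimal sketch, not Traffic Server code: a listener that sets
   // SO_REUSEPORT (Linux >= 3.9) so a second, cooperating process can
   // bind the same port and receive its share of new connections.
   // The port number is an arbitrary example.
   #include <cstdio>
   #include <arpa/inet.h>
   #include <netinet/in.h>
   #include <sys/socket.h>
   #include <unistd.h>

   int main() {
     int fd = socket(AF_INET, SOCK_STREAM, 0);
     int on = 1;
     if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) != 0) {
       perror("setsockopt(SO_REUSEPORT)");
       return 1;
     }
     sockaddr_in addr{};
     addr.sin_family      = AF_INET;
     addr.sin_port        = htons(8080);        // example listening port
     addr.sin_addr.s_addr = htonl(INADDR_ANY);
     if (bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) != 0) {
       perror("bind");
       return 1;
     }
     listen(fd, 128);
     // ... accept() loop; another process doing exactly the same gets its
     // share of incoming connections, which is what a smooth configuration
     // handover between two ATS instances would build on ...
     close(fd);
     return 0;
   }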