Last week a small group gathered to talk about various technical issues in
Traffic Server with a strong focus on "upstream selection" which is currently
handled by various mechanisms such as parent selection and CARP. These are the
notes from that meeting. As usual, nothing was actually decided but work is
starting on some of the discussed projects and that work will be presented
independently once it is more mature. At some point I will put this on the
wiki, once I remember my login credentials.
Next Gen Next Hop Mini-Summit
=============================
Use Cases
---------
Use Case:
An embedded CDN (where a specialized user agent does the access). When a
user agent does a host resolution it gets the address of a traffic router.
The traffic router accepts the initial user agent request and responds with
a 302 to redirect it to an edge (layer 1) proxy. An additional feature would
be a "replication count" here: the traffic router would pass back an ordered
list of proxies rather than a single one. Part of the reason for this is the
use of a user hosted cache, which cannot be effectively monitored; therefore
the initial redirect must, at a minimum, contain a failover from that cache.
Use Case:
The CDN has three layers.
Layer 1:
This does primarily layer 7 routing. Groups of machines share an IP
address and anycast is used to pick one of the machines. The ingress peer then
forwards the request to layer 2. Some caching is done in L1 but it's
unimportant.
Layer 2:
This is the "fast tier" which is only SSD storage. This uses a cache
promotion plugin to control writes to the SSDs.
Layer 3:
This is the "slow tier" using rotational storage. While slower this is
also much larger (~500TB per host).
Separately an "origin" exists which is two layers of caching over the "real"
origins. This is
currently S3 storage which can be highly variable in response times. The
pseudo-origin protects
the other CDNs from this variability.
Overall this architecture enables a real hybrid CDN. Different end user CDNs
can be deployed that
draw from the pseudo-origin to get content (e.g. in house CDN vs. Akamai).
Conversely different
storage origins can be used which are unified by the pseudo-origin.
General Topics
==============
Features
--------
RAM-only cache
Caching without any backing persistent storage.
Replace Cache Inspector with command line API.
It was agreed that the web interface can be removed if a host local
interface is provided. Leif wants this to be, first, a C based API which is
a cleaned up / improved version of what is already present, then a command
line tool (e.g. traffic_ctl or equivalent). If a web interface is needed it
should be separate (proprietary or something like Traffic Control).
Memory Management
Replace the current custom allocators with jemalloc. Phil notes jemalloc
has an API which provides a lot of control over how memory is allocated,
which could replace the functionality Traffic Server currently gets from
its custom memory systems.
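As a rough illustration of the kind of control Phil is referring to, a
minimal sketch (not a proposal for the actual integration) that creates a
dedicated jemalloc arena and allocates explicitly from it; the sizes and the
idea of a separate arena for IO buffers are assumptions::

    #include <jemalloc/jemalloc.h>
    #include <cstdio>

    int main()
    {
      // Create a dedicated arena, e.g. one reserved for IO buffers.
      unsigned arena = 0;
      size_t len = sizeof(arena);
      if (mallctl("arenas.create", &arena, &len, nullptr, 0) != 0) {
        return 1;
      }

      // Allocate 32KB, 4KB aligned, explicitly from that arena.
      void *buf = mallocx(32 * 1024, MALLOCX_ARENA(arena) | MALLOCX_ALIGN(4096));
      if (buf == nullptr) {
        return 1;
      }
      std::printf("allocated %p from arena %u\n", buf, arena);

      // Free with the same arena flag so the memory returns to that arena.
      dallocx(buf, MALLOCX_ARENA(arena));
      return 0;
    }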
Introspection
Leif wants much better introspection and access to internal state. An
example would be upstream health checks: such a plugin, or an external
process, should be able to "push" upstream status to Traffic Server (a
sketch of what that might look like follows the list below).
* Upstream status
* Upstream selection
* Cacheability
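As a sketch of the "push" interface mentioned above, with the caveat that
every name here (``TSUpstreamStatusSet``, the status values, the stub body)
is hypothetical and invented for illustration, not existing Traffic Server
API::

    #include <cstdio>

    // Hypothetical core entry point: upstream selection and HostDB would
    // consult the status pushed through here. The stub body only stands in
    // for the real thing so the sketch runs.
    enum TSUpstreamStatus { TS_UPSTREAM_UP, TS_UPSTREAM_DOWN }; // hypothetical
    void
    TSUpstreamStatusSet(const char *hostname, TSUpstreamStatus status)
    {
      std::printf("upstream %s marked %s\n", hostname,
                  status == TS_UPSTREAM_UP ? "up" : "down");
    }

    // A health check plugin (or an external checker relayed through one)
    // pushes the result of each probe into the core.
    void
    report_probe_result(const char *hostname, bool healthy)
    {
      TSUpstreamStatusSet(hostname, healthy ? TS_UPSTREAM_UP : TS_UPSTREAM_DOWN);
    }

    int main()
    {
      report_probe_result("mid-1.example.com", true);
      report_probe_result("mid-2.example.com", false);
      return 0;
    }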
Process unification
Phil wants to unify traffic_cop, traffic_manager, and traffic_server: have
just one process and depend on more mature OS hosted process management.
Remap/Parent/Override Unification
Phil says we need to do all this at once. The client request is used to
select a configuration
object which specifies all of these at one time.
Issues
------
Sudheer's Junk
The write fail action does not work quite right.
Serialization of non-cacheable content requests
Same URL but distinct content, which is therefore marked as non-cacheable
by the upstream. If UA2 has to wait for UA1's response to come back but
cannot use that response, there can be very long delays if these requests
get stacked.
Thundering herd and connection collapsing
This should be handled by POC.
CLFUS broken
Gets bad hit rates after a "while".
Management API should be bidirectional.
Enable the traffic_server process to pass back information to command line
tools. Note that if
the processes are unified then making the management API bidirectional
becomes trivial.
* Plugin responses.
* Cache inspection API accessible from command line.
Upstream health status
How is this tracked? If there are active health checks, what is their
scope? It is an open issue how the list is maintained, where the data is
maintained, and whether the upstream status is per upstream pod or global
(via HostDB).
General Next Hop Selection Consolidation Issues
----------------------------------------------
* If TheQuartz is moved into the core it may be desirable to also move some
features of cache promotion into the core as well. The question is how much
of a problem hot objects are in the general case and whether dealing with
that (as cache promotion does) is necessary.
* Leif suggests that the parent selection configuration can be simplified
even further by unifying "ring" and "pod", making the selection criteria a
property rather than something fundamental.
* Look at having a callback during upstream selection. It should have
access to the selection object for the transaction and be able either to
completely override the selection or to "run" the existing selection and
then tweak it (sketched below).
A secondary point here is moving all upstream selection to plugins and
providing default plugins that perform the current upstream selection
functionality.
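To make the callback idea concrete, a small standalone sketch (all names and
types here are invented for illustration; none of this is Traffic Server
API) of a selection hook that can either replace the default choice outright
or run the existing selection and then tweak the result::

    #include <cstdio>
    #include <functional>
    #include <string>
    #include <vector>

    struct UpstreamTarget {
      std::string host;
      int port;
    };

    // The hook sees the candidate list plus a callable that runs the default
    // selection; it may return its own target or call the default and adjust.
    using SelectHook = std::function<UpstreamTarget(
      const std::vector<UpstreamTarget> &candidates,
      const std::function<UpstreamTarget()> &run_default)>;

    UpstreamTarget
    select_upstream(const std::vector<UpstreamTarget> &candidates,
                    const SelectHook &hook)
    {
      // Stand-in for the existing selection logic (parent selection, CARP...).
      auto run_default = [&candidates]() { return candidates.front(); };
      return hook ? hook(candidates, run_default) : run_default();
    }

    int main()
    {
      std::vector<UpstreamTarget> candidates = {{"mid-1.example", 8080},
                                                {"mid-2.example", 8080}};
      // Example hook: run the existing selection, then "tweak" the result.
      SelectHook hook = [](const std::vector<UpstreamTarget> &,
                           const std::function<UpstreamTarget()> &run_default) {
        UpstreamTarget t = run_default();
        t.port = 8443;
        return t;
      };
      UpstreamTarget t = select_upstream(candidates, hook);
      std::printf("selected %s:%d\n", t.host.c_str(), t.port);
      return 0;
    }

The secondary point above would then amount to shipping a default hook that
implements the current parent selection behavior.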
Tasks
-----
# Self detection (is a node in an upstream selection configuration this
host?). See the sketch after this list.
# Active health checks of some sort.
* Standard plugin?
* Need API for the plugin to access the upstream node status.
# API for upstream selection.
* Want to minimize HttpSM interaction. HttpSM should only know about
"select next upstream target".
* A full API will require API to examine / manipulate upstream selection
objects. This may push the overall target out.
* IP generator interface to HttpSM?
# Lua config.
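One possible shape for the self detection task, sketched only to show the
idea (resolve the configured node name and compare against the host's own
interface addresses); this is an illustration, not proposed core code::

    #include <cstdio>
    #include <cstring>
    #include <ifaddrs.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    // Return true if any address the configured node name resolves to is
    // bound to a local interface, i.e. the configured node is this host.
    static bool
    is_self(const char *name)
    {
      addrinfo hints{}, *resolved = nullptr;
      hints.ai_family   = AF_UNSPEC;
      hints.ai_socktype = SOCK_STREAM;
      if (getaddrinfo(name, nullptr, &hints, &resolved) != 0) {
        return false;
      }

      ifaddrs *ifs = nullptr;
      bool match = false;
      if (getifaddrs(&ifs) == 0) {
        for (addrinfo *ai = resolved; ai && !match; ai = ai->ai_next) {
          for (ifaddrs *ifa = ifs; ifa && !match; ifa = ifa->ifa_next) {
            if (!ifa->ifa_addr || ifa->ifa_addr->sa_family != ai->ai_family) {
              continue;
            }
            if (ai->ai_family == AF_INET) {
              auto *a = reinterpret_cast<sockaddr_in *>(ai->ai_addr);
              auto *b = reinterpret_cast<sockaddr_in *>(ifa->ifa_addr);
              match = (a->sin_addr.s_addr == b->sin_addr.s_addr);
            } else if (ai->ai_family == AF_INET6) {
              auto *a = reinterpret_cast<sockaddr_in6 *>(ai->ai_addr);
              auto *b = reinterpret_cast<sockaddr_in6 *>(ifa->ifa_addr);
              match = (std::memcmp(&a->sin6_addr, &b->sin6_addr,
                                   sizeof(a->sin6_addr)) == 0);
            }
          }
        }
        freeifaddrs(ifs);
      }
      freeaddrinfo(resolved);
      return match;
    }

    int main(int argc, char **argv)
    {
      const char *name = argc > 1 ? argv[1] : "localhost";
      std::printf("%s %s this host\n", name, is_self(name) ? "is" : "is not");
      return 0;
    }

The alternative discussed later, having FQDN resolution return a "self"
marker, would fold this into the resolution path instead of into each
plugin.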
Dreams
------
Remap / upstream / Lua consolidation
Start by updating parent.config to upstream.config. This starts as a Lua
configuration and contains the skeleton of being able to specify a full
configuration override. In the long term this replaces remap.config,
parent.config, and potentially all of the other configs. This also implies
that we have a better namespace setup so that any configuration value is
globally unambiguous.
A key question is how the lookup is done to find the correct element in
upstream.config based on the user agent request. An element would have a
member describing its matching criteria: which properties of the request
are examined and the values against which they are checked.
Leif of course wants to be able to specify Lua scripts to "do stuff"
instead of having explicit overrides.
Leif says we all need to write example configurations in this style and
iteratively converge on an acceptable format.
Later conversations turned to the idea of having only an "upstream
selection" set of API callbacks, in the same manner as the current remap
plugins work. In this case the current functionality would be placed in a
plugin distributed with the source code. All of the parent / upstream
selection logic in the core would be removed and put into the plugin. This
would provide the basics along with the ability to extend the plugin or
replace it entirely. One function that needs to remain in the core, because
every plugin will use it, is the self identification mechanism. One
possibility is to have the FQDN resolution return a "self" marker.
Alternatively the plugin can return self and the core can treat that as a
"failure" and immediately ask for the next upstream target.
Long discussion on Lua data modeling. The general consensus is to go with
the generic tree approach to reading Lua configs. The Lua configs create a
single global tree that is the configuration. This is then copied into C++
memory space as a (roughly) TSConfig data structure (the current TSConfig
parsing is replaced with reading from the Lua data). Core subsystems walk
this C++ structure and generate their specialized internal state. Leif's
analogy: Lua is the front end syntax, TSConfig is the RTL / backend data
structure generated by a compiler, and the appropriate subsystem then
executes it. This isolates the configuration file loading and Lua
invocation to a single piece of code rather than scattering it and
requiring every subsystem to interact with Lua. It also means the RTL phase
can be run very early during process start, as it only copies generic data
items and does no interpretation of that data. This avoids various nasty
race conditions.
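To make the "copy the Lua tree into C++ memory" step concrete, a minimal
sketch using the stock Lua C API. The global table name ("ts"), the file
name, and the node layout are assumptions for illustration; this is not the
actual TSConfig structure::

    #include <lua.hpp>
    #include <map>
    #include <memory>
    #include <string>
    #include <variant>

    // Generic configuration node: a scalar or a table of named children,
    // roughly the shape a TSConfig-style structure would hold.
    struct ConfigNode {
      std::variant<std::string, double, bool> value;
      std::map<std::string, std::unique_ptr<ConfigNode>> children;
    };

    static std::unique_ptr<ConfigNode> copy_value(lua_State *L);

    // Copy the Lua table at the top of the stack. No interpretation is done
    // here; core subsystems walk the resulting tree later.
    static std::unique_ptr<ConfigNode>
    copy_table(lua_State *L)
    {
      auto node = std::make_unique<ConfigNode>();
      lua_pushnil(L); // first key
      while (lua_next(L, -2) != 0) {
        // Stack is now: ... table, key, value.
        std::string key;
        if (lua_type(L, -2) == LUA_TSTRING) {
          key = lua_tostring(L, -2);
        } else if (lua_type(L, -2) == LUA_TNUMBER) {
          key = std::to_string(lua_tonumber(L, -2));
        } else {
          lua_pop(L, 1); // skip exotic key types in this sketch
          continue;
        }
        node->children[key] = copy_value(L);
        lua_pop(L, 1); // pop the value, keep the key for lua_next
      }
      return node;
    }

    static std::unique_ptr<ConfigNode>
    copy_value(lua_State *L)
    {
      if (lua_istable(L, -1)) {
        return copy_table(L);
      }
      auto node = std::make_unique<ConfigNode>();
      if (lua_isboolean(L, -1)) {
        node->value = static_cast<bool>(lua_toboolean(L, -1));
      } else if (lua_isnumber(L, -1)) {
        node->value = static_cast<double>(lua_tonumber(L, -1));
      } else if (lua_isstring(L, -1)) {
        node->value = std::string(lua_tostring(L, -1));
      }
      return node;
    }

    int main()
    {
      lua_State *L = luaL_newstate();
      luaL_openlibs(L);
      // Assumes the config script fills in a single global table named "ts".
      if (luaL_dofile(L, "upstream.config") == 0) {
        lua_getglobal(L, "ts");
        if (lua_istable(L, -1)) {
          std::unique_ptr<ConfigNode> root = copy_table(L);
          // ... hand 'root' to the core subsystems here ...
        }
      }
      lua_close(L);
      return 0;
    }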
A downside is that this is contrary to the current Lua config model, so
those configs will need to be changed.
Bryan and Alan note that, for them, the only key features required of the
configuration are that it be nested and have references (so a common piece
of data can be specified once and used many times inside the
configuration).
Leif wants to be able to load arbitrary configuration values from
arbitrary files in arbitrary order. This is much easier with a global
generic tree style, as new values can be inserted anywhere in the tree.
JEMALLOC
--------
One key issue is the non-fungibility of allocated memory. Once a byte of
memory is allocated as part of an object of size N it can never be re-used
for an object of a different size. This can lead to a situation where, due
to a change in user agent behavior, memory is "trapped" in objects of the
wrong size and ATS fails despite the presence of sufficient unused memory.
Using jemalloc can conflict with ``DONT_DUMP`` settings which are needed for
deployments with large
physical memory.
jemalloc provides hooks that would enable
* Setting ``DONT_DUMP`` for specific memory allocations (e.g. the IO buffers)
* NUMA affinity
* Huge pages
Phil says "just do it and stop your whining".
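A rough sketch of the ``DONT_DUMP`` point: mark a large IO buffer region so
it is excluded from core dumps. Here it is done with ``madvise`` directly on
a page aligned allocation; a real integration would presumably do this in a
jemalloc extent hook when the memory is mapped::

    #include <jemalloc/jemalloc.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdio>

    int main()
    {
      // Page aligned allocation standing in for a large IO buffer region.
      size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
      size_t size = 64 * 1024 * 1024;
      void *buf = mallocx(size, MALLOCX_ALIGN(page));
      if (buf == nullptr) {
        return 1;
      }

      // Exclude these pages from core dumps (Linux MADV_DONTDUMP). The flag
      // sticks to the pages, which is why doing it at the extent level
      // inside jemalloc would be the cleaner place for it.
      if (madvise(buf, size, MADV_DONTDUMP) != 0) {
        perror("madvise");
      }

      std::printf("buffer %p (%zu bytes) excluded from core dumps\n", buf, size);
      dallocx(buf, 0);
      return 0;
    }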
Threads
-------
Phil and Leif got into a long argument about threads. As I understand it,
the consensus is:
Event / non-blocking threads
These are for either event based processing or for computations that are
non-blocking but are in
the direct line of data flow in a transaction. If a transaction can't
proceed until the
computation is done there is no benefit to putting it on another thread.
Blocking threads
For blocking or long lasting computations. These would run at a lower thread
priority than event
threads.
In current terminology these would correspond to ``ET_NET`` and ``ET_TASK``
threads. The more specialized threads would be shifted to use the blocking
threads (e.g. config reload). It would be good to move to more of a grouped
dispatch model where a set of task threads is associated with a specific
NUMA node and those threads share a dispatch queue.
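A minimal sketch of the grouped dispatch idea: one queue shared by the task
threads of a group (one group per NUMA node; the actual CPU pinning, e.g.
via pthread_setaffinity_np, is omitted). This only illustrates the shape,
it is not proposed code::

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // One dispatch queue shared by the task threads of a single group.
    class TaskGroup {
    public:
      explicit TaskGroup(int nthreads)
      {
        for (int i = 0; i < nthreads; ++i) {
          workers_.emplace_back([this] { run(); });
        }
      }

      ~TaskGroup()
      {
        {
          std::lock_guard<std::mutex> lock(mutex_);
          done_ = true;
        }
        cv_.notify_all();
        for (auto &t : workers_) {
          t.join();
        }
      }

      // Blocking / long running work (e.g. a config reload) goes here rather
      // than onto the event threads.
      void dispatch(std::function<void()> job)
      {
        {
          std::lock_guard<std::mutex> lock(mutex_);
          jobs_.push(std::move(job));
        }
        cv_.notify_one();
      }

    private:
      void run()
      {
        for (;;) {
          std::function<void()> job;
          {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
            if (done_ && jobs_.empty()) {
              return;
            }
            job = std::move(jobs_.front());
            jobs_.pop();
          }
          job();
        }
      }

      std::mutex mutex_;
      std::condition_variable cv_;
      std::queue<std::function<void()>> jobs_;
      std::vector<std::thread> workers_;
      bool done_ = false;
    };

A group like this would be created per NUMA node, with blocking work
dispatched to the group local to the data it touches.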
Phil has a scheme where each AIO processor has an event file descriptor
(``event_fd``). When a continuation schedules I/O it tells the event loop to
wait on the ``event_fd`` with itself as the associated continuation. The AIO
processor, upon completion of the I/O operations, just signals the
``event_fd``, which then wakes up the continuation on its original thread
without event scheduling.
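A compressed sketch of that scheme using the kernel primitives involved
(``eventfd`` plus ``epoll``), collapsed into one small program; in ATS the
epoll loop would be the continuation's original event thread and the
signalling side would be the AIO processor::

    #include <sys/epoll.h>
    #include <sys/eventfd.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    int main()
    {
      // The continuation's event loop registers the AIO processor's event_fd
      // and waits on it; in ATS the epoll data would carry the continuation.
      int efd = eventfd(0, EFD_NONBLOCK);
      int ep  = epoll_create1(0);
      epoll_event ev{};
      ev.events  = EPOLLIN;
      ev.data.fd = efd;
      epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev);

      // Stand-in for the AIO processor: when the I/O operations complete it
      // just signals the event_fd; no cross-thread event scheduling needed.
      std::thread aio([efd] {
        uint64_t one = 1;
        if (write(efd, &one, sizeof(one)) < 0) {
          perror("write");
        }
      });

      // The original thread wakes up and "resumes the continuation" here.
      epoll_event out{};
      if (epoll_wait(ep, &out, 1, -1) == 1) {
        uint64_t completions = 0;
        if (read(out.data.fd, &completions, sizeof(completions)) > 0) {
          std::printf("AIO completion signalled (%llu)\n",
                      static_cast<unsigned long long>(completions));
        }
      }

      aio.join();
      close(ep);
      close(efd);
      return 0;
    }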
Visions
-------
One goal that came up again like a bad dinner is the idea of being able to run
two cooperating ATS
instances. The purpose is to do a smooth(er) transition between configurations
of ATS that require a
restart. This involves some close coordination between the processes.
* ``SO_REUSEPORT`` support would need to be implemented so that both
processes can accept connections on the same ports (see the sketch after
this list).
* Many data structures would need to be moved to shared memory. This is not as
bad as it sounds
because any real deployment of ATS doesn't tolerate paging or swapping so
all memory is de facto
wired already.
* Sharing the cache is probably the most challenging issue. One approach that
could simplify the
implementation is restricting writing to only one ATS process. This would in
turn require
* A coordination protocol between the instances to handle transfer of
control so cache writes do
not have to wait for process start.
* Potentially transfer writes from the old process to the starting process.
* Locks in shared memory? This would involve sharing the OpenDirEntry
objects in addition to the
actual stripe directories.
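For the ``SO_REUSEPORT`` piece, a small sketch of what each cooperating
instance would do: both processes bind their own listening socket to the
same port and the kernel distributes incoming connections between them. The
port number is arbitrary; run two copies to see the sharing::

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdio>

    int main()
    {
      // Each cooperating instance opens its own listener on the same port;
      // with SO_REUSEPORT the kernel spreads new connections across them.
      int fd = socket(AF_INET, SOCK_STREAM, 0);
      int on = 1;
      if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) != 0) {
        perror("SO_REUSEPORT");
        return 1;
      }

      sockaddr_in addr{};
      addr.sin_family      = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port        = htons(8080); // arbitrary port for the sketch
      if (bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) != 0 ||
          listen(fd, SOMAXCONN) != 0) {
        perror("bind/listen");
        return 1;
      }

      std::printf("pid %d listening on :8080; start a second copy to share it\n",
                  static_cast<int>(getpid()));
      int client = accept(fd, nullptr, nullptr);
      if (client >= 0) {
        close(client);
      }
      close(fd);
      return 0;
    }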