Last week a small group gathered to talk about various technical issues in
Traffic Server with a strong focus on "upstream selection" which is currently
handled by various mechanisms such as parent selection and CARP. These are the
notes from that meeting. As usual, nothing was actually decided but work is
starting on some of the discussed projects and that work will be presented
independently once it is more mature. At some point I will put this on the
wiki, once I remember my login credentials.
Next Gen Next Hop Mini-Summit
=============================
Use Cases
---------
Use Case:
An embedded CDN (where a specialized user agent does the access). When a
user agent does a host resolution it gets the address of a traffic router.
The traffic router accepts the initial user agent request and responds with
a 302 to redirect it to an edge (layer 1) proxy. An additional feature would
be a "replication count" here: the traffic router would pass back an ordered
list of proxies rather than a single one. Part of the reason for this is the
use of a user hosted cache, which cannot be effectively monitored; therefore
the initial redirect must, at a minimum, contain a failover from that cache.
Use Case:
The CDN has three layers.
Layer 1:
This does primarily layer 7 routing. Groups of machines share an IP
address and anycast is used to pick one of the machines. The ingress peer then
forwards the request to layer 2. Some caching is done in L1 but it's
unimportant.
Layer 2:
This is the "fast tier" which is only SSD storage. This uses a cache
promotion plugin to control writes to the SSDs.
Layer 3:
This is the "slow tier" using rotational storage. While slower this is
also much larger (~500TB per host).
Separately an "origin" exists which is two layers of caching over the "real"
origins. This is
currently S3 storage which can be highly variable in response times. The
pseudo-origin protects
the other CDNs from this variability.
Overall this architecture enables a real hybrid CDN. Different end user CDNs
can be deployed that
draw from the pseudo-origin to get content (e.g. in house CDN vs. Akamai).
Conversely different
storage origins can be used which are unified by the pseudo-origin.
General Topics
==============
Features
--------
RAM-only cache
Caching without any backing persistent storage.
Replace Cache Inspector with command line API.
It was agreed that the web interface can be removed if a host local
interface is provided. Leif wants this to be, first, a C based API which is
a cleaned up / improved version of what is already present, then a command
line tool (e.g. traffic_ctl or equivalent). If a web interface is needed it
should be separate (proprietary or something like Traffic Control).
Memory Management
Replace the current custom allocators with jemalloc. Phil notes jemalloc
has an API which provides a lot of control over how memory is allocated,
which could replace the functionality Traffic Server currently gets from
its custom memory systems.
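As a rough illustration of the kind of control Phil is referring to, a
minimal sketch (not a proposal for the actual integration) that creates a
dedicated jemalloc arena and allocates explicitly from it; the sizes and the
idea of a separate arena for IO buffers are assumptions::

    #include <jemalloc/jemalloc.h>
    #include <cstdio>

    int main()
    {
      // Create a dedicated arena, e.g. one reserved for IO buffers.
      unsigned arena = 0;
      size_t len = sizeof(arena);
      if (mallctl("arenas.create", &arena, &len, nullptr, 0) != 0) {
        return 1;
      }

      // Allocate 32KB, 4KB aligned, explicitly from that arena.
      void *buf = mallocx(32 * 1024, MALLOCX_ARENA(arena) | MALLOCX_ALIGN(4096));
      if (buf == nullptr) {
        return 1;
      }
      std::printf("allocated %p from arena %u\n", buf, arena);

      // Free with the same arena flag so the memory returns to that arena.
      dallocx(buf, MALLOCX_ARENA(arena));
      return 0;
    }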
Introspection
Leif wants much better introspection and access to internal state. An
example would be upstream health checks: such a plugin, or an external
process, should be able to "push" upstream status to Traffic Server (a
sketch of what that might look like follows the list below).
* Upstream status
* Upstream selection
* Cacheability
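As a sketch of the "push" interface mentioned above, with the caveat that
every name here (``TSUpstreamStatusSet``, the status values, the stub body)
is hypothetical and invented for illustration, not existing Traffic Server
API::

    #include <cstdio>

    // Hypothetical core entry point: upstream selection and HostDB would
    // consult the status pushed through here. The stub body only stands in
    // for the real thing so the sketch runs.
    enum TSUpstreamStatus { TS_UPSTREAM_UP, TS_UPSTREAM_DOWN }; // hypothetical
    void
    TSUpstreamStatusSet(const char *hostname, TSUpstreamStatus status)
    {
      std::printf("upstream %s marked %s\n", hostname,
                  status == TS_UPSTREAM_UP ? "up" : "down");
    }

    // A health check plugin (or an external checker relayed through one)
    // pushes the result of each probe into the core.
    void
    report_probe_result(const char *hostname, bool healthy)
    {
      TSUpstreamStatusSet(hostname, healthy ? TS_UPSTREAM_UP : TS_UPSTREAM_DOWN);
    }

    int main()
    {
      report_probe_result("mid-1.example.com", true);
      report_probe_result("mid-2.example.com", false);
      return 0;
    }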
Process unification
Phil wants to unify traffic_cop, traffic_manager, and traffic_server: have
just one process and depend on more mature OS hosted process management.
Remap/Parent/Override Unification
Phil says we need to do all this at once. The client request is used to
select a configuration
object which specifies all of these at one time.
Issues
------
Sudheer's Junk
The write fail action does not work quite right.
Serialization of non-cacheable content requests
Same URL but distinct content, which is therefore marked as non-cacheable
by the upstream. If UA2 has to wait for UA1's response to come back but
cannot use that response, there can be very long delays if these requests
get stacked.
Thundering herd and connection collapsing
This should be handled by POC.
CLFUS broken
Gets bad hit rates after a "while".
Management API should be bidirectional.
Enable the traffic_server process to pass back information to command line
tools. Note that if
the processes are unified then making the management API bidirectional
becomes trivial.
* Plugin responses.
* Cache inspection API accessible from command line.
Upstream health status
How is this tracked? If there are active health checks, what is their
scope? It is an open issue how the list is maintained, where the data is
maintained, and whether the upstream status is per upstream pod or global
(via HostDB).
General Next Hop Selection Consolidation Issues
----------------------------------------------
* If TheQuartz is moved into the core it may be desirable to also move some
features of cache promotion into the core as well. The question is how much
of a problem hot objects are in the general case and whether dealing with
that (as cache promotion does) is necessary.
* Leif suggests that the parent selection configuration can be simplified
even further by unifying "ring" and "pod", making the selection criteria a
property rather than something fundamental.
* Look at having a callback during upstream selection. It should have
access to the selection object for the transaction and be able either to
completely override the selection or to "run" the existing selection and
then tweak it (sketched below).
A secondary point here is moving all upstream selection to plugins and
providing default plugins that perform the current upstream selection
functionality.
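To make the callback idea concrete, a small standalone sketch (all names and
types here are invented for illustration; none of this is Traffic Server
API) of a selection hook that can either replace the default choice outright
or run the existing selection and then tweak the result::

    #include <cstdio>
    #include <functional>
    #include <string>
    #include <vector>

    struct UpstreamTarget {
      std::string host;
      int port;
    };

    // The hook sees the candidate list plus a callable that runs the default
    // selection; it may return its own target or call the default and adjust.
    using SelectHook = std::function<UpstreamTarget(
      const std::vector<UpstreamTarget> &candidates,
      const std::function<UpstreamTarget()> &run_default)>;

    UpstreamTarget
    select_upstream(const std::vector<UpstreamTarget> &candidates,
                    const SelectHook &hook)
    {
      // Stand-in for the existing selection logic (parent selection, CARP...).
      auto run_default = [&candidates]() { return candidates.front(); };
      return hook ? hook(candidates, run_default) : run_default();
    }

    int main()
    {
      std::vector<UpstreamTarget> candidates = {{"mid-1.example", 8080},
                                                {"mid-2.example", 8080}};
      // Example hook: run the existing selection, then "tweak" the result.
      SelectHook hook = [](const std::vector<UpstreamTarget> &,
                           const std::function<UpstreamTarget()> &run_default) {
        UpstreamTarget t = run_default();
        t.port = 8443;
        return t;
      };
      UpstreamTarget t = select_upstream(candidates, hook);
      std::printf("selected %s:%d\n", t.host.c_str(), t.port);
      return 0;
    }

The secondary point above would then amount to shipping a default hook that
implements the current parent selection behavior.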
Tasks
-----
# Self detection (is a node in an upstream selection configuration this
host?). See the sketch after this list.
# Active health checks of some sort.
* Standard plugin?
* Need API for the plugin to access the upstream node status.
# API for upstream selection.
* Want to minimize HttpSM interaction. HttpSM should only know about
"select next upstream target".
* A full API will require API to examine / manipulate upstream selection
objects. This may push the overall target out.
* IP generator interface to HttpSM?
# Lua config.
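One possible shape for the self detection task, sketched only to show the
idea (resolve the configured node name and compare against the host's own
interface addresses); this is an illustration, not proposed core code::

    #include <cstdio>
    #include <cstring>
    #include <ifaddrs.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    // Return true if any address the configured node name resolves to is
    // bound to a local interface, i.e. the configured node is this host.
    static bool
    is_self(const char *name)
    {
      addrinfo hints{}, *resolved = nullptr;
      hints.ai_family   = AF_UNSPEC;
      hints.ai_socktype = SOCK_STREAM;
      if (getaddrinfo(name, nullptr, &hints, &resolved) != 0) {
        return false;
      }

      ifaddrs *ifs = nullptr;
      bool match = false;
      if (getifaddrs(&ifs) == 0) {
        for (addrinfo *ai = resolved; ai && !match; ai = ai->ai_next) {
          for (ifaddrs *ifa = ifs; ifa && !match; ifa = ifa->ifa_next) {
            if (!ifa->ifa_addr || ifa->ifa_addr->sa_family != ai->ai_family) {
              continue;
            }
            if (ai->ai_family == AF_INET) {
              auto *a = reinterpret_cast<sockaddr_in *>(ai->ai_addr);
              auto *b = reinterpret_cast<sockaddr_in *>(ifa->ifa_addr);
              match = (a->sin_addr.s_addr == b->sin_addr.s_addr);
            } else if (ai->ai_family == AF_INET6) {
              auto *a = reinterpret_cast<sockaddr_in6 *>(ai->ai_addr);
              auto *b = reinterpret_cast<sockaddr_in6 *>(ifa->ifa_addr);
              match = (std::memcmp(&a->sin6_addr, &b->sin6_addr,
                                   sizeof(a->sin6_addr)) == 0);
            }
          }
        }
        freeifaddrs(ifs);
      }
      freeaddrinfo(resolved);
      return match;
    }

    int main(int argc, char **argv)
    {
      const char *name = argc > 1 ? argv[1] : "localhost";
      std::printf("%s %s this host\n", name, is_self(name) ? "is" : "is not");
      return 0;
    }

The alternative discussed later, having FQDN resolution return a "self"
marker, would fold this into the resolution path instead of into each
plugin.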
Dreams
------
Remap / upstream / Lua consolidation
Start by updating parent.config to upstream.config. This starts as a Lua
configuration and contains the skeleton of being able to specify a full
configuration override. In the long term this replaces remap.config,
parent.config, and potentially all of the other configs. This also implies
that we have a better namespace setup so that any configuration value is
globally unambiguous.
A key question is how the lookup is done to find the correct element in
upstream.config based on the user agent request. An element would have a
member describing its matching criteria: which properties of the request
are examined and the values against which they are checked.
Leif of course wants to be able to specify Lua scripts to "do stuff"
instead of having explicit overrides.
Leif says we all need to write example configurations in this style and
iteratively converge on an acceptable format.
Later conversations turned to the idea of having only an "upstream
selection" set of API callbacks, in the same manner as the current remap
plugins work. In this case the current functionality would be placed in a
plugin distributed with the source code. All of the parent / upstream
selection logic in the core would be removed and put into the plugin. This
would provide the basics along with the ability to extend the plugin or
replace it entirely. One function that needs to remain in the core, because
every plugin will use it, is the self identification mechanism. One
possibility is to have the FQDN resolution return a "self" marker.
Alternatively the plugin can return self and the core can treat that as a
"failure" and immediately ask for the next upstream target.
Long discussion on Lua data modeling. The general consensus is to go with
the generic tree approach to reading Lua configs. The Lua configs create a
single global tree that is the configuration. This is then copied into C++
memory space as a (roughly) TSConfig data structure (the current TSConfig
parsing is replaced with reading from the Lua data). Core subsystems walk
this C++ structure and generate their specialized internal state. Leif's
analogy: Lua is the front end syntax, TSConfig is the RTL / backend data
structure generated by a compiler, and the appropriate subsystem then
executes it. This isolates the configuration file loading and Lua
invocation to a single piece of code rather than scattering it and
requiring every subsystem to interact with Lua. It also means the RTL phase
can be run very early during process start, as it only copies generic data
items and does no interpretation of that data. This avoids various nasty
race conditions.
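To make the "copy the Lua tree into C++ memory" step concrete, a minimal
sketch using the stock Lua C API. The global table name ("ts"), the file
name, and the node layout are assumptions for illustration; this is not the
actual TSConfig structure::

    #include <lua.hpp>
    #include <map>
    #include <memory>
    #include <string>
    #include <variant>

    // Generic configuration node: a scalar or a table of named children,
    // roughly the shape a TSConfig-style structure would hold.
    struct ConfigNode {
      std::variant<std::string, double, bool> value;
      std::map<std::string, std::unique_ptr<ConfigNode>> children;
    };

    static std::unique_ptr<ConfigNode> copy_value(lua_State *L);

    // Copy the Lua table at the top of the stack. No interpretation is done
    // here; core subsystems walk the resulting tree later.
    static std::unique_ptr<ConfigNode>
    copy_table(lua_State *L)
    {
      auto node = std::make_unique<ConfigNode>();
      lua_pushnil(L); // first key
      while (lua_next(L, -2) != 0) {
        // Stack is now: ... table, key, value.
        std::string key;
        if (lua_type(L, -2) == LUA_TSTRING) {
          key = lua_tostring(L, -2);
        } else if (lua_type(L, -2) == LUA_TNUMBER) {
          key = std::to_string(lua_tonumber(L, -2));
        } else {
          lua_pop(L, 1); // skip exotic key types in this sketch
          continue;
        }
        node->children[key] = copy_value(L);
        lua_pop(L, 1); // pop the value, keep the key for lua_next
      }
      return node;
    }

    static std::unique_ptr<ConfigNode>
    copy_value(lua_State *L)
    {
      if (lua_istable(L, -1)) {
        return copy_table(L);
      }
      auto node = std::make_unique<ConfigNode>();
      if (lua_isboolean(L, -1)) {
        node->value = static_cast<bool>(lua_toboolean(L, -1));
      } else if (lua_isnumber(L, -1)) {
        node->value = static_cast<double>(lua_tonumber(L, -1));
      } else if (lua_isstring(L, -1)) {
        node->value = std::string(lua_tostring(L, -1));
      }
      return node;
    }

    int main()
    {
      lua_State *L = luaL_newstate();
      luaL_openlibs(L);
      // Assumes the config script fills in a single global table named "ts".
      if (luaL_dofile(L, "upstream.config") == 0) {
        lua_getglobal(L, "ts");
        if (lua_istable(L, -1)) {
          std::unique_ptr<ConfigNode> root = copy_table(L);
          // ... hand 'root' to the core subsystems here ...
        }
      }
      lua_close(L);
      return 0;
    }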
A downside is that this is contrary to the current Lua config model, so
those configs will need to be changed.
Bryan and Alan note that, for them, the only key features required of the
configuration are that it be nested and have references (so a common piece
of data can be specified once and used many times inside the
configuration).
Leif wants to be able to load arbitrary configuration values from
arbitrary files in arbitrary order. This is much easier with a global
generic tree style, as new values can be inserted anywhere in the tree.
JEMALLOC
--------
One key issue is the non-fungibility of allocated memory. Once a byte of
memory is allocated as part of an object of size N it can never be re-used
for an object of a different size. This can lead to a situation where, due
to a change in user agent behavior, memory is "trapped" in objects of the
wrong size and ATS fails despite the presence of sufficient unused memory.
Using jemalloc can conflict with ``DONT_DUMP`` settings which are needed for
deployments with large
physical memory.
jemalloc provides hooks that would enable
* Setting ``DONT_DUMP`` for specific memory allocations (e.g. the IO buffers)
* NUMA affinity
* Huge pages
Phil says "just do it and stop your whining".
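A rough sketch of the ``DONT_DUMP`` point: mark a large IO buffer region so
it is excluded from core dumps. Here it is done with ``madvise`` directly on
a page aligned allocation; a real integration would presumably do this in a
jemalloc extent hook when the memory is mapped::

    #include <jemalloc/jemalloc.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdio>

    int main()
    {
      // Page aligned allocation standing in for a large IO buffer region.
      size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
      size_t size = 64 * 1024 * 1024;
      void *buf = mallocx(size, MALLOCX_ALIGN(page));
      if (buf == nullptr) {
        return 1;
      }

      // Exclude these pages from core dumps (Linux MADV_DONTDUMP). The flag
      // sticks to the pages, which is why doing it at the extent level
      // inside jemalloc would be the cleaner place for it.
      if (madvise(buf, size, MADV_DONTDUMP) != 0) {
        perror("madvise");
      }

      std::printf("buffer %p (%zu bytes) excluded from core dumps\n", buf, size);
      dallocx(buf, 0);
      return 0;
    }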
Threads
-------
Phil and Leif got into a long argument about threads. As I understand it,
the consensus is:
Event / non-blocking threads
These are for either event based processing or for computations that are
non-blocking but are in
the direct line of data flow in a transaction. If a transaction can't
proceed until the
computation is done there is no benefit to putting it on another thread.
Blocking threads
For blocking or long lasting computations. These would run at a lower thread
priority than event
threads.
In current terminology these would correspond to ``ET_NET`` and ``ET_TASK``
threads. The more specialized threads would be shifted to use the blocking
threads (e.g. config reload). It would be good to move to more of a grouped
dispatch model where a set of task threads is associated with a specific
NUMA node and those threads share a dispatch queue.
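A minimal sketch of the grouped dispatch idea: one queue shared by the task
threads of a group (one group per NUMA node; the actual CPU pinning, e.g.
via pthread_setaffinity_np, is omitted). This only illustrates the shape,
it is not proposed code::

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // One dispatch queue shared by the task threads of a single group.
    class TaskGroup {
    public:
      explicit TaskGroup(int nthreads)
      {
        for (int i = 0; i < nthreads; ++i) {
          workers_.emplace_back([this] { run(); });
        }
      }

      ~TaskGroup()
      {
        {
          std::lock_guard<std::mutex> lock(mutex_);
          done_ = true;
        }
        cv_.notify_all();
        for (auto &t : workers_) {
          t.join();
        }
      }

      // Blocking / long running work (e.g. a config reload) goes here rather
      // than onto the event threads.
      void dispatch(std::function<void()> job)
      {
        {
          std::lock_guard<std::mutex> lock(mutex_);
          jobs_.push(std::move(job));
        }
        cv_.notify_one();
      }

    private:
      void run()
      {
        for (;;) {
          std::function<void()> job;
          {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
            if (done_ && jobs_.empty()) {
              return;
            }
            job = std::move(jobs_.front());
            jobs_.pop();
          }
          job();
        }
      }

      std::mutex mutex_;
      std::condition_variable cv_;
      std::queue<std::function<void()>> jobs_;
      std::vector<std::thread> workers_;
      bool done_ = false;
    };

A group like this would be created per NUMA node, with blocking work
dispatched to the group local to the data it touches.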
Phil has a scheme where each AIO processor has an event file descriptor
(``event_fd``). When a continuation schedules I/O it tells the event loop to
wait on the ``event_fd`` with itself as the associated continuation. The AIO
processor, upon completion of the I/O operations, just signals the
``event_fd``, which then wakes up the continuation on its original thread
without event scheduling.
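A compressed sketch of that scheme using the kernel primitives involved
(``eventfd`` plus ``epoll``), collapsed into one small program; in ATS the
epoll loop would be the continuation's original event thread and the
signalling side would be the AIO processor::

    #include <sys/epoll.h>
    #include <sys/eventfd.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    int main()
    {
      // The continuation's event loop registers the AIO processor's event_fd
      // and waits on it; in ATS the epoll data would carry the continuation.
      int efd = eventfd(0, EFD_NONBLOCK);
      int ep  = epoll_create1(0);
      epoll_event ev{};
      ev.events  = EPOLLIN;
      ev.data.fd = efd;
      epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev);

      // Stand-in for the AIO processor: when the I/O operations complete it
      // just signals the event_fd; no cross-thread event scheduling needed.
      std::thread aio([efd] {
        uint64_t one = 1;
        if (write(efd, &one, sizeof(one)) < 0) {
          perror("write");
        }
      });

      // The original thread wakes up and "resumes the continuation" here.
      epoll_event out{};
      if (epoll_wait(ep, &out, 1, -1) == 1) {
        uint64_t completions = 0;
        if (read(out.data.fd, &completions, sizeof(completions)) > 0) {
          std::printf("AIO completion signalled (%llu)\n",
                      static_cast<unsigned long long>(completions));
        }
      }

      aio.join();
      close(ep);
      close(efd);
      return 0;
    }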
Visions
-------
One goal that came up again like a bad dinner is the idea of being able to run
two cooperating ATS
instances. The purpose is to do a smooth(er) transition between configurations
of ATS that require a
restart. This involves some close coordination between the processes.
* ``SO_REUSEPORT`` support would need to be implemented so that both
processes can accept connections on the same ports (see the sketch after
this list).
* Many data structures would need to be moved to shared memory. This is not as
bad as it sounds
because any real deployment of ATS doesn't tolerate paging or swapping so
all memory is de facto
wired already.
* Sharing the cache is probably the most challenging issue. One approach that
could simplify the
implementation is restricting writing to only one ATS process. This would in
turn require
* A coordination protocol between the instances to handle transfer of
control so cache writes do
not have to wait for process start.
* Potentially transfer writes from the old process to the starting process.
* Locks in shared memory? This would involve sharing the OpenDirEntry
objects in addition to the
actual stripe directories.
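For the ``SO_REUSEPORT`` piece, a small sketch of what each cooperating
instance would do: both processes bind their own listening socket to the
same port and the kernel distributes incoming connections between them. The
port number is arbitrary; run two copies to see the sharing::

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdio>

    int main()
    {
      // Each cooperating instance opens its own listener on the same port;
      // with SO_REUSEPORT the kernel spreads new connections across them.
      int fd = socket(AF_INET, SOCK_STREAM, 0);
      int on = 1;
      if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) != 0) {
        perror("SO_REUSEPORT");
        return 1;
      }

      sockaddr_in addr{};
      addr.sin_family      = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port        = htons(8080); // arbitrary port for the sketch
      if (bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) != 0 ||
          listen(fd, SOMAXCONN) != 0) {
        perror("bind/listen");
        return 1;
      }

      std::printf("pid %d listening on :8080; start a second copy to share it\n",
                  static_cast<int>(getpid()));
      int client = accept(fd, nullptr, nullptr);
      if (client >= 0) {
        close(client);
      }
      close(fd);
      return 0;
    }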