----- Original Message ----- > Updated Branches: > refs/heads/master 2b052f35b -> 0c953e129 > > > Doc: Clean up in cache architecture, added information about stripe > assignment initialization. > > > Project: http://git-wip-us.apache.org/repos/asf/trafficserver/repo > Commit: http://git-wip-us.apache.org/repos/asf/trafficserver/commit/0c953e12 > Tree: http://git-wip-us.apache.org/repos/asf/trafficserver/tree/0c953e12 > Diff: http://git-wip-us.apache.org/repos/asf/trafficserver/diff/0c953e12 > > Branch: refs/heads/master > Commit: 0c953e129225ed6db59a97d293d26ecae3f22c92 > Parents: 2b052f3 > Author: Alan M. Carroll <a...@network-geographics.com> > Authored: Wed Dec 18 14:55:20 2013 -0600 > Committer: Alan M. Carroll <a...@network-geographics.com> > Committed: Wed Dec 18 14:55:20 2013 -0600 > > ---------------------------------------------------------------------- > doc/arch/cache/cache-arch.en.rst | 148 ++++++++++++++++++++++------------ > doc/glossary.en.rst | 42 +++++++++- > 2 files changed, 139 insertions(+), 51 deletions(-) > ---------------------------------------------------------------------- > > > http://git-wip-us.apache.org/repos/asf/trafficserver/blob/0c953e12/doc/arch/cache/cache-arch.en.rst > ---------------------------------------------------------------------- > diff --git a/doc/arch/cache/cache-arch.en.rst > b/doc/arch/cache/cache-arch.en.rst > index 0018c3a..44b5a75 100644 > --- a/doc/arch/cache/cache-arch.en.rst > +++ b/doc/arch/cache/cache-arch.en.rst > @@ -25,9 +25,8 @@ Introduction > > In addition to an HTTP proxy, |ATS| is also an HTTP cache. |TS| can cache > any octet stream although it currently > supports only those octet streams delivered by the HTTP protocol. When such > a stream is cached (along with the HTTP > -protocol headers) it is termed an *object* in the cache. Each object is > identified by a globally unique value called a > -*cache key*. By default this is generated by taking the `MD5 hash > <http://www.openssl.org/docs/crypto/md5.html>`_ of the > -URI used to retrieve the content from the origin server. > +protocol headers) it is termed an :term:`object <cache object>` in the > cache. Each object is identified by a globally > +unique value called a :term:`cache key`. > > The purpose of this document is to describe the basic structure and > implementation details of the |TS| cache. > Configuration of the cache will be discussed only to the extent needed to > understand the internal mechanisms. This > @@ -41,20 +40,28 @@ different ways than they are used in the code in an > attempt to create some consi > Cache Layout > ~~~~~~~~~~~~ > > -The first step in understanding cache operations is to understand the data > structures and layout of the cache. > +The following sections describe how persistent cache data is structured. > |TS| treats its persisent storage an > +undifferentiated collection of bytes, assuming no other structure to it. In > particular it does not use the file system > +of the host operating system. If a file is used it is used only to mark out > the set of bytes to be used. > > Cache storage > ============= > > -The raw storage for the |TS| cache is configured in :file:`storage.config`. > Each line in the file defines a :term:`cache span` which is treated as an > undifferentiated unit of storage. > +The raw storage for the |TS| cache is configured in :file:`storage.config`. > Each line in the file defines a :term:`cache > +span` which is treated as a uniform persistent store. > > .. figure:: images/cache-spans.png > :align: center > > Two cache spans > > -This storage organized in to a set of :term:`cache volume`\ s which are > defined in :file:`volume.config`. Cache volumes > -can be defined by a percentage of the total storage or an absolute amount of > storage. By default each cache volume is spread across all of the cache > spans for robustness. The intersection of a cache volume and a cache span is > a :term:`cache stripe`. Each cache span is divided in to cache stripes and > each cache volume is a collection of those stripes. > +This storage organized in to a set of :term:`cache volume`\ s which are
in to, not into? > defined in :file:`volume.config` for the > +purposes of the administrator. These are the units that used for all other > administator level configuration. > + > +Cache volumes can be defined by a percentage of the total storage or an > absolute amount of storage. By default each > +cache volume is spread across all of the cache spans for robustness. The > intersection of a cache volume and a cache span > +is a :term:`cache stripe`. Each cache span is divided in to cache stripes > and each cache volume is a collection of those > +stripes. > > If the cache volumes for the example cache spans were defined as > > @@ -66,12 +73,12 @@ then the actual layout would look like > .. image:: images/cache-span-layout.png > :align: center > > -A cached object is stored entirely in a single stripe, and therefore in a > single cache span - objects are never split > -across cache volumes. Objects are assigned to a stripe (and hence to a cache > volume) automatically based on a hash of > -the URI used to retrieve the object from the origin server. It is possible > to configure this to a limited extent in > -:file:`hosting.config` which supports content from specific host or domain > to be stored on specific volumes. In > -addition, as of version 4.0.1 it is possible to control which cache spans > (and hence, which cache stripes) are contained what is the benefit of that? > -in a specific cache volume. > +Cache stripes are the fundamental unit of cache for the implementation. A > cached object is stored entirely in a single > +stripe, and therefore in a single cache span - objects are never split > across cache spans or volumes. Objects are > +assigned to a stripe (and hence to a cache volume) automatically based on a > hash of the URI used to retrieve the object > +from the origin server. It is possible to configure this to a limited extent > in :file:`hosting.config` which supports > +content from specific host or domain to be stored on specific cache volumes. > In addition, as of version 4.0.1 it is > +possible to control which cache spans (and hence, which cache stripes) are > contained in a specific cache volume. > > The layout and structure of the cache spans, the cache volumes, and the > cache stripes that compose them are derived > entirely from the :file:`storage.config` and :file:`cache.config` and is > recomputed from scratch when the > @@ -108,9 +115,9 @@ Content in a stripe is tracked via a directory. We call > each element of the dire > represented by :cpp:class:`Dir`. Each entry refers to a chunk of contiguous > storage in the cache. These are referred to > variously as "fragments", "segments", "docs" / "documents", and a few other > things. This document will use the term > "fragment" as that is the most common reference in the code. The term "Doc" > (for :cpp:class:`Doc`) will be used to refer > -to the header data for a fragment. Overall the directory is treated as a > hash with a "cache ID" as the key. A cache ID > -is a 128 bit (16 byte) value generated in various ways depending on context. > This ID is reduced and used as an index in > -to the directory to locate an entry in the standard way. > +to the header data for a fragment. Overall the directory is treated as a > hash with the :term:`cache ID` as the key. See > +:ref:`directory probing <cache-directory-probe>` for how the cache ID is > used to locate a directory entry. The cache ID > +is in turn computed from a :term:`cache key` which by default is the URL of > the content. > > The directory is used as a memory resident structure which means a directory > entry is as small as possible (currently 10 > bytes). This forces some compromises on the data that can be stored there. > On the other hand this means that most cache > @@ -126,44 +133,61 @@ there is enough memory to run |TS| with an empty cache > there is enough to run it > :align: center > > Each entry stores an offset in the stripe and a size. The size stored in the > directory entry is an :ref:`approximate > -size <dir-size>` which is at least as big as the actual data. Exact size > data is stored in the fragment header on disk. > +size <dir-size>` which is at least as big as the actual data in the > fragment. Exact size data is stored in the fragment > +header on disk. > > .. note:: > > - Data in HTTP headers cannot be examined without disk I/O. This includes > the original URL for the object. The original > - source of the cache ID is not stored anywhere. > + Data in HTTP headers cannot be examined without disk I/O. This includes > the original URL for the object. The cache > + key is not stored explicitly and therefore cannot be reliably retrieved. Huh? What does that mean for all those header_* plugins? > +The directory is a hash table that uses `chaining > +<http://en.wikibooks.org/wiki/Data_Structures/Hash_Tables#Collision_resolution>`_ > for collision resolution. Because each > +entry is small they are used directly as the list header of the hash bucket. > > .. _dir-segment: > .. _dir-bucket: > > -The entries in a directory are grouped. The first level grouping is a > *bucket*. This is a fixed number (currently 4 - > -defined as ``DIR_DEPTH``) of entries. The index generated from a cache ID is > used as a bucket index (not an entry index). > -Buckets are grouped in to *segments*. All segments in a stripe have the same > number of buckets. The number of segments > -in a stripe is chosen so that each segment has as many buckets as possible > without exceeding 65535 (2\ :sup:`16`\ -1) > -entries in a segment. Note that all segments in the same stripe will have > the same number of buckets. > +Chaining is implemented by imposing grouping structures on the entries in a > directory. The first level grouping is a > +:term:`directory bucket`. This is a fixed number (currently 4 - defined as > ``DIR_DEPTH``) of entries. This > +serves to define the basic hash buckets with the first entry in each cache > bucket serving as the root of the hash > +bucket. > + > +.. note:: > + > + The term "bucket" is used in the code to mean both the conceptual bucket > for hashing and for a structural grouping > + mechanism in the directory and so these will be qualified as needed to > distinguish them. The unqualified term > + "bucket" is almost always used to mean the structural grouping in the > directory. > + > +Directory buckets are grouped in to :term:`segments <directory segment>`. > All segments in a stripe have the same number of > +buckets. The number of segments in a stripe is chosen so that each segment > has as many buckets as possible without > +exceeding 65535 (2\ :sup:`16`\ -1) entries in a segment. > > .. figure:: images/dir-segment-bucket.png > :align: center > > -Each entry has a previous and next index value which is used to link the > entries in the same segment. The index size is > -16 bits which suffices to index any entry in the same segment. The stripe > header contains an array of entry indices > -which are used as the roots of a free list. When a stripe is initialized the > first entry in each bucket is zeroed > -(marked unused) and all other entries are put in the corresponding segment > free list rooted in the stripe header. In > -essence the first element of each fixed bucket is used as a root for that > bucket. The other entries in the fixed bucker > -are preferentially preferred for adding to that bucket but this is not > required. The segment free lists are initialized > -such that the extra bucket entries are added in order - all the seconds, > then the thirds, then the fourths. Because the > -free lists are FIFOs this means extra entries will be selected from the > fourth entries across all the buckets first, > -then the thirds, etc. This maximizes locality for bucket searching. > +Each directory entry has a previous and next index value which is used to > link entries in the same segment. Because no > +segment has more than 65535 entries 16 bits suffices for storing the index > values. The stripe header contains an array > +of entry indices which are used as the roots of entry free lists, one for > each segment. Active entries are stored via > +the bucket structure. When a stripe is initialized the first entry in each > bucket is zeroed (marked unused) and all > +other entries are put in the corresponding segment free list in the stripe > header. This means the first entry of each > +directory bucket is used as the root of a hash bucket and is therefore > marked unused rather than being put a free list. rather than being put *in* OR *on* a free list. > +The other entries in the directory bucket are preferentially preferred for "preferentially preferred" still makes me stumble.. is this intended? > adding to the corresponding hash bucket but > +this is not required. The segment free lists are initialized such that the > extra bucket entries are added in order - all > +the seconds, then the thirds, then the fourths. Because the free lists are > FIFOs this means extra entries will be > +selected from the fourth entries across all the buckets first, then the > thirds, etc. When allocating a new directory > +entry in a bucket the entries are searched from first to last, which > maximizes bucket locality (that is, cache IDs that > +map to the same hash bucket will also tend to use the same directory > bucket). > > .. figure:: images/dir-bucket-assign.png > :align: center > > Entries are removed from the free list when used and returned when no longer > in use. When a fragment needs to be put in > -to the directory the cache ID is used to locate a bucket (which also > determines the segment). If the first entry in the > -bucket is marked unused, it is used. If not then the other entries in the > bucket are searched and if any are on the free > -list, that entry is used. If none are available then the first entry on the > segment free list is used. This entry is > -attached to the bucket via the same next and previous indices used for the > free list so that it can be found when doing > -a lookup of a cache ID. > +to the directory the cache ID is used to locate a hash bucket (which also > determines the segment and directory bucket). > +If the first entry in the directory bucket is marked unused, it is used. If > not then the other entries in the bucket are > +searched and if any are on the free list, that entry is used. If none are > available then the first entry on the segment > +free list is used. This entry is attached to the hash bucket via the same > next and previous indices used for the free > +list so that it can be found when doing a lookup of a cache ID. > > Storage Layout > -------------- > @@ -435,10 +459,6 @@ maximum size that is at least as large as the fragment. > That will indicate the v > granularity of the approximate size. That represents the largest possible > amount of wasted disk I/O when the fragment is > read from disk. > > -.. note:: The cache index entry size is used only for reading the fragment > from disk. The actual size on disk, and the > -amount of cache space consumed, is the actual size of the content rounded up > to the disk sector size (default 512 > -bytes). > - > .. index:: DIR_DEPTH, index segment, index buckets > > The set of index entries for a volume are grouped in to *segments*. The > number of segments for an index is selected so > @@ -450,7 +470,7 @@ standard hash table way, giving somewhat less than 2^14 > buckets per segment. > > .. [#] The comment in :file:`records.config` is simply wrong. > > -.. _dir-probe: > +.. _cache-directory-probe: > > Directory Probing > ----------------- > @@ -552,7 +572,7 @@ There are three basic steps to a cache lookup. > > The cache key is used as a hash key in to an array of :cpp:class:`Vol` > instances. The construction and arrangement of this array is the essence > of how volumes are assigned. > > -#. The cache stripe directory :ref:`is probed <dir-probe>` using the index > key computed from the cache key. > +#. The cache stripe directory :ref:`is probed <cache-directory-probe>` using > the index key computed from the cache key. > > Various other lookaside directories are checked as well, such as the > :ref:`aggregation buffer <aggregation-buffer>`. > > @@ -744,7 +764,7 @@ This interacts with the "one at a time" strategy of the > aggregation write logic. > I do not understand the extra key list that is present in an evacuation > block. It is labeled as needed for > "collisions" but I am unclear on what might be colliding. The bucket > entries are stored and matched by stripe offset > but if two fragments collide on their offset, only one can be valid. > Based on how :ref:`directory probing > - <dir-probe>` works and the logic of :cpp:func:`evacuate_fragments()` it > appears that rather than determine which > + <cache-directory-probe>` works and the logic of > :cpp:func:`evacuate_fragments()` it appears that rather than determine which > entry in a directory bucket is the correct one, all of them are marked > for evacuation (thereby handling > "collisions"). However, each one could have a distinct fragment size and > that is set for all of the reads by the > first fragment found in the directory. The intent seems to be to read all > fragments that collide at the same starting > @@ -774,17 +794,45 @@ basically four types, > * Disk > * Raw device > > -After setting all the `Span` instances they are grouped by device id to > internal linked lists attached to the > +After creating all the `Span` instances they are grouped by device id to > internal linked lists attached to the > :cpp:member:`Store::disk` array [#]_. Spans that refer to the same > directory, disk, or raw device are coalesced in to a > -single span. Spans that refer to the same file with overlapping offsets are > also coalesced [#]_. This is all done in :c:func:`ink_cache_init()` called > during startup. > +single span. Spans that refer to the same file with overlapping offsets are > also coalesced [#]_. This is all done in > +:c:func:`ink_cache_init()` called during startup. > + > +.. note:: The span logic is also used by the HostDB and more than one > otherwise inexplicable feature is provided by the span logic for that > module. > > After configuration initialization the cache processor is started by calling > :ccp:func:`CacheProcessor::start()`. This > does a number of things. > > For each valid span, an instance of :cpp:class:`CacheDisk` is created. This > class is a continuation and so can be used > -to perform potentially blocking operations on the span. This what is passed > to the AIO threads to be called when an I/O > -operation completes. These are then dispatched to AIO threads to perform > storage unit initialization. After all of those > -have completed, the resulting storage is distributed across the volumes in > :c:func:`cplist_reconfigure`. The :cpp:class:`CacheVol` instances are > created at this time. > +to perform potentially blocking operations on the span. The primary use of > these is to be passed to the AIO threads as > +the callback when an I/O operation completes. These are then dispatched to > AIO threads to perform storage unit > +initialization. After all of those have completed, the resulting storage is > distributed across the volumes in > +:c:func:`cplist_reconfigure`. The :cpp:class:`CacheVol` instances are > created at this time. > + > +Cache stripe assignment setup is done once all stripes have initialized > (that is, the stripe header information has been > +successfully read from disk for all stripes). The assignment information is > stored as an array of indices. These are > +indices in to an array of stripes. Both the assignment and the stripe arrays > are stored in an instance of > +:cpp:class:`CacheHostRecord`. Assignment initialization consists of > populating the assignment array which is much larger > +than the stripe array. > + > +There is an instance of :cpp:class:`CacheHostRecord` for each line in > :file:`hosting.config` and one "generic" record. > +For the configured instances the set of stripes is determined from the cache > volume specified in the line. If no lines are specified all stripes are > placed in the generic record, otherwise only those stripes marked as default > are placed in the generic record. Line break at max 120, please. > + > +.. note:: If hosting records are specified it is an error to not specify at > least one default cache volume. > + > +The assignment table is initialized in :c:func:`build_vol_hash_table` which > is called for each > +:cpp:class:`CacheHostRecord` instance. For each strip in the host record a is this supposed to be "strip"? > sequence of pseudo-random numbers is > +generated, starting with the folded hash of the stripe hash identifier, > which is the device path followed by the skip > +and size values for that stripe, making it unique. This also makes the > sequence deterministic for any particular stripe. > +Each stripe gets one number in its sequence for every `VOL_HASH_ALLOC_SIZE` > (8 MB currently) of storage. These numbers are paired with > +the stripe index, combined across all stripes, then sorted by the random > values. The resulting array is sampled for > +every slot in the stripe assignment table by dividing the maximum random > value by the size of the assignment table and > +using the value midway between each multiple of the result of the division. > The coalesced psuedo-random sequence is > +scanned for each sample in turn and the first number not greater than the > sample is found. The stripe associated with > +that value is used for that assignment table entry. > + > +While this procedure is determinstic it is sensitive to initial conditions, > including the size of each stripe. > > .. rubric:: Footnotes > > > http://git-wip-us.apache.org/repos/asf/trafficserver/blob/0c953e12/doc/glossary.en.rst > ---------------------------------------------------------------------- > diff --git a/doc/glossary.en.rst b/doc/glossary.en.rst > index e103fee..94996db 100644 > --- a/doc/glossary.en.rst > +++ b/doc/glossary.en.rst > @@ -44,11 +44,34 @@ Glossary > > cache stripe > A homogenous persistent store for the cache in a single :term:`cache > span`. A stripe always resides > - entirely on a single physical device and is treated as an > undifferentiated span of bytes. > + entirely on a single physical device and is treated as an > undifferentiated span of bytes. This is the smallest > + independent unit of storage. > > cache span > The physical storage described by a single line in > :file:`storage.config`. > > + cache key > + A byte sequence that is a globally unique identifier for an > :term:`object <cache object>` in the cache. By default > + the URL for the object is used. > + > + cache ID > + A 128 bit value used as a fixed sized identifier for an object in the > cache. This is computed from the > + :term:`cache key` using the `MD5 hashing function > <http://www.openssl.org/docs/crypto/md5.html>`_. Lies :P > + > + cache tag > + The bottom few bits (12 currently) of the :term:`cache ID`. This is > used in the :ref:`cache directory <cache-directory>` > + for a preliminary identity check before going to disk. > + > + cache object > + The minimal self contained unit of data in the cache. Cache objects > are the stored version of equivalent content > + streams from an origin server. A single object can have multiple > variants called :term:`alternates <alternate>`. > + > + alternate > + A variant of a :term:`cache object`. This was originally created to > handle the `VARY mechanism > + <http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.44>`_ Should we reference HTTP-bis here? > but has since been used for additional > + purposes. All alternates of an object must be equivalent in some > manner, that is they are alternate forms of the > + same stream. The most common example is having normal and compressed > versions of the stream. > + > storage unit > Obsolete term for :term:`cache span`. > > @@ -59,3 +82,20 @@ Glossary > > write cursor > The location in a :term:`cache stripe` where new data is written. > + > + directory segment > + A contiguous group of :term:`buckets <directory bucket>`. Each > :term:`cache stripe` has a set of segments all of > + which have the same number of buckets, although the number of buckets > per segment can vary between cache stripes. > + Segments are administrative in purpose to minimize the size of free > list and hash bucket pointers. > + > + directory bucket > + A contiguous fixed sized group of :term:`directory entries <directory > entry>`. This is used for hash bucket > + maintenance optimization. > + > + directory entry > + An in memory entry that describes a :term:`cache fragment`. > + > + cache fragment > + The unit of storage in the cache. All reads from the cache always read > exactly one fragment. Fragments may be > + written in groups, but every write is always an integral number of > fragments. Each fragment has a corresponding > + :term:`directory entry` which describes its location in the cache > storage. > > -- Igor Galić Tel: +43 (0) 664 886 22 883 Mail: i.ga...@brainsware.org URL: http://brainsware.org/ GPG: 8716 7A9F 989B ABD5 100F 4008 F266 55D6 2998 1641