On Mon, Jan 29, 2018 at 3:26 PM, R0b0t1 <r03...@gmail.com> wrote: > On Monday, January 29, 2018, Michał Górny <mgo...@gentoo.org> wrote: > > Here's an updated version. I've tried to incorporate most > > of the feedback so far. > > > > > > --- > > GLEP: 75 > > Title: Split distfile mirror directory structure > > Author: Michał Górny <mgo...@gentoo.org>, > > Robin H. Johnson <robb...@gentoo.org> > > Type: Standards Track > > Status: Draft > > Version: 1 > > Created: 2018-01-26 > > Last-Modified: 2018-01-27 > > Post-History: 2018-01-27 > > Content-Type: text/x-rst > > --- > > > > Abstract > > ======== > > This GLEP describes the procedure for splitting the distfiles on mirrors > > into multiple directories with the goal of reducing the number of files > > in a single directory. > > > > > > Motivation > > ========== > > At the moment, both the package manager and Gentoo mirrors use flat > > directory structure to store files. While this solution usually works, > > it does not scale well. Directories with large number of files usually > > have significant performance penalty, unless using filesystems > > specifically designed for that purpose. > > > > According to the Gentoo repository state at 2018-01-26 16:23, there > > was a total of 62652 unique distfiles in the repository. While > > the users realistically hit around 10% of that, distfile mirrors often > > hold even more files --- more so if old distfiles are not wiped > > immediately. > > > > While all filesystems used on Linux boxes should be able to cope with > > a number that large, they may suffer a performance penalty with even > > a few thousand files. Additionally, if mirrors enable directory indexes > > then generating the index imposes both a significant server overhead > > and a significant data transfer. At this moment, the index > > of distfiles.gentoo.org has around 17 MiB. > > > > Splitting the distfiles into multiple directories makes it possible > > to avoid those problems by reducing the number of files in a single > > directory. For example, splitting the forementioned set of distfiles > > into 16 directories that are roughly balanced allows to reduce > > the number of files in a single directory to around 4000. Splitting > > them further into 256 directories (16x16) results in 200-300 files > > per directory which should avoid any performance problems long-term, > > even assuming 300% growth of number of distfiles. > > > > > > Specification > > ============= > > Mirror layout file > > ------------------ > > A mirror adhering to this specification should include a ``layout.conf`` > > file in the top distfile directory. This file uses the format > > derived from the freedesktop Desktop Entry Specification file format > > [#DESKTOP_FORMAT]_. > > > > Before using each Gentoo mirror, the package manager should attempt > > to fetch (update) its ``layout.conf`` file and process it to determine > > how to use the mirror. If the file is not present, the package manager > > should behave as if it were empty. > > > > The package manager should recognize the sections and keys listed below. > > It should ignore any unrecognized sections or keys --- the format > > is intended to account for future extensions. > > > > This specification currently defines one section: ``[structure]``. > > This section defines one or more repository structure definitions > > using non-negative sequential integer keys. The definition with > > the ``0`` key is the most preferred structure. The package manager > > should ignore any formats it does not recognize. If this section > > is not present, the package manager should behave as if only ``flat`` > > structure were specified. > > > > The following structure definitions are supported: > > > > * ``flat`` to indicate the traditional flat structure where all > > distfiles are located in the top directory, > > > > * ``filename-hash <algorithm> <cutoffs>`` to indicate the `filename > > hash structure`_ explained below. > > > > > > Filename hash structure > > ----------------------- > > When using the filename hash structure, the distfiles are split > > into directories whose names are derived from the hash of distfile > > filename. This structure has two parameters: *algorithm name* > > and *cutoffs* list. > > > > The algorithm name must correspond to a valid Manifest hash name. > > An informational list of hashes is included in GLEP 74 [#GLEP74]_, > > and the policies for introducing new hashes are covered by GLEP 59 > > [#GLEP59]_. > > > > The cutoffs list specifies one or more integers separated by colons > > (``:``), indicating the number of bits (starting with the most > > significant bit) of the hash used to form subsequent subdirectory names. > > For example, the list of ``2:4`` would indicate that top-level directory > > names are formed using 2 most significant bits of the hash (resulting > > in 2² = 4 directories), and each of this directories would have > > subdirectories formed using the next 4 bits of the hash (resulting > > in 2⁴ = 8 subdirectories each). > > > > The exact algorithm for determining the distfile location follows: > > > > 1. Let the distfile filename be **F**. > > > > 2. Compute the hash of **F** and store its binary value as **H**. > > > > 3. For each integer **C** in cutoff list: > > > > a. Take **C** most significant bits of **H** and store them as **V**. > > > > b. Convert **V** into hexadecimal integer, left padded with zeros > > to **C/4** digits (rounded up) and append it to the path, followed > > by the path separator. > > > > c. Shift **H** left **C** bits. > > > > 4. Finally, append **F** to the obtained path. > > > > In particular, note that when using nested directories > > the subdirectories do not repeat the hash bits used in parent directory. > > > > > > Migrating mirrors to the hashed structure > > ----------------------------------------- > > Since all distfile mirrors sync to the master Gentoo mirror, it should > > be enough to perform all the needed changes on the master mirror > > and wait for other mirrors to sync. The following procedure > > is recommended: > > > > 1. Include the initial ``layout.conf`` listing only ``flat`` layout. > > > > 2. Create the new structure alongside the flat structure. Wait for > > mirrors to sync. > > > > 3. Once all mirrors receive the new structure, update ``layout.conf`` > > to list the ``filename-hash`` structure. > > > > 4. Once a version of Portage supporting the new structure is stable long > > enough, remove the fallback ``flat`` structure from ``layout.conf`` > > and duplicate distfiles. > > > > This implies that during the migration period the distfiles will > > be stored duplicated on the mirrors and therefore will occupy twice > > as much space. Technically, this could be avoided either by using > > hard links or symbolic links. > > > > The hard link solution allows us to save space on the master mirror. > > Additionally, if ``-H`` option is used by the mirrors it avoids > > transferring existing files again. However, this option is known > > to be expensive and could cause significant server load. Without it, > > all mirrors need to transfer a second copy of all the existing files. > > > > The symbolic link solution could be more reliable if we could rely > > on mirrors using the ``--links`` rsync option. Without that, symbolic > > links are not transferred at all. > > > > > > Using hashed structure for local distfiles > > ------------------------------------------ > > The hashed structure defined above could also be used for local distfile > > storage as used by the package manager. For this to work, the package > > manager authors need to ensure that: > > > > a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary > > directory where distfiles specific to the package are linked > > in a flat structure. > > > > b. All tools are updated to support the nested structure. > > > > c. The package manager provides a tool for users to easily manipulate > > distfiles, in particular to add distfiles for fetch-restricted > > packages into an appropriate subdirectory. > > > > For extended compatibility, the package manager may support finding > > distfiles in flat and nested structure simultaneously. > > > > > > Rationale > > ========= > > Algorithm for splitting distfiles > > --------------------------------- > > The possible algorithms were considered with the following goals > > in mind: > > > > - the number of files in a single directory should not exceed 1000, > > > > - the total size of files in a single directory is not considered > > relevant, > > > > - the solution should preferably be future-proof, > > > > - moving distfiles should be avoided once it is deployed. > > > > It should also be noted that at this moment the package having most > > distfiles in Gentoo at the time is dev-texlive/texlive-latexextra, > > with the number of 8556 distfiles. All of them start with a common > > prefix of ``texlive-module-``. This specific prefix is used by a total > > of 23435 distfiles. > > > > In the original debate that occurred in bug #534528 [#BUG534528]_ > > and the mailing list review of the initial version of this GLEP [#ML1]_, > > four fundamental ideas for splitting distfiles were listed: > > > > a. using initial portion of filename, > > > > b. using initial portion of file hash, > > > > c. using initial portion of filename hash, > > > > d. using package category (and package name). > > > > The initial filename idea was to use the first character of filename, > > possibly followed by a longer part which was the idea historically > > used e.g. by PyPI Python package hosting. Its main advantage is > > simplicity. The users can easily determine the correct subdirectory > > by just looking at the distfile name. Sadly, this solution is not only > > very uneven but does not solve the problem. As mentioned above, > > the TeΧ Live packages share a long common prefix that make it impossible > > to split it properly with other packages on fixed-length prefixes. > > > > This idea has been followed by an adaptive proposal by Andrew Barchuk > > [#ADAPTIVE_FILENAME]_. In this proposal, the filenames are not strictly > > mapped to groups by a common prefix but instead each group contains > > all files between two prefixes being used (like in a dictionary). > > However, it has been pointed out that while this option can provide > > very even results initially, it is impossible to predict how it would > > be affected by future distfile changes and there will be a risk of > > needing to change the groups in the future. Furthermore, it is > > relatively complex and requires explicitly listing or obtaining used > > groups. > > > > Another option was to use an initial portion of distfile hashes. Its > > main advantage is that cryptographic hash algorithms can provide > > a more balanced split with random data. Furthermore, since hashes are > > stored in Manifests using them has no cost for users. However, this > > solution has three disadvantages: > > > > 1. Not all files in the distfile tree are covered by package Manifests. > > Additional files are injected into the mirrors, and those will > > not have a clearly-defined location. > > > > 2. User-provided distfiles (e.g. for fetch-restricted packages) with > > hash mismatches would be placed in the wrong subdirectory, > > potentially causing confusing errors. > > > > 3. The hash values are unknown for newly-downloaded distfiles, so > > ``repoman`` (or an equivalent tool) would have to use a temporary > > directory before locating the file in appropriate subdirectory. > > > > Using filename hashes has proven to provide a similar balance to using > > file hashes. Furthermore, since filenames are known up front this > > solution does not suffer from the listed problems. While hashes need > > to be computed manually, hashing short string should not cause > > any performance problems. > > > > Jason Zaman has suggested to use package categories (and package names) > > [#PKGNAME]_. However, this solution has multiple problems: > > > > a. it does not solve the problem for large packages such as TeΧ Live, > > > > b. it introduces many unnecessarily small directories, > > > > c. it requires an explicit knowledge of which package distfiles > > belong to, > > > > d. it does not provide an explicit solution to the problem of distfiles > > shared by multiple packages, > > > > e. it does not provide a solution to the problem of injected distfiles. > > > > All the options considered, the filename hash solution was selected > > as one that solves all the forementioned problems while introducing > > relatively low complexity and being reasonably future-proof. > > > > .. figure:: glep-0075-extras/by-filename.png > > > > Distribution of distfiles by first character of filenames > > (note: y axis is on log scale) > > > > .. figure:: glep-0075-extras/by-csum.png > > > > Distribution of distfiles by first hex-digit of checksum > > (x --- content checksum, + --- filename checksum) > > > > .. figure:: glep-0075-extras/by-csum2.png > > > > Distribution of distfiles by two first hex-digits of checksum > > (x --- content checksum, + --- filename checksum) > > > > > > Layout file > > ----------- > > The presence of control file has been suggested in the original > > discussion. Its main purpose is to let package managers cleanly handle > > the migration and detect how to correctly query the mirrors throughout > > it. Furthermore, it makes future changes easier. > > > > The format lines specifically mean to hardcode as little about > > the actual algorithm as possible. Therefore, we can easily change > > the hash used or the exact split structure without having to update > > the package managers or even provide a compatibility layout. > > > > The file is also open for future extensions to provide additional mirror > > metadata. However, no clear use for that has been determined so far. > > > > > > Hash algorithm > > -------------- > > The hash algorithm support is fully deferred to the existing code > > in the package managers that is required to handle Manifests. > > In particular, it is recommended to reuse one of the hashes that are > > used in Manifest entries at the time. This avoids code duplication > > and reuses an existing mechanism to handle hash upgrades. > > > > During the discussion, it has been pointed that this particular use case > > does not require a cryptographically strong hash and a faster algorithm > > could be used instead. However, given the short length of hashed > > strings performance is not a problem, and speed does not justify > > the resulting code duplication. > > > > It has also been pointed out that e.g. the BLAKE2 hash family provides > > the ability of creating arbitrary length hashes instead of truncating > > the standard-length hash. However, not all implementations of BLAKE2 > > support that and relying on it could reduce portability for no apparent > > gain. > > > > > > Backwards Compatibility > > ======================= > > Mirror compatibility > > -------------------- > > The mirrored files are propagated to other mirrors as opaque directory > > structure. Therefore, there are no backwards compatibility concerns > > on the mirroring side. > > > > Backwards compatibility with existing clients is detailed > > in `migrating mirrors to the hashed structure`_ section. Backwards > > compatibility with the old clients will be provided by preserving > > the flat structure during the transitional period. > > > > The new clients will fetch the ``layout.conf`` file to avoid backwards > > compatibility concerns in the future. In case of hitting an old mirror, > > the package manager will default to the ``flat`` structure. > > > > > > Package manager storage compatibility > > ------------------------------------- > > The exact means of preserving backwards compatibility in package manager > > storage are left to the package manager authors. However, it is > > recommended that package managers continue to support the flat layout > > even if it is no longer the default. The package manager may either > > continue to read files from this location or automatically move them > > to an appropriate subdirectory. > > > > > > Reference Implementation > > ======================== > > TODO. > > > > > > References > > ========== > > .. [#DESKTOP_FORMAT] Desktop Entry Specification: Basic format of the > file > > (https://standards.freedesktop.org/desktop-entry- > spec/latest/ar01s03.html) > > > > .. [#GLEP74] GLEP 74: Full-tree verification using Manifest files: > > Checksum algorithms (informational) > > (https://www.gentoo.org/glep/glep-0074.html#checksum- > algorithms-informational) > > > > .. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications > > (https://www.gentoo.org/glep/glep-0059.html) > > > > .. [#BUG534528] Bug 534528 - distfiles should be sorted into > subdirectories > > of DISTDIR > > (https://bugs.gentoo.org/534528) > > > > .. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory > structure > > (https://archives.gentoo.org/gentoo-dev/message/ > cfc4f8595df2edf9a25ba9ecae2463ba) > > > > .. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges > > for each directory computed in a way to have the files distributed > evenly' > > (https://archives.gentoo.org/gentoo-dev/message/ > 611bdaa76be049c1d650e8995748e7b8) > > > > .. [#PKGNAME] Jason Zamal's reply including 'using the same dir layout > > as the packages themselves) > > (https://archives.gentoo.org/gentoo-dev/message/ > f26ed870c3a6d4ecf69a821723642975) > > > > > > Copyright > > ========= > > This work is licensed under the Creative Commons Attribution-ShareAlike > 3.0 > > Unported License. To view a copy of this license, visit > > http://creativecommons.org/licenses/by-sa/3.0/. > > > > -- > > Best regards, > > Michał Górny > > > > It's going to be hash based? Why? I tried to follow the conversation but > there's now close to 5 of these posts in the mailing list with different > conversations in each. >
The reasoning is embedded in the proposal; search for "Algorithm for splitting distfiles". Read that section. I think it summarizes the space well. -A > > Using filename prefixes is boring and not uniform, but I feel I should > point out that most distfile hosts are still doing fine. Microoptimizing > this seems like wasted effort. > > Cheers, > R0b0t1