On Sat, Jan 27, 2018 at 12:24:39AM +0100, Michał Górny wrote:
> Migrating mirrors to the hashed structure
> -----------------------------------------

> The hard link solution allows us to save space on the master mirror.
> Additionally, if the ``-H`` option is used by the mirrors, it avoids
> transferring existing files again.  However, this option is known
> to be expensive and could cause significant server load.  Without it,
> all mirrors need to transfer a second copy of all the existing files.
> 
> The symbolic link solution could be more reliable if we could rely
> on mirrors using the ``--links`` rsync option.  Without that, symbolic
> links are not transferred at all.

These rsync options might help for mirrors too:
     --compare-dest=DIR      also compare destination files relative to DIR
     --copy-dest=DIR         ... and include copies of unchanged files
     --link-dest=DIR         hardlink to files in DIR when unchanged
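
One caveat: all three options match files at the same relative path
under DIR, so pointing --link-dest at the old flat tree would not, by
itself, dedupe into the new hashed subdirectories.  A mirror could
instead pre-seed the hashed tree locally with hard links, so that a
later plain rsync of that tree has almost nothing left to transfer.
A minimal sketch in Python, assuming the filename-hash layout from the
draft (BLAKE2B with a two-hex-digit cutoff is my guess; the real cutoff
would come from the layout file) and illustrative paths:

    import hashlib
    import os

    FLAT = '/srv/gentoo/distfiles'           # existing flat copy
    HASHED = '/srv/gentoo/distfiles-hashed'  # new hashed tree

    def subdir_for(name, hexdigits=2):
        # Assumed scheme: leading hex digits of BLAKE2B(filename).
        return hashlib.blake2b(name.encode('utf-8')).hexdigest()[:hexdigits]

    for name in os.listdir(FLAT):
        src = os.path.join(FLAT, name)
        if not os.path.isfile(src):
            continue
        destdir = os.path.join(HASHED, subdir_for(name))
        os.makedirs(destdir, exist_ok=True)
        dest = os.path.join(destdir, name)
        if not os.path.exists(dest):
            os.link(src, dest)  # hard link: no extra space used

Since hard links preserve size and mtime, rsync's quick check will skip
those files on the next sync.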

> Using hashed structure for local distfiles
> ------------------------------------------
> The hashed structure defined above could also be used for local distfile
> storage as used by the package manager.  For this to work, the package
> manager authors need to ensure that:
> 
> a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
>    directory where distfiles specific to the package are linked
>    in a flat structure.
> 
> b. All tools are updated to support the nested structure.
> 
> c. The package manager provides a tool for users to easily manipulate
>    distfiles, in particular to add distfiles for fetch-restricted
>    packages into an appropriate subdirectory.
> 
> For extended compatibility, the package manager may support finding
> distfiles in both the flat and nested structures simultaneously.

Trying the nested structure first and then falling back to flat would
make it easy for users who have to download distfiles for
fetch-restricted packages, because then the instructions stay as "move
it to /usr/portage/distfiles".
Or, alternatively, the tool could have a mode which goes through all
files in the base dir and moves each one to where it belongs in the
nested tree.  Then you save everything to the same dir and run
edist --fix; see the sketch below.
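
Roughly, both behaviours together (a sketch only, not a real edist;
it assumes the same BLAKE2B-of-filename scheme with a two-hex-digit
cutoff as above, which is my assumption):

    import hashlib
    import os

    def subdir_for(name, hexdigits=2):
        # Assumed scheme: leading hex digits of BLAKE2B(filename).
        return hashlib.blake2b(name.encode('utf-8')).hexdigest()[:hexdigits]

    def find_distfile(distdir, name):
        # Nested first, then flat, per the fallback idea above.
        for path in (os.path.join(distdir, subdir_for(name), name),
                     os.path.join(distdir, name)):
            if os.path.isfile(path):
                return path
        return None

    def fix_distdir(distdir):
        # The hypothetical "edist --fix": sweep flat files into the tree.
        for name in os.listdir(distdir):
            src = os.path.join(distdir, name)
            if not os.path.isfile(src):
                continue  # skip the hashed subdirectories themselves
            destdir = os.path.join(distdir, subdir_for(name))
            os.makedirs(destdir, exist_ok=True)
            os.rename(src, os.path.join(destdir, name))

    fix_distdir('/usr/portage/distfiles')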

> Rationale
> =========
> Algorithm for splitting distfiles
> ---------------------------------
> In the original debate that occurred in bug #534528 [#BUG534528]_,
> three possible solutions for splitting distfiles were listed:
> 
> a. using initial portion of filename,
> 
> b. using initial portion of file hash,
> 
> c. using initial portion of filename hash.
> 
> The significant advantage of the filename option was simplicity.  With
> that solution, the users could easily determine the correct subdirectory
> themselves.  However, its significant disadvantage was a very uneven
> distribution of data.  In particular, the TeΧ Live packages alone count
> almost 23500 distfiles and all use a common prefix, making it impossible
> to split them further.

Is the filename the original upstream one or the renamed one?  E.g.
with SRC_URI="http://foo/foo.tar -> bar.tar", will it be bar.tar?

I think I'm in favour of using the initial part of the filename
anyway.  Sure, it's not balanced, but it's still a hell of a lot more
balanced than today, and it's really easy.

Another thing I'm wondering is whether we could just use the same dir
layout as the packages themselves.  That would fix TeX Live, since it
consists of a whole lot of separate packages,
e.g. /usr/portage/distfiles/app-cat/pkg/pkg-1.0.tgz.

There is a problem if many packages use the same distfiles (quite
extensive for SELinux: every one of the sec-policy/selinux-* packages
has identical distfiles), so I'm not sure how to deal with that.

This would also make it easy in the future to have the sandbox
restrict access to distfiles outside of that package, if we wanted to
do that.

> The alternate option of using file hash has the advantage of having
> a more balanced split.  Furthermore, since hashes are stored
> in Manifests, using them is zero-cost.  However, this solution has two
> significant disadvantages:
> 
> 1. The hash values are unknown for newly-downloaded distfiles, so
>    ``repoman`` (or an equivalent tool) would have to use a temporary
>    directory before locating the file in the appropriate subdirectory.
> 
> 2. User-provided distfiles (e.g. for fetch-restricted packages) with
>    hash mismatches would be placed in the wrong subdirectory,
>    potentially causing confusing errors.

Not just this; on principle, I also think you should be able to read
an ebuild and compute the URL to download the file from the mirrors
without any extra knowledge (in particular, without having to download
the distfile first).
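
Filename hashing keeps that property.  A minimal sketch (the mirror
base URL here is only an example, and the BLAKE2B/two-hex-digit cutoff
is again my assumption):

    import hashlib

    def mirror_url(filename,
                   base='https://distfiles.gentoo.org/distfiles'):
        # The subdirectory is derivable from the filename alone; no
        # need to fetch the file first to learn its content hash.
        sub = hashlib.blake2b(filename.encode('utf-8')).hexdigest()[:2]
        return '%s/%s/%s' % (base, sub, filename)

    print(mirror_url('foo-1.0.tar.gz'))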

> Using filename hashes has proven to provide a similar balance
> to using file hashes.  Furthermore, since filenames are known up front,
> this solution does not suffer from either of the listed problems.  While
> hashes need to be computed manually, hashing a short string should not
> cause any performance problems.
> 
> .. figure:: glep-0075-extras/by-filename.png
> 
>    Distribution of distfiles by first character of filenames
> 
> .. figure:: glep-0075-extras/by-csum.png
> 
>    Distribution of distfiles by first hex-digit of checksum
>    (x --- content checksum, + --- filename checksum)
> 
> .. figure:: glep-0075-extras/by-csum2.png
> 
>    Distribution of distfiles by two first hex-digits of checksum
>    (x --- content checksum, + --- filename checksum)

Do you have an easy way to calculate how big the distfiles are per
category or per cat/pkg?  I'd be interested to see.
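
My own rough attempt would be to sum the DIST lines from the Manifests
(a sketch; it assumes the usual "DIST <name> <size> <hashes...>" line
format and counts a shared distfile once per package that lists it, so
it overstates categories like sec-policy):

    import collections
    import glob
    import os

    sizes = collections.Counter()
    for manifest in glob.glob('/usr/portage/*/*/Manifest'):
        category = manifest.split(os.sep)[-3]
        with open(manifest) as f:
            for line in f:
                fields = line.split()
                if fields and fields[0] == 'DIST':
                    # fields[2] is the distfile size in bytes
                    sizes[category] += int(fields[2])

    for category, total in sizes.most_common(20):
        print('%-20s %6d MiB' % (category, total // 2**20))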

> Backwards Compatibility
> =======================
> Mirror compatibility
> --------------------
> The mirrored files are propagated to other mirrors as an opaque
> directory structure.  Therefore, there are no backwards compatibility concerns
> on the mirroring side.
> 
> Backwards compatibility with existing clients is detailed
> in the `migrating mirrors to the hashed structure`_ section.  Backwards
> compatibility with the old clients will be provided by preserving
> the flat structure during the transitional period.

Even if there were no transition, things wouldn't be terrible, because
portage would fall back to downloading directly from SRC_URI if the
mirrors fail.

