On Sun, 28.01.2018 at 21:43 +0100, Andrew Barchuk wrote:
> [my apologies for posting the message to the wrong thread before]
> 
> Hi everyone,
> 
> > three possible solutions for splitting distfiles were listed:
> > 
> > a. using initial portion of filename,
> > 
> > b. using initial portion of file hash,
> > 
> > c. using initial portion of filename hash.
> > 
> > The significant advantage of the filename option was simplicity.  With
> > that solution, the users could easily determine the correct subdirectory
> > themselves.  However, its significant disadvantage was a very uneven
> > distribution of data.  In particular, the TeX Live packages alone count
> > almost 23500 distfiles and all use a common prefix, making it impossible
> > to split them further.
> > 
> > The alternative option of using a file hash has the advantage of
> > a more balanced split.
> 
> 
> There's another option: use character ranges for each directory,
> computed in a way that distributes the files evenly. One way to do
> that is to use a filename prefix of dynamic length so that each range
> holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
> texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
> but simpler option is to use file names as range bounds (the same way
> dictionaries use words to demarcate page bounds): each directory would
> have the name of the first file located inside. This way files would be
> distributed evenly and it would still be easy to manually pick
> the correct directory for a file.
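For concreteness, the range-bound scheme described above could be
sketched roughly as follows (Python; the helper names are hypothetical
and this is only an illustration of the idea, not existing tooling):

import bisect

def compute_bounds(filenames, buckets):
    """Pick filenames as range bounds so buckets hold similar counts."""
    names = sorted(filenames)
    step = max(1, len(names) // buckets)
    # The first bucket implicitly covers everything before the first bound.
    return [names[i] for i in range(step, len(names), step)]

def bucket_for(filename, bounds):
    """Return the directory (named after its first file) for filename."""
    i = bisect.bisect_right(bounds, filename)
    return bounds[i - 1] if i else ""   # "" = leading catch-all bucket

bounds = compute_bounds(["aalib-1.4rc5.tar.gz", "foo-2.tar.gz",
                         "texlive-module-tex-2017.tar.xz",
                         "texlive-module-thai-2017.tar.xz"], buckets=2)
print(bucket_for("texlive-module-tikz-2017.tar.xz", bounds))

Note that the bounds are derived from the current set of distfiles.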

What you're talking about is pretty much an adaptive algorithm. It may
look like a good idea at first, but it's really hard to predict how it
will behave, because you can't really predict what will happen to
distfiles in the future.

A few major events that could result in it going completely off:

a. we stop using split texlive packages and distribute a few big
tarballs instead,

b. texlive packages are renamed to use date before subpackage name,

c. someone adds another big package set.

That said, a big event isn't even necessary. Many small changes may
(or may not) cause it to gradually go off balance. If that happens, we
would need a contingency plan -- and I don't really like the idea of
having to reshuffle all the mirrors all of a sudden.

I think cryptographic hash algorithms are a better choice. They may not
be perfect, but by design they cope well with a wide variety of data.
Yes, we could in principle hit a data set that splits completely
unevenly, but that is rather unlikely compared to home-made algorithms.
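
For comparison, a hash-based split (hashing the filename, option c.
above) could be sketched like this -- the choice of BLAKE2 and
a two-character prefix below is only an assumption for illustration,
not a decided layout:

import hashlib

def distfile_subdir(filename, prefix_len=2):
    """Return the mirror subdirectory for a distfile, from its name hash."""
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:prefix_len]

# Filenames sharing a long common prefix still spread across directories:
for name in ("texlive-module-tex-2017.tar.xz",
             "texlive-module-thai-2017.tar.xz",
             "texlive-module-tikz-2017.tar.xz"):
    print(distfile_subdir(name) + "/" + name)

The split stays balanced regardless of how the distfile names are
distributed, and a file's directory never depends on which other files
exist, so nothing needs to be reshuffled when new files show up.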

-- 
Best regards,
Michał Górny

