On Sun, 2018-01-28 at 21:43 +0100, Andrew Barchuk wrote:
> [my apologies for posting the message to a wrong thread before]
>
> Hi everyone,
>
> > three possible solutions for splitting distfiles were listed:
> >
> > a. using an initial portion of the filename,
> >
> > b. using an initial portion of the file hash,
> >
> > c. using an initial portion of the filename hash.
> >
> > The significant advantage of the filename option was simplicity. With that solution, users could easily determine the correct subdirectory themselves. However, its significant disadvantage was a very uneven distribution of data. In particular, the TeX Live packages alone count almost 23500 distfiles and all use a common prefix, making it impossible to split them further.
> >
> > The alternative option of using the file hash has the advantage of a more balanced split.
>
> There's another option: using character ranges for each directory, computed in a way that distributes the files evenly. One way to do that is to use a filename prefix of dynamic length so that each range holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar but simpler option is to use file names themselves as range bounds (the same way dictionaries use words to demarcate page bounds): each directory takes the name of the first file located inside it. This way files are distributed evenly and it's still easy to pick the correct directory for a file manually.
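[For illustration, a minimal sketch of the range-bound layout proposed above, assuming the full list of distfile names is known up front; the file names and directory count below are made up, not taken from an actual mirror.]

    import bisect

    def make_bounds(filenames, dir_count):
        """Split the sorted distfile names into roughly equal ranges and
        return the first filename of each range as that range's directory
        name."""
        names = sorted(filenames)
        size = max(1, -(-len(names) // dir_count))  # ceiling division
        return [names[i] for i in range(0, len(names), size)]

    def locate(filename, bounds):
        """Return the directory (range lower bound) that should hold
        filename."""
        # bisect_right finds the first bound greater than filename; the
        # file belongs to the directory named after the preceding bound.
        idx = bisect.bisect_right(bounds, filename) - 1
        return bounds[max(idx, 0)]

    files = ["Ab-1.0.tar.gz", "Apache-2.4.tar.bz2",
             "texlive-module-tex-2017.tar.xz",
             "texlive-module-thai-2017.tar.xz", "zlib-1.2.11.tar.gz"]
    bounds = make_bounds(files, 3)
    print(locate("texlive-module-thai-2017.tar.xz", bounds))

[The bounds only stay balanced for the snapshot of names they were computed from, which is the concern raised in the reply below.]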
What you're talking about is pretty much an adaptive algorithm. It may look like a good idea at first, but it's really hard to predict how it will work in the future, because you can't really predict what will happen to distfiles in the future. A few major events could send it completely off:

a. we stop using split texlive packages and distribute a few big tarballs instead,

b. texlive packages are renamed to use the date before the subpackage name,

c. someone adds another big package set.

That said, you don't need a big event for that. Many small events may (or may not) cause it to gradually go off. Whenever that happens, we would need a contingency plan -- and I don't really like the idea of having to reshuffle all the mirrors all of a sudden.

I think the cryptographic hash algorithms are a better choice. They may not be perfect, but by design they can cope with a lot of very different data. Yes, we could technically hit a completely uneven data set by accident, but that is rather unlikely compared to home-made algorithms.

--
Best regards,
Michał Górny
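[For reference, a minimal sketch of the filename-hash layout (option c above), assuming a BLAKE2 hash and a one-character hex prefix; the concrete hash function and prefix length would be whatever the final layout spec settles on.]

    import hashlib

    def distfile_subdir(filename, prefix_len=1):
        """Map a distfile name to its mirror subdirectory by taking a hex
        prefix of a cryptographic hash of the name."""
        digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
        return digest[:prefix_len]

    # The shared "texlive-module-" filename prefix no longer determines the
    # bucket; the hash does, which is the balancing property argued for above.
    for name in ("texlive-module-tex4ht-2017.tar.xz",
                 "texlive-module-thailatex-2017.tar.xz"):
        print(name, "->", distfile_subdir(name))

[Unlike the range-bound sketch, this mapping depends only on the filename itself, so adding or renaming packages never forces existing files to move.]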