On 08/05/2023 14.03, Michał Górny wrote:
> On Mon, 2023-05-08 at 09:53 +0200, Florian Schmaus wrote:
>> Furthermore, both numbers, 256 MiB and 410 MiB, are based on the over-approximation that every EGO_SUM package uses 1.6 MiB, which is almost certainly not the case. The mean package-directory size of a EGO_SUM using package at 2022-02-16 was 280 KiB.
>
> Please extend this analysis to Manifest changes over time, and how they are going to impact total gentoo.git size.
Gladly.

The average daily change caused by Manifests of EGO_SUM packages from 2020-02-16 to 2022-02-16 was at most 80 KiB. (See below for the methodology used to obtain this number.)
In other words, a daily syncing user had at most 80 KiB of traffic per day, on average, to sync the Manifests of all EGO_SUM packages that existed on 2022-02-16.
Even in less developed regions of the world, 80 KiB a day is manageable. And this would still be the case if we doubled, quadrupled, or octupled this number.
I note that this number does not include ebuilds and metadata. However, one can safely over-approximate that the additional ebuild and metadata delta that comes with the observed Manifest changes is smaller than the Manifest changes themselves. Therefore, a pessimistic approximation is twice 80 KiB, that is, 160 KiB.
But then again, the 80 KiB does not consider transport compression. And, as we have learned, Manifests compress to roughly 50% of their original size. So the average EGO_SUM-generated network traffic, assuming that it is compressed, remains in the region of a hundred kilobytes per day.
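The estimate above can be written down as a minimal sketch. The function and parameter names are made up for illustration, and it assumes that the 50% compression ratio observed for Manifests also applies to the ebuild and metadata delta, which was not measured.

```python
def daily_traffic_kib(manifest_kib: float,
                      ebuild_factor: float = 2.0,
                      compression_ratio: float = 0.5) -> float:
    """Estimate the compressed daily sync traffic in KiB.

    manifest_kib:      uncompressed daily Manifest delta (80 KiB above)
    ebuild_factor:     pessimistic factor covering ebuilds and metadata
    compression_ratio: assumed ratio, taken from the Manifest observation
    """
    return manifest_kib * ebuild_factor * compression_ratio


if __name__ == "__main__":
    for multiplier in (1, 2, 4, 8):
        estimate = daily_traffic_kib(80 * multiplier)
        print(f"{multiplier}x Manifest delta: {estimate:.0f} KiB per day")
```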
We can also use this number to over-approximate the growth rate of gentoo.git due to EGO_SUM.
Assume that 120 EGO_SUM packages cause a daily growth rate of 160 KiB, that is 2x 80 KiB and the number we have used above. Doubling this number would yield the estimated rate of the current number of Go packages in ::gentoo. This rate amounts to 320 KiB daily, increasing gentoo.git by 114 MiB per year. Please double this number for a bit of future safety.
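For anyone who wants to check the yearly figure, here is a small sketch of the arithmetic. The names are hypothetical; the inputs are the numbers from the paragraph above, and a year is taken as 365 days.

```python
KIB_PER_MIB = 1024


def yearly_growth_mib(daily_kib: float, days: int = 365) -> float:
    """Convert a daily growth rate in KiB into yearly growth in MiB."""
    return daily_kib * days / KIB_PER_MIB


manifest_delta_kib = 80                         # measured upper bound
ego_sum_daily_kib = 2 * manifest_delta_kib      # plus ebuilds and metadata
all_go_daily_kib = 2 * ego_sum_daily_kib        # current number of Go packages
with_safety_kib = 2 * all_go_daily_kib          # doubled for future safety

if __name__ == "__main__":
    print(f"{yearly_growth_mib(all_go_daily_kib):.1f} MiB per year")
    print(f"{yearly_growth_mib(with_safety_kib):.1f} MiB per year with safety margin")
```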
In summary, this and the previous analysis find no data-size-based arguments against EGO_SUM's usage.
Using EGO_SUM is fine for users and developers. The ::gentoo increase, even if it quadrupled the current size, does not entail any issues. The expected average daily delta that EGO_SUM would cause today is also no threat, even for users with low-bandwidth connections. The size increase that EGO_SUM causes to gentoo.git is also within manageable bounds. If an ebuild developer has 1-2 gigabytes free on their disk, they will not need to buy a larger disk in the coming years if we start using EGO_SUM again in ::gentoo.
- Flow

# Appendix: Methodology

We took gentoo.git at 2022-02-16 at the commit 60dc7a03ff2f. From there, we created the numstat log (git log --numstat) of each Manifest of every EGO_SUM package. We configured the numstat log to go back at most two years in time, that is, till 2020-02-16. The numstat log contains the changed lines (added/removed) of the Manifest in the target period. An awk script calculated the total sum of added and removed lines. Note that this treats removed lines equal to added lines, even though the removed lines should cause significantly less network traffic.

We also extracted the date of the oldest commit in the observed period. This date was used to calculate the total number of days in the period, which accounts for packages that came to life after 2020-02-16 and would otherwise skew the analysis towards smaller results.
Dividing the total number of changed lines by the number of days yields the average number of lines changed per day per package.
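The per-package step can be sketched as follows. This is not the awk script from the analysis, only an illustration in Python. It assumes the log was produced by something like `git log --numstat --date=short --format=%cd -- <Manifest>`, so that each commit contributes one date line followed by a tab-separated "added, removed, path" line, and it assumes the period ends at the snapshot date. The package path in the sample is hypothetical.

```python
from datetime import date


def lines_per_day(numstat_log: str, end: date) -> float:
    """Average number of changed Manifest lines per day for one package."""
    total_changed = 0
    oldest = None
    for line in numstat_log.splitlines():
        fields = line.split("\t")
        if len(fields) == 3:
            added, removed, _path = fields
            # Removed lines are counted like added ones (over-approximation).
            if added.isdigit() and removed.isdigit():
                total_changed += int(added) + int(removed)
        elif line.strip():
            # Any other non-empty line is a commit date (YYYY-MM-DD).
            commit_day = date.fromisoformat(line.strip())
            if oldest is None or commit_day < oldest:
                oldest = commit_day
    if oldest is None:
        return 0.0
    # Count days from the oldest observed commit, not from the period start,
    # so that young packages are not averaged over days they did not exist.
    days = max((end - oldest).days, 1)
    return total_changed / days


SAMPLE_LOG = (
    "2022-02-06\n\n10\t5\tdev-go/foo/Manifest\n"
    "2022-01-17\n\n20\t5\tdev-go/foo/Manifest\n"
)

if __name__ == "__main__":
    print(lines_per_day(SAMPLE_LOG, date(2022, 2, 16)))
```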
We further determined the worst-observed line length in the Manifests of EGO_SUM packages, which was 404 bytes.
Summing the average number of lines changed per day over all packages yielded 195.58093724672614. Multiplying this number by the maximal observed line length of 404 bytes gives 79014.69 bytes per day or, in other words, roughly 80 KiB per day.
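The final step is a single multiplication, shown here as a small sketch with made-up variable names. The two inputs are the values reported above.

```python
# Sum over all EGO_SUM packages of the average changed lines per day.
TOTAL_LINES_PER_DAY = 195.58093724672614
# Worst observed Manifest line length, used for every line (over-approximation).
MAX_LINE_BYTES = 404

bytes_per_day = TOTAL_LINES_PER_DAY * MAX_LINE_BYTES
kib_per_day = bytes_per_day / 1024

if __name__ == "__main__":
    print(f"{bytes_per_day:.2f} bytes per day")
    print(f"{kib_per_day:.2f} KiB per day")
```

Because every changed line is charged at the maximum length, the true daily delta is below this value, so 80 KiB is a safe upper bound.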
The raw and post-processed results of this analysis are available at

https://dev.gentoo.org/~flow/gentoo-tree-analysis-results/2023-05-17T100838-gentoo-at-2022-02-16-60dc7a03ff2f/

The code used to carry out this analysis is available at

https://gitlab.gentoo.org/flow/gentoo-tree-analysis

for everyone to study the code, reproduce the results, and check for issues and bugs.
As always, I appreciate any feedback.