On 30.01.2020 17.04, Konstantin Ryabitsev wrote:
On Sun, 26 Jan 2020 at 11:19, Kristian Klausen via arch-mirrors
<arch-mirrors@archlinux.org> wrote:
I'm considering setting up an Arch Linux mirror and I'm considering a
different design.

So instead of mirroring the whole thing, the idea is to mirror only the
database files (core.db etc) and download the packages on demand from a
Tier 1 mirror (and let nginx cache them). By doing it that way, I only
download requested packages from the Tier 1 mirrors, instead of
downloading the whole thing (saving Tier 1 bandwidth).

To provide even better performance, a CDN (e.g. Cloudflare) could be used
to provide more caching. So we end up with a setup like this:
Cloudflare -> Nginx cache -> Tier1 mirrors (nginx with multiple upstream)

Am I missing something? Is this a bad idea?
If you are trying to save Tier1 some bandwidth, you'll probably
actually end up causing them more problems due to increased random
seek waits. Tier1 mirrors may not necessarily have fast storage -- for
example, all kernel.org mirror nodes have terabytes of spinning rust
and about half-a-TB of ssd used via lvm-cache. It works great for
Tier1 setups because most Tier2 mirrors want the same set of recent
updates that are served out of ssd cache. If a new mirror comes along
and wants to slurp an entire distro, that is fine too, because even
if there's higher iowait latency, the Tier2 mirror isn't working
against any HTTP timeouts or impatient clients and doesn't care if the
data arrives at a slower rate due to higher iowait. Tier1 can also
tell Tier2 mirror "I'm overloaded right now, please try again later"
and it'll be fine as most Tier2 mirrors can wait an hour or two before
receiving updates.

Making Tier1 mirrors a "cold cache" for your setup will likely cause
more disk thrash for them, but will also result in poorer service for
people using your mirror due to the reasons I listed above.

Tier 1 mirrors are also used directly by end users (correct me if I'm wrong).
So in the worst case (cache miss), my SSD-backed shared cache won't be noticeably slower than pulling directly from the Tier 1 mirror. In the best case (cache hit), I'm saving the Tier 1 mirror some bandwidth and disk usage. My idea is basically tiered caching (CDN -> Nginx SSD-backed shared cache -> Tier 1 mirror(s)); is that worse than the status quo? :)
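
To make the tiered lookup concrete, here is a minimal Python sketch of the chain CDN -> Nginx cache -> Tier 1. It is only a toy model under stated assumptions: the `Origin` and `CacheTier` classes and the package path are hypothetical, caches never expire or evict, and disk seek cost on the Tier 1 side (the concern raised earlier in the thread) is not modelled at all.

```python
from typing import Callable, Dict


class Origin:
    """Stands in for a Tier 1 mirror; counts how often it is actually hit."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self.files = files
        self.requests = 0

    def get(self, path: str) -> bytes:
        self.requests += 1
        return self.files[path]


class CacheTier:
    """One caching layer that falls through to its backend on a miss."""

    def __init__(self, name: str, backend: Callable[[str], bytes]) -> None:
        self.name = name
        self.backend = backend
        self.store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> bytes:
        if path in self.store:
            self.hits += 1
            return self.store[path]
        self.misses += 1
        data = self.backend(path)  # cache miss: ask the next tier
        self.store[path] = data
        return data


# Hypothetical package path and contents, for illustration only.
PKG = "core/os/x86_64/example-1.0-1-x86_64.pkg.tar.zst"
tier1 = Origin({PKG: b"package bytes"})
nginx_cache = CacheTier("nginx", tier1.get)
cdn = CacheTier("cdn", nginx_cache.get)

for _ in range(3):
    cdn.get(PKG)

print(tier1.requests, nginx_cache.misses, cdn.hits)  # prints: 1 1 2
```

In this model three client requests for the same package reach the Tier 1 mirror once; every first request for a given package still goes all the way to the origin.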

If someone
tries to install a package and watches their download bar sit at 0 for
half a minute due to backend proxies fetching data from Tier1 origin,
that's going to result in frustrated people.

Nginx streams the data as it is received from the upstream server, so in the worst case (cache miss) the data can be delivered as fast as it is received from upstream.
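
The streaming behaviour described here can be sketched in Python as a generator that passes each chunk on as soon as it arrives and stores the full body only at the end. This illustrates the idea and is not how Nginx is implemented; `stream_and_cache`, the path, and the chunk contents are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def stream_and_cache(
    path: str, upstream_chunks: Iterable[bytes], cache: Dict[str, bytes]
) -> Iterator[bytes]:
    """Yield each chunk to the client as it arrives, then store the full body."""
    body = bytearray()
    for chunk in upstream_chunks:
        body.extend(chunk)  # keep a copy for the cache
        yield chunk         # pass it on to the client immediately
    # Only reached once the upstream response has been fully consumed.
    cache[path] = bytes(body)


# Hypothetical path and chunk contents, for illustration only.
PATH = "core/os/x86_64/example-1.0-1-x86_64.pkg.tar.zst"
cache: Dict[str, bytes] = {}
response = stream_and_cache(PATH, iter([b"aaa", b"bbb", b"ccc"]), cache)

first = next(response)         # client has data after one upstream chunk
cached_midway = PATH in cache  # False: the body is not complete yet
rest = b"".join(response)      # drain the remaining chunks
```

The client gets its first bytes after a single upstream chunk, but that chunk still cannot arrive sooner than the upstream server produces it, so any delay at the origin is passed straight through on a cache miss.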

TL;DR: If you can afford CDN-fronting your mirror, that should be
mostly fine, but I would recommend against using Tier1 as your
cache-miss backend. Storage is cheap and most Tier1 mirrors have
unlimited bandwidth, so just run a Tier2 mirror (with slow/fast
storage caching) and keep local copies of everything.

-K
(mirrors.kernel.org administrator)
