Here are IRC logs from a discussion that has taken place about this topic.

Summary:
- QEMU has ~$500/month Azure credits available that could be used for CI
- Burstable VMs like Azure AKS nodes seem like a good strategy in order to minimize hosting costs and parallelize when necessary to keep CI duration low
- Paolo is asking for someone from Red Hat to dedicate the time to set up Azure AKS with GitLab CI
- Personally, I don't think this should exclude other efforts like Eldon's. We can always add more private runners!
11:31 < pm215> does anybody who understands how the CI stuff works have the time to take on the task of getting this all to work with either (a) a custom runner running on one of the hosts we've been offered or (b) whatever it needs to run using our donated azure credits ?
11:34 < danpb> i would love to, but i can't volunteer my time right now :-(
11:34 < stefanha> Here is the email thread for anyone else who's trying to catch up (like me): https://lore.kernel.org/qemu-devel/cafeaca83u_enxdj3gjka-xv6eljgjpr_9frdkaqm3qacyhr...@mail.gmail.com/
11:34 -!- iggy [~iggy@47.152.10.131] has quit [Quit: WeeChat 3.5]
11:35 -!- peterx_ is now known as peterx
11:35 < danpb> what paolo suggested about using the Kubernetes runners for Azure seems like the ideal approach
11:35 -!- peterx [~x...@bras-base-aurron9127w-grc-56-70-30-145-63.dsl.bell.ca] has quit [Quit: WeeChat 3.6]
11:35 -!- peterx [~x...@bras-base-aurron9127w-grc-56-70-30-145-63.dsl.bell.ca] has joined #qemu
11:36 < danpb> as it would be most cost effective in terms of Azure resources consumed, and would scale well as it would support as many parallel runners as we can afford from our Azure allowance
11:36 < danpb> but it's also probably the more complex option to set up
11:36 -!- dramforever [~dramforev@59.66.131.33] has quit [Ping timeout: 480 seconds]
11:37 < stefanha> It's a little sad to see the people who are volunteering to help being ignored in the email thread.
11:37 < stefanha> Ben Dooks asked about donating minutes (the easiest solution).
11:38 -!- sgarzare [~sgarz...@c-115-213.cust-q.wadsl.it] has quit [Remote host closed the connection]
11:38 < peterx> jsnow[m]: an AHCI question: where normally does the address in ahci_map_fis_address() (aka, AHCIPortRegs.is_addr[_hi]) reside? Should that be part of guest RAM that the guest AHCI driver maps?
11:38 < stefanha> eldondev is in the process of setting up a runner.
11:38 < pm215> stefanha: the problem is there is no one person who has all of (a) the authority to do stuff (b) the knowledge of what the right thing to do is (c) the time to do it...
11:38 < danpb> stefanha: the challenge is how to accept the donation in a sustainable way
11:39 < th_huth> stefanha: Do you know whether we still got that fosshost server around? ... I know, fosshost is going away, but maybe we could use it as a temporary solution at least
11:39 < stefanha> th_huth: fosshost, the organization, has ceased to operate.
11:39 < danpb> fosshost ceased operation as a concept
11:39 < davidgiluk> I wonder if there's a way to make it so that those of us with hardware can avoid eating into the central CI count
11:39 < danpb> if we still have any access to a machine it's just by luck
11:39 < stefanha> th_huth: QEMU has 2 smallish x86 VMs ready to at Oregon State University Open Source Lab.
11:40 < stefanha> (2 vCPUs and 4 GB RAM, probably not enough for private runners)
11:40 -!- amorenoz [~amorenoz@139.47.72.25] has quit [Read error: Connection reset by peer]
11:41 < peterx> jsnow[m]: the context is that someone is optimizing migration by postponing all memory updates to after some point, and the AHCI post_load is not happy because it cannot find/map the FIS address here due to delayed memory commit() (https://pastebin.com/kADnTKzp). However when I look at that I failed to see why if it's in normal RAM (but I think I had somewhere wrong)
11:41 < danpb> stefanha: fyi gitlab's shared runners are 1 vCPU, 3.75 GB of RAM by default
11:41 < stefanha> Gerd Hoffmann seems to have a hands-off/stateless private runner setup that can be scaled to multiple machines.
11:41 < stefanha> danpb: :)
11:41 < danpb> stefanha: so those two small VMs are equivalent to 2 runners, and with 30-40 jobs we run in parallel
11:41 < stefanha> The first thing that needs to be decided is which approach to take:
11:41 < jsnow[m]> peterx: I'm rusty but I think so. the guest writes an FIS (Frame Information Structure) for the card to read and operate on
11:41 < danpb> stefanha: just having two runners is going to make our pipelines take x20 longer to finish
11:41 < stefanha> 1. Donating more minutes to GitLab CI
11:42 < stefanha> 2. Using Azure to spawn runners
11:42 < stsquad> stefanha I think the main bottleneck is commissioning and admin'ing machines - but we have ansible playbooks to do it for our other custom runners so it should "just" be a case of writing one for an x86 runner
11:42 < stefanha> 3. Hosting runners on servers
11:42 < danpb> that's why i think the Azure k8s executor sounded promising - it would burst up to 20-30 jobs in parallel for the short time we run CI
11:42 < danpb> without us having to pay for 30 vms 24x7
11:42 < stsquad> who actually understands configuring k8s?
11:42 < th_huth> stefanha: I read that fosshost announcement that they will be going away ... not that they have already terminated everything ... but sure, it's not something sustainable
11:42 < stefanha> stsquad: Yep, kraxel's approach solves that because it's stateless/automated.
11:43 < peterx> jsnow[m]: thanks, then let me double check
11:43 < jsnow[m]> peterx: iirc the guest writes the address of the FIS to a register and then the pci card maps that address to read the larger command structure
11:43 < danpb> stsquad (@_oftc_stsquad:matrix.org): i don't think we'd need to configure k8s itself, just figure out how to point gitlab runner at the azure k8s service
11:44 < stefanha> danpb: Has someone calculated the cost needed to run QEMU CI on Azure? It's great that we can burst it when needed, but will the free Azure quota be enough?
11:44 -!- mmu_man [~re...@188410969.box.freepro.com] has joined #qemu
11:44 < danpb> stsquad: the problem with our current ansible playbooks is that none of them used docker AFAIR, they just set up the gitlab runner as bare metal
11:44 < peterx> jsnow[m]: yes, the thing is if that's the case the RAM should have been there when post_load() even without commit anything, so maybe there's something else I missed
11:44 < stefanha> i.e. will we just hit a funding wall again but on Azure instead of on GitLab?
11:44 < danpb> stefanha don't think anyone's calculated it, would have to ask bonzini what we actually get access to
11:45 < danpb> what would help is that we would not need azure for the whole month
11:45 < danpb> we would only need it to fill in the gap when gitlab allowance is consumed
11:45 < stsquad> danpb I'm sure they can be re-written - I can't recall what stopped us using docker in the first place
11:46 < stsquad> but I'm a little wary of experimenting on the live CI server
11:46 < danpb> they wanted to run avocado tests which utilize some bare metal features
11:46 < stsquad> ahh that would be it
11:46 < stsquad> access to /dev/kvm
11:46 < danpb> i suggested that we set it up to expose KVM etc to the container but it wasn't done that way :-(
11:49 < stefanha> danpb: A simple estimate would be: "QEMU uses 50k CI minutes around the 20th of each month, so that's 50/20 * 10 more days = 25k CI minutes needed to cover those last 10 days"
11:49 < stefanha> Assuming GitLab CI minutes are equivalent to Azure k8s minutes
11:50 < stefanha> and then multiply 25k minutes by the Azure instance price rate.
11:51 < stefanha> ISTR the Azure quota is manually renewed by bonzini[m]. It may have been something like $10k and we use $2k of it for non-CI stuff at the moment.
11:52 < stefanha> I'm not sure if the $10k is renewed annually or semi-annually.
11:52 -!- genpaku_ [~genpaku@107.191.100.185] has quit [Read error: Connection reset by peer]
11:52 < stefanha> So maybe $8k available per year.
11:52 < dwmw2_gone> I feel I ought to be able to round up some VM instances too.
11:53 -!- farosas [~farosas@177.103.113.244] has quit [Quit: Leaving]
11:53 -!- farosas [~farosas@177.103.113.244] has joined #qemu
11:54 < bonzini> stefanha: right, more like $3k to be safe
11:54 < bonzini> dwmw2_gone: the right thing to do would be to set up kubernetes/Fargate
11:54 < bonzini> same for Azure
11:54 -!- zzhu [~z...@072-182-049-214.res.spectrum.com] has quit [Remote host closed the connection]
11:55 < bonzini> dwmw2_gone: because what we really need is beefy VMs (let's say 10*16 vCPU) for a few hours a week, not something 24/7
11:55 < bonzini> the Azure and AWS estimators both gave ~1000$/year
11:56 -!- genpaku [~genpaku@107.191.100.185] has joined #qemu
11:57 < dwmw2_gone> I have "build scripts" which launch an instance, do the build there, terminate it. Why would you need anything 24/7? :)
11:57 < dwmw2_gone> I abuse some of our test harnesses for builds
11:57 < dwmw2_gone> You can have bare metal that way, and actually get KVM.
11:58 < bonzini> dwmw2_gone: 24/7 because that's what the gitlab runners want (unless you put them on kubernetes)
11:59 -!- vliaskov [~vlias...@dynamic-077-191-055-225.77.191.pool.telefonica.de] has quit [Remote host closed the connection]
11:59 < dwmw2_gone> Ah. Unless the gitlab runners just spawned the instance to do the test, and waited for it. They don't use many CPU minutes that way.
11:59 -!- bolt [~r...@000182e9.user.oftc.net] has quit [Ping timeout: 480 seconds]
12:00 < stefanha> 25k mins / 60 minutes/hour = 417 hours/month @ AKS node hourly price $0.077 = $32/month (!)
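For anyone who wants to re-run the estimate above with different numbers, here is a minimal Python sketch. The function name and parameters are made up for illustration. It keeps stefanha's simplifying assumption that one GitLab CI minute corresponds to one minute on a small AKS node, and it takes the $0.077 hourly price from the log as given rather than from a current price list.

```python
import math


def azure_gap_cost(minutes_used, days_elapsed, days_remaining, node_price_per_hour):
    """Estimate the cost of covering the rest of the month on small AKS nodes.

    Assumes CI usage stays at the same daily rate and that one CI minute
    equals one node minute (both are simplifications from the discussion).
    """
    minutes_per_day = minutes_used / days_elapsed
    gap_minutes = minutes_per_day * days_remaining
    # Round up to whole node-hours, as in the "417 hours/month" figure.
    gap_hours = math.ceil(gap_minutes / 60)
    return gap_minutes, gap_hours, gap_hours * node_price_per_hour


# 50k minutes used by the 20th, 10 days left, $0.077 per node-hour.
gap_minutes, gap_hours, cost = azure_gap_cost(50_000, 20, 10, 0.077)
print(f"{gap_minutes:.0f} minutes, {gap_hours} node-hours, ${cost:.2f}/month")
```

As noted a few messages later in the log, this probably underestimates the number and size of instances, so the result is better read as a lower bound than as a budget.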
12:00 < bonzini> stefanha: danpb: i think spending 250-500 $ on GitLab CI while we set up Azure in the next couple months is workable
12:00 < stefanha> That's with small nodes similar to GitLab CI runners
12:00 < danpb> bonzini: unless we're trying to get the pipeline wallclock time shorter, we don't need really beefy VMs - gitlabs runners are quite low resources, we just use a lot in parallel
12:01 < bonzini> danpb: 10*16 vCPUs cost less than 80*2 vCPUs anyway
12:01 < stefanha> It seems the Azure quota will be fine
12:01 < stefanha> Hmm...actually I think I'm underestimating the number of instances and their size.
12:01 < danpb> bonzini i guess RAM is probably their dominating cost factor for VMs rather than CPUs
12:02 < bonzini> danpb: a bit of both
12:03 < danpb> stefanha: don't forget that our gitlab CI credits don't reflect wallclock time - there's a 0.5 cost factor - so our 50,000 credits == 100,000 wallclock minutes per month
12:03 -!- Moot [~Moo99@185.247.84.132] has quit [Read error: Connection reset by peer]
12:03 -!- bkircher [~bk@2001:a61:251f:7001:8aae:ddff:fe01:5bb2] has quit [Remote host closed the connection]
12:03 < stefanha> With the current Azure quota QEMU could spend around $500/month on Azure container service and nodes.
12:03 -!- bkircher [~bk@2001:a61:251f:7001:8aae:ddff:fe01:5bb2] has joined #qemu
12:04 < danpb> we burnt through 100,000 in about 2.5 weeks so would need to allow for perhaps another 50,000 wallclock minutes at that rate
12:04 < stefanha> danpb: I think it's still worth a shot with a $500/month budget.
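danpb's correction can be folded into the estimate with a short Python sketch. The helper names are hypothetical. The node price is the $0.077/hour figure quoted earlier in the log, and pricing one small node per wallclock minute is my assumption, not something anyone in the discussion confirmed.

```python
def credits_to_wallclock(credits, cost_factor):
    """Convert GitLab CI credits to wallclock minutes (0.5 factor per danpb)."""
    return credits / cost_factor


def node_cost(wallclock_minutes, node_price_per_hour):
    """Cost of running that many minutes on small nodes (assumed pricing model)."""
    return wallclock_minutes / 60 * node_price_per_hour


monthly_wallclock = credits_to_wallclock(50_000, 0.5)  # 100,000 wallclock minutes
extra_cost = node_cost(50_000, 0.077)  # the "perhaps another 50,000" minutes
monthly_budget = 500  # USD, the figure stefanha mentions

print(f"{monthly_wallclock:.0f} wallclock minutes per month from GitLab credits")
print(f"about ${extra_cost:.0f}/month for the extra minutes, budget ${monthly_budget}")
```

Even with the 0.5 factor doubling the minutes, the rough figure stays well under the $500/month budget, which is consistent with the "still worth a shot" conclusion. Larger nodes would raise it.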
12:04 < bonzini> AWS Fargate has 60000 minutes * vCPU at 60 $/month
12:04 < danpb> yeah it does seem like it's worth a try to use Azure since we have the resources there going otherwise unused
12:04 < bonzini> Azure I think it was $1000/year
12:04 < bonzini> which is the same
12:04 -!- iggy [~iggy@47.152.10.131] has joined #qemu
12:05 < bonzini> Average duration: 40 minutes = 0.67 hours
12:05 < bonzini> 1,500 tasks x 1 vCPU x 0.67 hours x 0.04048 USD per hour = 40.68 USD for vCPU hours
12:05 < bonzini> 1,500 tasks x 4.00 GB x 0.67 hours x 0.004445 USD per GB per hour = 17.87 USD for GB hours
12:05 < bonzini> 40.68 USD for vCPU hours + 17.87 USD for GB hours = 58.55 USD total
12:05 < stefanha> https://makinhs.medium.com/azure-kubernetes-aks-gitlab-ci-a-short-guide-to-integrate-it-e62a4df5c86a
12:06 < bonzini> stefanha: let's ask if jeff nelson could have someone do it
12:06 < stefanha> bonzini: ok, do you want to ping him?
12:06 -!- Katje [freem...@mail.quixotic.eu] has joined #qemu
12:06 < bonzini> yep
12:06 < stefanha> Thank you!
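The Fargate figures bonzini pasted can be checked with a short Python sketch. The function and parameter names are hypothetical. The rates (0.04048 USD per vCPU-hour and 0.004445 USD per GB-hour) are copied from the log as quoted from the AWS estimator and may not match current pricing.

```python
def fargate_cost(tasks, vcpus, memory_gb, hours_per_task,
                 usd_per_vcpu_hour, usd_per_gb_hour):
    """Return (vCPU cost, memory cost, total) in USD, rounded to cents.

    Each component is rounded before summing, as the estimator output
    in the log appears to do.
    """
    vcpu_cost = round(tasks * vcpus * hours_per_task * usd_per_vcpu_hour, 2)
    memory_cost = round(tasks * memory_gb * hours_per_task * usd_per_gb_hour, 2)
    return vcpu_cost, memory_cost, round(vcpu_cost + memory_cost, 2)


# 1,500 tasks of 40 minutes (0.67 hours) each, 1 vCPU and 4 GB per task.
vcpu_cost, memory_cost, total = fargate_cost(1500, 1, 4.0, 0.67, 0.04048, 0.004445)
print(f"vCPU: {vcpu_cost} USD, memory: {memory_cost} USD, total: {total} USD")
```

1,500 tasks at 40 minutes each is the 60000 vCPU-minutes bonzini mentions, so changing `tasks` or `hours_per_task` is the easiest way to see how the monthly cost scales with CI load.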