On Tue, Feb 6, 2018 at 10:40 AM, Andres Rodriguez
<andres...@ubuntu-pe.org> wrote:
> On Tue, Feb 6, 2018 at 11:24 AM, Jason Hobbs <jason.ho...@canonical.com>
> wrote:
>
>> On Mon, Feb 5, 2018 at 4:07 PM, Andres Rodriguez
>> <andres...@ubuntu-pe.org> wrote:
>> > I think there's a misunderstanding on how the network boot process
>> > happens:
>> > Let's look at pxelinux first. Pxelinux does this:
>> >
>> > 1. tries UUID first # if no answer, it moves on
>> > 2. tries MAC # if no answer, it moves on
>> > 3. tries full IP address # if no answer, it moves on
>> > 4. tries partial IP address # if no answer, it moves on
>> > 5. does 4
>> > 6. does 4
>> > [...]
>> > 7. boots default.
>> >
>> > This can be seen in here:
>> >
>> > /mybootdir/pxelinux.cfg/b8945908-d6a6-41a9-611d-74a6ab80b83d
>> > /mybootdir/pxelinux.cfg/01-88-99-aa-bb-cc-dd
>> > /mybootdir/pxelinux.cfg/C0A8025B
>> > /mybootdir/pxelinux.cfg/C0A8025
>> > /mybootdir/pxelinux.cfg/C0A802
>> > /mybootdir/pxelinux.cfg/C0A80
>> > /mybootdir/pxelinux.cfg/C0A8
>> > /mybootdir/pxelinux.cfg/C0A
>> > /mybootdir/pxelinux.cfg/C0
>> > /mybootdir/pxelinux.cfg/C
>> > /mybootdir/pxelinux.cfg/default
>> >
>> >
>> > That said, in the case of grub, this behavior is similar. You have
>> > described this behavior in comment #16. So what is it that's happening:
>> >
>> > 1. grub is trying grub.cfg-<mac> address multiple times, but since it
>> > doesn't get a response, it gives up.
>> > 2. Once it gives up, grub.cfg-default-amd64 is tried instead.
>> >
>> > That said, the requests are handled completely differently. The -<mac>
>> > request actually accesses the *node* object in the database by
>> > searching for it with the MAC address the request is made from. With
>> > this node object, we generate the config file.
>> >
>> > In comparison, the -default-amd64 does *not* access the node object. It
>> > just accesses two config settings and the db query is *much* cheaper.
>> > Also, we have to keep in mind that after grub has done many retries,
>> > this returns rather fast in comparison because it is not only cheaper,
>> > but at that point MAAS may be under way less load from queued DB
>> > requests. Either way, grub giving up means that it won't expect a
>> > response to the initial request, but it will expect a new response for
>> > the new file it asked for.
>> >
>> > That said, this is working *exactly* as expected, because this
>> > effectively tells grub "if config for your MAC address was not
>> > returned, you can safely assume you are an unknown machine to MAAS",
>> > hence grub requests a different config file to start the enlistment
>> > process.
>>
>> Except it's not an unknown machine, and MAAS treating it like one is
>> bad behavior and a bug.
>>
>> This is not "working exactly as expected". "Working exactly as
>> expected" would be my machine being deployed when I asked for it to
>> be.
>>
>
> Yes, it is not an unknown machine, but that doesn't change the fact that
> this is working as designed. If the client didn't get a response for the
> request it makes, and the client decides to move on and makes a different
> request, then it is working as designed. Again, the bug here is not in the
> client's behavior, the bug here is the fact that the response is not
> being sent in a timely manner.
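
To make the sequence quoted above concrete, here is a minimal Python
sketch of the two behaviours under discussion: the PXELINUX config
lookup order, and a client falling back to the next file when the
server does not answer in time. This is not MAAS, grub or PXELINUX
code. The function names, the latency numbers and the timeout are all
made up for illustration, and grub's repeated retries are collapsed
into a single timeout per file.

```python
import ipaddress


def pxelinux_candidates(uuid, mac, ip):
    """Return config file names in the order listed in the quoted message."""
    # "01" is the ARP hardware type for Ethernet, followed by the MAC
    # in lowercase with dashes.
    names = [uuid, "01-" + mac.lower().replace(":", "-")]
    # The IPv4 address as 8 uppercase hex digits, then one trailing
    # digit removed per attempt.
    hex_ip = format(int(ipaddress.IPv4Address(ip)), "08X")
    names += [hex_ip[:n] for n in range(8, 0, -1)]
    names.append("default")
    return names


def first_served(candidates, latency, timeout):
    """Return the first file answered within the client's timeout.

    `latency` maps a file name to the (hypothetical) seconds the server
    needs to answer; a missing entry means the server never answers.
    """
    for name in candidates:
        if latency.get(name, float("inf")) <= timeout:
            return name
    return None


if __name__ == "__main__":
    for name in pxelinux_candidates(
        "b8945908-d6a6-41a9-611d-74a6ab80b83d",
        "88:99:AA:BB:CC:DD",
        "192.168.2.91",
    ):
        print("/mybootdir/pxelinux.cfg/" + name)

    grub_files = ["grub.cfg-<mac>", "grub.cfg-default-amd64"]
    # Hypothetical numbers: the per-node config takes 30s under load,
    # the default config 0.1s, and the client waits 10s per file.
    slow = {"grub.cfg-<mac>": 30.0, "grub.cfg-default-amd64": 0.1}
    print(first_served(grub_files, slow, timeout=10.0))
```

With those made-up numbers the known machine ends up with
grub.cfg-default-amd64, which is the failure in this bug. Drop the
per-node latency below the timeout and the same client logic picks
grub.cfg-<mac>, so the client side behaves the same in both cases and
only the server's response time decides the outcome.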

Yes, agreed 100%. It's not a client bug, it's a server bug.

>> > So this is *not* a race condition in MAAS. This is working as designed
>> > and is expected. The problem here is that MAAS takes too long to answer
>> > the initial request, which causes grub to time out and move on to
>> > request a different config file.
>>
>> Yes, because there is a race condition in the design - the MAC
>> specific file has to be generated before grub times out. It could
>> instead be generated before the node ever starts booting, allowing it
>> to be served just as fast as the -default-amd64 file is, eliminating
>> that race condition.
>>
>
> It is not a race condition. It is doing exactly what it was told to do. It
> requested X, didn't get a response, then it requested Y, and got a
> response. The fact that there's no response to X in a /timely/ manner is
> not a race, it's a bug on the server side. So, if the machine were not
> known to MAAS, it would work as expected. But since it is known and the
> response doesn't come in a timely manner for grub, it moves on. This is
> the same behavior pxe, uboot and other network bootloaders follow.

Right - it's a bug on the server side! That's what I've been saying.

> And yes, you could argue that the config could be generated before the node
> starts booting, but what you are not considering is that the node can boot
> from any rack controller really, and that would require maas to send the
> same file to all rack controllers in the same vlan the machine is booting
> from and write files onto the disk dynamically, which in fact can impact
> performance even more. The config is generated on the fly because it is
> generated for the specific rack controller where the machine is booting
> from, and that's the intended design.

I never suggested the files had to be written to disk, but yes, they
would need to be sent to each rack controller that it could boot from.
I know it's the intended design, but it has a race condition built in
that could be eliminated with another design. That's all I'm saying.
It sounds like you agree and you point out there would be trade offs,
and that's fine.

Jason

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1743249

Title:
  Failed Deployment after timeout trying to retrieve grub cfg

To manage notifications about this bug go to:
https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions