I think there's a misunderstanding about how the network boot process works.
Let's look at PXELINUX first. PXELINUX does this:

1. Tries the UUID first # if no answer, it moves on
2. Tries the MAC address # if no answer, it moves on
3. Tries the full IP address (in hex) # if no answer, it moves on
4. Tries a partial IP address (last hex digit dropped) # if no answer, it moves on
5. Repeats 4 with one more digit dropped
6. Repeats 4 with one more digit dropped
[...]
7. Boots default.

This can be seen in the order of the files it requests:

/mybootdir/pxelinux.cfg/b8945908-d6a6-41a9-611d-74a6ab80b83d
/mybootdir/pxelinux.cfg/01-88-99-aa-bb-cc-dd
/mybootdir/pxelinux.cfg/C0A8025B
/mybootdir/pxelinux.cfg/C0A8025
/mybootdir/pxelinux.cfg/C0A802
/mybootdir/pxelinux.cfg/C0A80
/mybootdir/pxelinux.cfg/C0A8
/mybootdir/pxelinux.cfg/C0A
/mybootdir/pxelinux.cfg/C0
/mybootdir/pxelinux.cfg/C
/mybootdir/pxelinux.cfg/default
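
As a rough illustration of the lookup order listed above, here is a minimal
Python sketch that builds the same list of candidate paths. The function
name and the /mybootdir prefix are only for illustration (the prefix is
taken from the listing above); this is not code from PXELINUX or MAAS.

```python
import ipaddress


def pxelinux_config_candidates(uuid, mac, ip, prefix="/mybootdir/pxelinux.cfg"):
    """Return config paths in the order a PXELINUX client would try them."""
    candidates = []
    if uuid:
        candidates.append(uuid.lower())
    # MAC file name: ARP hardware type (01 = Ethernet), then the MAC
    # address in lowercase hex, separated by dashes.
    candidates.append("01-" + mac.lower().replace(":", "-"))
    # IPv4 address as 8 uppercase hex digits, then drop one digit at a time.
    hex_ip = "%08X" % int(ipaddress.IPv4Address(ip))
    for length in range(len(hex_ip), 0, -1):
        candidates.append(hex_ip[:length])
    # Last resort when nothing more specific was served.
    candidates.append("default")
    return [f"{prefix}/{name}" for name in candidates]


if __name__ == "__main__":
    for path in pxelinux_config_candidates(
        "b8945908-d6a6-41a9-611d-74a6ab80b83d",
        "88:99:aa:bb:cc:dd",
        "192.168.2.91",
    ):
        print(path)
```

Run with the values above, it prints the same sequence of paths as the
listing, ending in /mybootdir/pxelinux.cfg/default.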


That said, grub behaves similarly. You have described this behavior in
comment #16. So here is what is happening:

1. grub tries grub.cfg-<mac> multiple times, but since it doesn't get a
response, it gives up.
2. Once it gives up, grub.cfg-default-amd64 is tried instead.

However, the two requests are handled completely differently. The -<mac>
request actually accesses the *node* object in the database, looking it up
by the MAC address in the request. With this node object, we generate the
config file.

In comparison, the -default-amd64 request does *not* access the node
object. It just accesses two config settings, and the db query is *much*
cheaper. Also, we have to keep in mind that after grub has done many
retries, this request returns rather fast in comparison, not only because
it is cheaper, but also because at that point MAAS may be under much less
load from queued DB requests. Either way, grub giving up means that it
won't wait for a response to the initial request, but it will expect a
response for the new file it asked for.

That said, this is working *exactly* as expected, because this effectively
tells grub "if config for your MAC address was not returned, you can safely
assume you are an unknown machine to MAAS", hence grub requests a different
config file to start the enlistment process.

So this is *not* a race condition in MAAS. This is working as designed and
is expected. The problem here is that MAAS takes too long to answer the
initial request, which causes grub to time out and move on to request a
different config file.
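
To make the sequence concrete, here is a small Python simulation of a client
that retries the MAC-specific file and then falls back to the default one.
Everything in it is an assumption for illustration: the fetch() callable,
the retry count, the timeout and the latencies are made up, and it is not
how grub or MAAS are actually implemented.

```python
def fetch_with_fallback(fetch, mac, timeout_s=1.0, retries=3):
    """Try the MAC-specific config first, then fall back to the default.

    `fetch(name, timeout_s)` is a hypothetical callable that returns the
    file contents, or raises TimeoutError if no reply arrives in time.
    """
    names = [f"grub.cfg-{mac}", "grub.cfg-default-amd64"]
    for name in names:
        for _ in range(retries):
            try:
                return name, fetch(name, timeout_s)
            except TimeoutError:
                continue  # no reply in time: retry, then give up on this file
    raise TimeoutError("no config file could be fetched")


def make_fake_server(latency_by_name):
    """Build a fake `fetch` whose per-file latency stands in for server load."""
    def fetch(name, timeout_s):
        if latency_by_name[name] > timeout_s:
            raise TimeoutError(name)
        return f"# contents of {name}"
    return fetch


if __name__ == "__main__":
    mac = "88:99:aa:bb:cc:dd"
    # Made-up latencies: the MAC-specific answer is slow, the default is fast.
    slow_server = make_fake_server(
        {f"grub.cfg-{mac}": 60.0, "grub.cfg-default-amd64": 0.1}
    )
    name, _ = fetch_with_fallback(slow_server, mac)
    print("booted with", name)
```

With a slow answer for the MAC-specific file, the client ends up with
grub.cfg-default-amd64, i.e. it behaves as if the machine were unknown. If
the first answer arrives within the timeout, the same client gets the
MAC-specific file.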

On Mon, Feb 5, 2018 at 4:30 PM, Jason Hobbs <jason.ho...@canonical.com>
wrote:

> The packetdump (comment #35) of MAAS not responding to grub's request
> for the mac specific grub.cfg before grub times out, and then responding
> immediately to the generic-amd64 grub cfg, clearly shows a race
> condition in MAAS.
>
> MAAS's design of dynamically generating the interface specific grub
> config only after it receives the tftp request for it is susceptible to
> a race condition where grub times out before MAAS can respond.
>
> That design is not the only possible design.  All the information
> required for the interface specific grub.cfg is available before the
> machine ever powers on, and could be made available on the rack
> controllers at that time too.
>
> Doing so would eliminate that race condition, or at least reduce the
> opportunity greatly, as we see MAAS has no problems immediately
> responding and serving files that it doesn't need to dynamically
> generate at request time.
>
> There is still some question around what in the environment is
> contributing to MAAS not responding faster, and what MAAS is doing while
> it takes 60+ seconds to respond to the request, but that doesn't change
> the fact that the current MAAS design is racy (and that's a bug).
>
> Whatever we change in the environment to reduce the likelihood of
> hitting this issue there doesn't solve the underlying race condition in
> MAAS, and leaves open the possibility of hitting the issue other places
> too.
>
> --
> You received this bug notification because you are subscribed to MAAS.
> https://bugs.launchpad.net/bugs/1743249
>
> Title:
>   Failed Deployment after timeout trying to retrieve grub cfg
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
>


-- 
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1743249

Title:
  Failed Deployment after timeout trying to retrieve grub cfg

To manage notifications about this bug go to:
https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions

