I forgot how to reply-all here :)

// jim

On Wed, May 18, 2016 at 05:35:55PM -0400, Jim Rollenhagen wrote:
> On Tue, May 17, 2016 at 11:32:25PM +0100, Stig Telfer wrote:
> > Is there anywhere that these experiences can be captured in a way that
> > might help?
> >
> > For example, I have a few DRAC-managed servers. About half have fallen
> > into a state where the pxe_drac driver can’t do anything with them
> > (python-dracclient claims another transaction is underway). But
> > pxe_ipmitool works happily.
> >
> > I’m pretty sure Ironic is not at fault here so it doesn’t seem fair to
> > catalogue these things as Ironic bugs. Perhaps the best action would be
> > for Ironic to be more informative when it identifies a BMC is playing up.
> >
> > Jay and Jim - any thoughts?
>
> Yeah, unfortunately we can't fix the terribleness of all the BMCs in the
> world. We are working on a few different efforts to help operators deal
> with these, generally (which are described in my summit wrapup).
> Nova-style notifications, BMC reset APIs, automatically returning nodes
> to service when a BMC is reachable again, etc.
>
> I'd totally file a bug with python-dracclient for the specific DRAC
> thing you mentioned.
>
> In general, feel free to file bugs, if it's something we can deal with
> we will triage it, if not we'll keep it in mind for the more general
> handling of these things.
>
> Does that help?
>
> // jim
>
> >
> > Best wishes,
> > Stig
> >
> >
> > > On 12 May 2016, at 11:37, Peter Love <p.l...@lancaster.ac.uk> wrote:
> > >
> > > Nice talk on this stuff: https://www.youtube.com/watch?v=GZeUntdObCA
> > >
> > > On 12 May 2016 at 10:54, Matt Jarvis <matt.jar...@datacentred.co.uk>
> > > wrote:
> > >> Very familiar list Tim, and we end up working around a lot of them with
> > >> horrible hardware specific code. Our bugbears also include :
> > >>
> > >> Required configuration only being available via a web interface - eg.
> > >> setting hostname of the BMC on Supermicro hardware
> > >> IPMI hanging and requiring complete removal and reload of the kernel
> > >> modules to enable resetting
> > >> Undocumented functions requiring raw IPMI commands - again on Supermicro
> > >> there is some black magic to set dedicated ports, check power supply
> > >> status etc.
> > >> Web interfaces requiring Java, and totally broken on mainstream browsers -
> > >> HP ILO's in particular, which are almost impossible to use with a Mac.
> > >> Firmware and BIOS'es which don't allow command line updating from inside a
> > >> running OS
> > >>
> > >> We're used to being able to flash BIOS images and CMOS settings by writing
> > >> directly to the memory addresses, but more and more modern hardware won't
> > >> let you do this anymore :(
> > >>
> > >> We're hoping Redfish will solve some of the configuration related issues,
> > >> although obviously it won't make any difference to flaky BMC implementations
> > >> and proprietary tooling to update firmware.
> > >>
> > >> On 12 May 2016 at 06:25, Tim Bell <tim.b...@cern.ch> wrote:
> > >>>
> > >>>
> > >>>
> > >>> On 12/05/16 06:22, "Stig Telfer" <stig.openst...@telfer.org> wrote:
> > >>>
> > >>>> Hi All -
> > >>>>
> > >>>> Jim Rollenhagen from the Ironic project has just posted a great summit
> > >>>> report of Ironic team activities on the openstack-devs mailing list[1],
> > >>>> which included this item which will be of interest to the Scientific WG
> > >>>> members who are looking to work on bare metal activities this cycle:
> > >>>>
> > >>>>> # Making ops less worse
> > >>>>>
> > >>>>> [Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-ops)
> > >>>>>
> > >>>>> We discussed some common failure cases that operators see, and how we
> > >>>>> can solve them in code.
> > >>>>>
> > >>>>> We discussed flaky BMCs, which end with the node in maintenance mode,
> > >>>>> and if Ironic can get them out of that mode automagically. We identified
> > >>>>> the need to distinguish between maintenance set by ironic and set by
> > >>>>> operators, and do things like attempt to connect to the BMC on a power
> > >>>>> state request, and turn off maintenance mode if successful. JayF is
> > >>>>> going to write a spec for this differentiation.
> > >>>>>
> > >>>>> Folks also expressed the desire to be able to reset the BMC via APIs. We
> > >>>>> have a BMC reset function in the vendor interface for the ipmitool
> > >>>>> driver; dtantsur volunteered to write a spec to promote that method to
> > >>>>> an official ManagementInterface method.
> > >>>>>
> > >>>>> We also talked for a while about stuck states. This has been mostly
> > >>>>> solved in code, but is still a problem for some deployers. We decided
> > >>>>> that we should not have a "reset-state" API like nova does, but rather a
> > >>>>> command line tool to handle this. lintan has volunteered to write a
> > >>>>> proposal for this; I have also posted some [straw man
> > >>>>> code](https://review.openstack.org/#/c/311273/) that someone is welcome
> > >>>>> to take over or use.
> > >>>>
> > >>>> The operator issues already identified cover some things we’ve hit at
> > >>>> Cambridge, please do scan through and contribute if there is anything they
> > >>>> have not covered.
> > >>>>
> > >>>
> > >>> We have certainly had our share of BMC problems through the years. It is
> > >>> often frustrating as the very time you find you need the console, it is not
> > >>> working. Having Ironic doing an active monitoring (without overloading)
> > >>> would be a real help.
> > >>>
> > >>> The other item we’ve found difficult has been in the configuration:
> > >>>
> > >>> - Software maintenance is very limited.
> > >>>   Some vendors choose to produce new versions of the BMC microcode
> > >>>   without changing the version number reported by the BMC which makes
> > >>>   consistent management difficult. There is no common API defined for
> > >>>   updating the code.
> > >>> - Implementations between IPMI 1.5 and IPMI 2.0 vary significantly and
> > >>>   between commodity white boxes and blades
> > >>> - BMCs have different LAN channels according to manufacturer for remote
> > >>>   access
> > >>> - The tty speeds vary which means that the booted OS needs to have
> > >>>   different cmdlines for the kernel according to the underlying hardware
> > >>> - The number of additional accounts is limited in some BMCs and password
> > >>>   management is very basic. Currently, we define distinct users for
> > >>>   read-only access to the SDRs (e.g. monitoring), console and power
> > >>>   operations since these need to be kept in different systems. We also
> > >>>   have unique passwords for each machine, all of which requires tracking.
> > >>>   Foreman helps here but it is not ideal.
> > >>> - BMC replacement is also frequent. A process to re-import a replacement
> > >>>   BMC (new MAC, no user accounts defined) without re-installing the box
> > >>>   is needed.
> > >>> - We have a fairly complex reset process which hits the BMC with different
> > >>>   levels of reset. We’ve also sometimes found the need to reset the IPMI
> > >>>   kernel modules at the same time which go into a loop.
> > >>>
> > >>> I’m not expecting Ironic to fix all of this but it would be great to have
> > >>> a block of code which we can gradually improve together. There are other
> > >>> good initiatives like OpenBMC but they won’t help with the existing boxes.
> > >>>
> > >>> I think my best advice to Ironic for BMC management would be to consider
> > >>> the BMC as a potentially unreliable device.
> > >>> Thus, along with performing the actions, checking they completed and
> > >>> probing that a function which was working an hour ago is still working
> > >>> now (but not overloading it)… we’ll be looking at Ironic this year so
> > >>> we’ll be able to help on the failure cases.
> > >>>
> > >>> Tim
> > >>>
> > >>>> Best wishes,
> > >>>> Stig
> > >>>>
> > >>>> [1]
> > >>>> http://lists.openstack.org/pipermail/openstack-dev/2016-May/094658.html
> > >>>> _______________________________________________
> > >>>> OpenStack-operators mailing list
> > >>>> OpenStack-operators@lists.openstack.org
> > >>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
> > >>
> > >> DataCentred Limited registered in England and Wales no. 05611763
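
For anyone who wants something concrete to poke at, the maintenance-mode idea from the summit notes (distinguish maintenance set by ironic from maintenance set by operators, and clear the former once the BMC answers a power state request) combined with the "probe, but don't overload" advice might look roughly like the sketch below. This is not Ironic code. `Node`, `maintenance_set_by`, `BMCUnreachable`, `probe_node` and the `min_interval` parameter are all hypothetical names invented for illustration, and the power-state getter is passed in as a plain callable rather than being any real driver interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class BMCUnreachable(Exception):
    """Raised by the (hypothetical) power-state getter when the BMC does not answer."""


@dataclass
class Node:
    name: str
    power_state: Optional[str] = None
    maintenance: bool = False
    # Hypothetical field: who put the node into maintenance, "ironic" or "operator".
    maintenance_set_by: Optional[str] = None
    # Time of the last probe attempt, in seconds (None if never probed).
    last_probe: Optional[float] = None


def probe_node(
    node: Node,
    get_power_state: Callable[[Node], str],
    now: float,
    min_interval: float = 60.0,
) -> bool:
    """Probe the BMC at most once per min_interval seconds.

    Returns True if a probe was attempted, False if it was skipped.
    """
    if node.last_probe is not None and now - node.last_probe < min_interval:
        return False  # too soon since the last attempt; avoid overloading the BMC
    node.last_probe = now
    try:
        node.power_state = get_power_state(node)
    except BMCUnreachable:
        # Only take ownership of maintenance if nobody has set it already.
        if not node.maintenance:
            node.maintenance = True
            node.maintenance_set_by = "ironic"
        return True
    # The BMC answered: undo maintenance only if this code set it.
    if node.maintenance and node.maintenance_set_by == "ironic":
        node.maintenance = False
        node.maintenance_set_by = None
    return True
```

The point of the `maintenance_set_by` check is that a node an operator deliberately parked in maintenance stays there even when its BMC is healthy, while a node that was only sidelined because of a flaky BMC comes back by itself. Passing `now` in explicitly keeps the rate limit easy to test without sleeping.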