On Wed, 30 Jul 2014, Ruben S. Montero wrote:
Hi,

1.- monitor_ds.sh may use LVM commands (vgdisplay) that need sudo access. It should be set up automatically by the opennebula node packages.

2.- It is not a real daemon; the first time a host is monitored, a process is left running to periodically send information. OpenNebula restarts it if no information is received in 3 monitor steps. Nothing needs to be set up...

Cheers
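For point 1, one way to check on a node whether a probe command can run under sudo without a password prompt is sudo's -n (non-interactive) option, which makes sudo fail instead of asking. The following is only a minimal sketch: the function names and the SUDO_CMD variable are made up for illustration and are not part of OpenNebula.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with OpenNebula.
# SUDO_CMD is an illustrative override so the check can be pointed at a stub.
SUDO_CMD="${SUDO_CMD:-sudo}"

# Succeeds only if the command runs under sudo without a password prompt.
# -n tells sudo to exit with an error rather than prompt.
can_sudo_without_password() {
    "$SUDO_CMD" -n "$@" >/dev/null 2>&1
}

# Prints a one-line verdict and returns non-zero on failure.
# Note: a failure can also mean the command itself exited non-zero.
report_sudo_access() {
    if can_sudo_without_password "$@"; then
        echo "OK: sudo $* runs without a password"
    else
        echo "FAIL: sudo $* needs a password or is not permitted"
        return 1
    fi
}
```

Run as oneadmin on the node, for example with `report_sudo_access vgdisplay`. A FAIL here would be consistent with a probe hanging at a sudo password prompt.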
On further inspection I found that this collectd was running on my nodes, and obviously failing up until now because the sudoers was not set correctly. But there was nothing to warn us about it: nothing on the opennebula head node to even tell us that the information was stale, and no log file on the node to show the errors we were getting. In short, it was just quietly dying and we had no idea. How do we make sure this doesn't happen again in the future?

Steve Timm
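One possible answer to the staleness question is an external check that compares the time of the last monitoring update with the current time and warns once more than a few monitor steps have gone by (3 steps, following the restart rule mentioned above). This is a rough sketch only: the function names are hypothetical, the timestamps are assumed to be supplied in epoch seconds by the caller, and how to extract them from OpenNebula is not covered here.

```shell
#!/bin/sh
# Hypothetical staleness check, not part of OpenNebula.
# All times are epoch seconds; the monitoring interval is an assumed input.

# is_stale LAST NOW INTERVAL [STEPS]
# Succeeds if more than STEPS (default 3) intervals have passed since LAST.
is_stale() {
    last=$1; now=$2; interval=$3; steps=${4:-3}
    [ $(( now - last )) -gt $(( interval * steps )) ]
}

# check_host HOST LAST NOW INTERVAL
# Prints a verdict and returns non-zero when the data looks stale.
check_host() {
    host=$1; last=$2; now=$3; interval=$4
    if is_stale "$last" "$now" "$interval"; then
        echo "WARNING: $host last monitored $(( now - last ))s ago"
        return 1
    fi
    echo "OK: $host"
}
```

For example, `check_host fgtest14 "$last_update" "$(date +%s)" 60` could be run from cron on the head node, with `last_update` filled in by whatever source of timestamps is available. The 60-second interval is only a placeholder.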
On Wed, Jul 30, 2014 at 3:50 PM, Steven Timm <[email protected]> wrote:

On Wed, 30 Jul 2014, Ruben S. Montero wrote:

Maybe you could try to execute the monitor probes in the node,

1. ssh the node
2. Go to /var/tmp/one/im
3. Execute run_probes kvm-probes

When I do that, (using sh -x ) I get the following:

-bash-4.1$ sh -x ./run_probes kvm-probes
++ dirname ./run_probes
+ source ./../scripts_common.sh
++ export LANG=C
++ LANG=C
++ export PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
++ PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
++ AWK=awk
++ BASH=bash
++ CUT=cut
++ DATE=date
++ DD=dd
++ DF=df
++ DU=du
++ GREP=grep
++ ISCSIADM=iscsiadm
++ LVCREATE=lvcreate
++ LVREMOVE=lvremove
++ LVRENAME=lvrename
++ LVS=lvs
++ LN=ln
++ MD5SUM=md5sum
++ MKFS=mkfs
++ MKISOFS=genisoimage
++ MKSWAP=mkswap
++ QEMU_IMG=qemu-img
++ RADOS=rados
++ RBD=rbd
++ READLINK=readlink
++ RM=rm
++ SCP=scp
++ SED=sed
++ SSH=ssh
++ SUDO=sudo
++ SYNC=sync
++ TAR=tar
++ TGTADM=tgtadm
++ TGTADMIN=tgt-admin
++ TGTSETUPLUN=tgt-setup-lun-one
++ TR=tr
++ VGDISPLAY=vgdisplay
++ VMKFSTOOLS=vmkfstools
++ WGET=wget
+++ uname -s
++ '[' xLinux = xLinux ']'
++ SED='sed -r'
+++ basename ./run_probes
++ SCRIPT_NAME=run_probes
+ export LANG=C
+ LANG=C
+ HYPERVISOR_DIR=kvm-probes.d
+ ARGUMENTS=kvm-probes
++ dirname ./run_probes
+ SCRIPTS_DIR=.
+ cd .
++ '[' -d kvm-probes.d ']'
++ run_dir kvm-probes.d
++ cd kvm-probes.d
+++ ls architecture.sh collectd-client-shepherd.sh cpu.sh kvm.rb monitor_ds.sh name.sh poll.sh version.sh
++ for i in '`ls *`'
++ '[' -x architecture.sh ']'
++ ./architecture.sh kvm-probes
++ EXIT_CODE=0
++ '[' x0 '!=' x0 ']'
++ for i in '`ls *`'
++ '[' -x collectd-client-shepherd.sh ']'
++ ./collectd-client-shepherd.sh kvm-probes
++ EXIT_CODE=0
++ '[' x0 '!=' x0 ']'
++ for i in '`ls *`'
++ '[' -x cpu.sh ']'
++ ./cpu.sh kvm-probes
++ EXIT_CODE=0
++ '[' x0 '!=' x0 ']'
++ for i in '`ls *`'
++ '[' -x kvm.rb ']'
++ ./kvm.rb kvm-probes
++ EXIT_CODE=0
++ '[' x0 '!=' x0 ']'
++ for i in '`ls *`'
++ '[' -x monitor_ds.sh ']'
++ ./monitor_ds.sh kvm-probes
[sudo] password for oneadmin:

and it stays hung on the password for oneadmin.

What's going on?

Also, you mentioned a collectd--are you saying that OpenNebula 4.6 now needs to run a daemon on every single VM host? Where is it documented on how to set it up?

Steve

Make sure you do not have a host using the same hostname fgtest14 and running a collectd process

On Jul 29, 2014 4:35 PM, "Steven Timm" <[email protected]> wrote:

I am still trying to debug a nasty monitoring inconsistency.
-bash-4.1$ onevm list | grep fgtest14
    26 oneadmin oneadmin fgt6x4-26    runn    6    4G fgtest14 117d 19h50
    27 oneadmin oneadmin fgt5x4-27    runn   10    4G fgtest14 117d 17h57
    28 oneadmin oneadmin fgt1x1-28    runn   10  4.1G fgtest14 117d 16h59
    30 oneadmin oneadmin fgt5x1-30    runn    0    4G fgtest14 116d 23h50
    33 oneadmin oneadmin ip6sl5vda-33 runn    6    4G fgtest14 116d 19h57

-bash-4.1$ onehost list
  ID NAME      CLUSTER RVM ALLOCATED_CPU    ALLOCATED_MEM      STAT
   3 fgtest11  ipv6      0 0 / 400 (0%)     0K / 15.7G (0%)    on
   4 fgtest12  ipv6      0 0 / 400 (0%)     0K / 15.7G (0%)    on
   7 fgtest13  ipv6      0 0 / 800 (0%)     0K / 23.6G (0%)    on
   8 fgtest14  ipv6      5 0 / 800 (0%)     0K / 23.6G (0%)    on
   9 fgtest20  ipv6      3 300 / 800 (37%)  12G / 31.4G (38%)  on
  11 fgtest19  ipv6      0 0 / 800 (0%)     0K / 31.5G (0%)    on

-bash-4.1$ onehost show 8
HOST 8 INFORMATION
ID                    : 8
NAME                  : fgtest14
CLUSTER               : ipv6
STATE                 : MONITORED
IM_MAD                : kvm
VM_MAD                : kvm
VN_MAD                : dummy
LAST MONITORING TIME  : 07/29 09:25:45

HOST SHARES
TOTAL MEM             : 23.6G
USED MEM (REAL)       : 876.4M
USED MEM (ALLOCATED)  : 0K
TOTAL CPU             : 800
USED CPU (REAL)       : 0
USED CPU (ALLOCATED)  : 0
RUNNING VMS           : 5

LOCAL SYSTEM DATASTORE #102 CAPACITY
TOTAL:                : 548.8G
USED:                 : 175.3G
FREE:                 : 345.6G

MONITORING INFORMATION
ARCH="x86_64"
CPUSPEED="2992"
HOSTNAME="fgtest14.fnal.gov"
HYPERVISOR="kvm"
MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
NETRX="234844577"
NETTX="21553126"
RESERVED_CPU=""
RESERVED_MEM=""
VERSION="4.6.0"

VIRTUAL MACHINES
    ID USER     GROUP    NAME         STAT UCPU  UMEM HOST     TIME
    26 oneadmin oneadmin fgt6x4-26    runn    6    4G fgtest14 117d 19h50
    27 oneadmin oneadmin fgt5x4-27    runn   10    4G fgtest14 117d 17h57
    28 oneadmin oneadmin fgt1x1-28    runn   10  4.1G fgtest14 117d 17h00
    30 oneadmin oneadmin fgt5x1-30    runn    0    4G fgtest14 116d 23h50
    33 oneadmin oneadmin ip6sl5vda-33 runn    6    4G fgtest14 116d 19h57

-----------------------------------------------------------------------------------

All of this looks great, right? Just one problem: There are no VM's running on fgtest14 and haven't been for 4 days.
[root@fgtest14 ~]# virsh list
 Id    Name                           State
----------------------------------------------------

[root@fgtest14 ~]#

-------------------------------------------------------------------------

Yet the monitoring reports no errors.

Tue Jul 29 09:28:10 2014 [InM][D]: Host fgtest14 (8) successfully monitored.

-----------------------------------------------------------------------------

At the same time, there is no evidence that ONE is actually trying to monitor these five VMs, let alone succeeding, yet they are still stuck in "runn", which means I can't do a onevm restart to restart them. (The VM images of these 5 VMs are still out there on the VM host and I would like to save and restart them if I can.)

What is the remotes command that ONE4.6 would use to monitor this host? Can I do it manually and see what output I get?

Are we dealing with some kind of a bug, or just a very confused system? Any help is appreciated. I have to get this sorted out before I dare deploy one4.x in production.

Steve Timm

------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
[email protected]  http://home.fnal.gov/~timm/
Fermilab Scientific Computing Division, Scientific Computing Services Quad.
Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing

_______________________________________________
Users mailing list
[email protected]
http://lists.opennebula.org/listinfo.cgi/users-opennebula.org

--
Ruben S. Montero, PhD
Project co-Lead and Chief Architect
OpenNebula - Flexible Enterprise Cloud Made Simple
www.OpenNebula.org | [email protected] | @OpenNebula
