On Wed, 30 Jul 2014, Ruben S. Montero wrote:

Hi,
1.- monitor_ds.sh may use LVM commands (e.g. vgdisplay) that need sudo access. This
should be set up automatically by the OpenNebula node
packages.
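
For reference, this is the kind of entry the node packages are expected to drop into /etc/sudoers.d (a sketch only; the exact file name, command list, and paths vary by distribution and OpenNebula version):

```
# Hypothetical /etc/sudoers.d/opennebula fragment -- paths are assumptions
Defaults:oneadmin !requiretty
oneadmin ALL=(ALL) NOPASSWD: /sbin/vgdisplay, /sbin/lvs, /sbin/lvcreate, /sbin/lvremove
```

If an entry like this is missing, any probe that shells out to sudo will sit waiting for a password, which matches the hang shown further down in this thread.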

2.- It is not a real daemon; the first time a host is monitored, a process is
left behind to periodically send information. OpenNebula
restarts it if no information is received within 3 monitor steps. Nothing needs to
be set up...
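
For anyone wanting to verify this on a node, a minimal sketch (the process name "collectd-client" is an assumption; check the actual scripts under /var/tmp/one for the real name):

```shell
# Sketch: decide from a `ps -ef`-style listing whether the collectd
# client process appears to be running. The "collectd-client" pattern
# is an assumption; adjust it to whatever the remotes actually install.
check_collectd() {
    if printf '%s\n' "$1" | grep -q 'collectd-client'; then
        echo running
    else
        echo missing
    fi
}

# Example usage on a node: check_collectd "$(ps -ef)"
```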

Cheers


On further inspection I found that this collectd client was indeed running on my nodes, and it had obviously been failing up until now because sudoers was not set up correctly. But there was nothing to warn us about it: nothing on
the OpenNebula head node to even tell us that the information was stale,
and no log file on the node to show the errors we were getting. In short,
it was just quietly dying and we had no idea. How can we make sure this
doesn't happen again in the future?
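
One way to guard against this is an external check that alarms when a host's last monitoring timestamp goes stale. A minimal sketch of the staleness test itself (the 60-second interval mirrors the usual oned.conf MONITORING_INTERVAL default, and pulling the timestamp out of `onehost show` is left to the reader; both are assumptions):

```shell
# Sketch: flag a host as stale when its last monitoring timestamp
# (epoch seconds) is more than 3 monitor steps old -- the same
# "3 monitor steps" window after which OpenNebula restarts the client.
MONITORING_INTERVAL=60   # assumed oned.conf default, in seconds

is_stale() {
    last=$1
    now=$2
    if [ $(( now - last )) -gt $(( 3 * MONITORING_INTERVAL )) ]; then
        echo stale
    else
        echo fresh
    fi
}

# Example: is_stale <epoch-of-LAST-MONITORING-TIME> "$(date +%s)"
```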

Steve Timm

On Wed, Jul 30, 2014 at 3:50 PM, Steven Timm <[email protected]> wrote:
      On Wed, 30 Jul 2014, Ruben S. Montero wrote:


            Maybe you could try to execute the monitor probes on the node:

            1. ssh the node
            2. Go to /var/tmp/one/im
            3. Execute run_probes kvm-probes


      When I do that (using sh -x), I get the following:

      -bash-4.1$ sh -x ./run_probes kvm-probes
      ++ dirname ./run_probes
      + source ./../scripts_common.sh
      ++ export LANG=C
      ++ LANG=C
      ++ export PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
      ++ PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
      ++ AWK=awk
      ++ BASH=bash
      ++ CUT=cut
      ++ DATE=date
      ++ DD=dd
      ++ DF=df
      ++ DU=du
      ++ GREP=grep
      ++ ISCSIADM=iscsiadm
      ++ LVCREATE=lvcreate
      ++ LVREMOVE=lvremove
      ++ LVRENAME=lvrename
      ++ LVS=lvs
      ++ LN=ln
      ++ MD5SUM=md5sum
      ++ MKFS=mkfs
      ++ MKISOFS=genisoimage
      ++ MKSWAP=mkswap
      ++ QEMU_IMG=qemu-img
      ++ RADOS=rados
      ++ RBD=rbd
      ++ READLINK=readlink
      ++ RM=rm
      ++ SCP=scp
      ++ SED=sed
      ++ SSH=ssh
      ++ SUDO=sudo
      ++ SYNC=sync
      ++ TAR=tar
      ++ TGTADM=tgtadm
      ++ TGTADMIN=tgt-admin
      ++ TGTSETUPLUN=tgt-setup-lun-one
      ++ TR=tr
      ++ VGDISPLAY=vgdisplay
      ++ VMKFSTOOLS=vmkfstools
      ++ WGET=wget
      +++ uname -s
      ++ '[' xLinux = xLinux ']'
      ++ SED='sed -r'
      +++ basename ./run_probes
      ++ SCRIPT_NAME=run_probes
      + export LANG=C
      + LANG=C
      + HYPERVISOR_DIR=kvm-probes.d
      + ARGUMENTS=kvm-probes
      ++ dirname ./run_probes
      + SCRIPTS_DIR=.
      + cd .
      ++ '[' -d kvm-probes.d ']'
      ++ run_dir kvm-probes.d
      ++ cd kvm-probes.d
      +++ ls architecture.sh collectd-client-shepherd.sh cpu.sh kvm.rb monitor_ds.sh name.sh poll.sh version.sh
      ++ for i in '`ls *`'
      ++ '[' -x architecture.sh ']'
      ++ ./architecture.sh kvm-probes
      ++ EXIT_CODE=0
      ++ '[' x0 '!=' x0 ']'
      ++ for i in '`ls *`'
      ++ '[' -x collectd-client-shepherd.sh ']'
      ++ ./collectd-client-shepherd.sh kvm-probes
      ++ EXIT_CODE=0
      ++ '[' x0 '!=' x0 ']'
      ++ for i in '`ls *`'
      ++ '[' -x cpu.sh ']'
      ++ ./cpu.sh kvm-probes
      ++ EXIT_CODE=0
      ++ '[' x0 '!=' x0 ']'
      ++ for i in '`ls *`'
      ++ '[' -x kvm.rb ']'
      ++ ./kvm.rb kvm-probes
      ++ EXIT_CODE=0
      ++ '[' x0 '!=' x0 ']'
      ++ for i in '`ls *`'
      ++ '[' -x monitor_ds.sh ']'
      ++ ./monitor_ds.sh kvm-probes
      [sudo] password for oneadmin:

      and it stays hung on the password for oneadmin.

      What's going on?

      Also, you mentioned a collectd--are you saying that OpenNebula 4.6 now
      needs to run a daemon on every single VM host?
      Where is it documented how to set it up?

      Steve

            Make sure you do not have a host using the same hostname fgtest14 and running a collectd process

            On Jul 29, 2014 4:35 PM, "Steven Timm" <[email protected]> wrote:

                  I am still trying to debug a nasty monitoring inconsistency.

                  -bash-4.1$ onevm list | grep fgtest14
                      26 oneadmin oneadmin fgt6x4-26       runn    6      4G fgtest14   117d 19h50
                      27 oneadmin oneadmin fgt5x4-27       runn   10      4G fgtest14   117d 17h57
                      28 oneadmin oneadmin fgt1x1-28       runn   10    4.1G fgtest14   117d 16h59
                      30 oneadmin oneadmin fgt5x1-30       runn    0      4G fgtest14   116d 23h50
                      33 oneadmin oneadmin ip6sl5vda-33    runn    6      4G fgtest14   116d 19h57
                  -bash-4.1$ onehost list
                    ID NAME            CLUSTER   RVM      ALLOCATED_CPU      ALLOCATED_MEM STAT
                     3 fgtest11        ipv6        0       0 / 400 (0%)    0K / 15.7G (0%) on
                     4 fgtest12        ipv6        0       0 / 400 (0%)    0K / 15.7G (0%) on
                     7 fgtest13        ipv6        0       0 / 800 (0%)    0K / 23.6G (0%) on
                     8 fgtest14        ipv6        5       0 / 800 (0%)    0K / 23.6G (0%) on
                     9 fgtest20        ipv6        3    300 / 800 (37%)  12G / 31.4G (38%) on
                    11 fgtest19        ipv6        0       0 / 800 (0%)    0K / 31.5G (0%) on
                  -bash-4.1$ onehost show 8
                  HOST 8 INFORMATION
                  ID                    : 8
                  NAME                  : fgtest14
                  CLUSTER               : ipv6
                  STATE                 : MONITORED
                  IM_MAD                : kvm
                  VM_MAD                : kvm
                  VN_MAD                : dummy
                  LAST MONITORING TIME  : 07/29 09:25:45

                  HOST SHARES
                  TOTAL MEM             : 23.6G
                  USED MEM (REAL)       : 876.4M
                  USED MEM (ALLOCATED)  : 0K
                  TOTAL CPU             : 800
                  USED CPU (REAL)       : 0
                  USED CPU (ALLOCATED)  : 0
                  RUNNING VMS           : 5

                  LOCAL SYSTEM DATASTORE #102 CAPACITY
                  TOTAL:                : 548.8G
                  USED:                 : 175.3G
                  FREE:                 : 345.6G

                  MONITORING INFORMATION
                  ARCH="x86_64"
                  CPUSPEED="2992"
                  HOSTNAME="fgtest14.fnal.gov"
                  HYPERVISOR="kvm"
                  MODELNAME="Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz"
                  NETRX="234844577"
                  NETTX="21553126"
                  RESERVED_CPU=""
                  RESERVED_MEM=""
                  VERSION="4.6.0"

                  VIRTUAL MACHINES

                      ID USER     GROUP    NAME            STAT UCPU    UMEM HOST       TIME
                      26 oneadmin oneadmin fgt6x4-26       runn    6      4G fgtest14   117d 19h50
                      27 oneadmin oneadmin fgt5x4-27       runn   10      4G fgtest14   117d 17h57
                      28 oneadmin oneadmin fgt1x1-28       runn   10    4.1G fgtest14   117d 17h00
                      30 oneadmin oneadmin fgt5x1-30       runn    0      4G fgtest14   116d 23h50
                      33 oneadmin oneadmin ip6sl5vda-33    runn    6      4G fgtest14   116d 19h57
                  -----------------------------------------------------------------------------------

                  All of this looks great, right?
                  Just one problem: there are no VMs running on fgtest14, and
                  there haven't been for 4 days.

                  [root@fgtest14 ~]# virsh list
                   Id    Name                           State
                  ----------------------------------------------------

                  [root@fgtest14 ~]#

                  -------------------------------------------------------------------------
                  Yet the monitoring reports no errors.

                  Tue Jul 29 09:28:10 2014 [InM][D]: Host fgtest14 (8) successfully monitored.

                  -----------------------------------------------------------------------------
                  At the same time, there is no evidence that ONE is actually trying, or
                  succeeding, to monitor these five VMs, yet they are still stuck in "runn",
                  which means I can't do a onevm restart to restart them.
                  (The images of these 5 VMs are still out there on the VM host, and
                  I would like to save and restart them if I can.)

                  What is the remotes command that ONE 4.6 would use to monitor this host?
                  Can I do it manually and see what output I get?

                  Are we dealing with some kind of a bug, or just a very confused system?
                  Any help is appreciated. I have to get this sorted out before
                  I dare deploy ONE 4.x in production.

                  Steve Timm


                  ------------------------------------------------------------------
                  Steven C. Timm, Ph.D  (630) 840-8525
                  [email protected]  http://home.fnal.gov/~timm/
                  Fermilab Scientific Computing Division, Scientific Computing Services Quad.
                  Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing
                  _______________________________________________
                  Users mailing list
                  [email protected]
                  http://lists.opennebula.org/listinfo.cgi/users-opennebula.org








-- 
Ruben S. Montero, PhD
Project co-Lead and Chief Architect OpenNebula - Flexible Enterprise Cloud Made Simple
www.OpenNebula.org | [email protected] | @OpenNebula