This doesn't really sound like an agent problem; it sounds like some
other root problem that is causing trouble for the agent. Perhaps it is
specific to this host simply because a particular VM always runs there
and that VM is triggering the issue. Perhaps a heavy logrotate or cron
job on the VM causes problems for librados. I'm just grasping at straws
here. From the output provided, it does seem that the libvirt bindings
that include Ceph code are terminating the agent: the failed assert in
ceph::log::SubsystemMap::should_gather ends in "terminate called", which
would abort the whole jsvc process and fits the sig=6 (SIGABRT) in your
audit log. My guess is that if you focus on "why this host" rather than
"what's going on", you'll find the answer to both. Sorry, I know that's
not much help.
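
If it helps with the "why this host" angle, here is a minimal sketch
for pulling the crash times out of the audit log so they can be lined
up against cron and logrotate schedules on the host and its VMs. It is
my own illustration, not part of CloudStack or auditd: the function
names are hypothetical, and the regex assumes the record layout shown
in the ANOM_ABEND line quoted below (one record per line in the real
log). The number inside audit(...) is seconds since the epoch.

```python
import re
import sys
from datetime import datetime, timezone

# Assumed layout, based on the single ANOM_ABEND record quoted in this thread:
#   type=ANOM_ABEND msg=audit(<epoch>.<ms>:<serial>): ... pid=N comm="x" ... sig=N
ABEND_RE = re.compile(
    r'type=ANOM_ABEND msg=audit\((?P<epoch>\d+\.\d+):(?P<serial>\d+)\):'
    r'.*?\bpid=(?P<pid>\d+)\s+comm="(?P<comm>[^"]*)"'
    r'.*?\bsig=(?P<sig>\d+)'
)


def parse_abend(line):
    """Return a dict for one ANOM_ABEND record, or None if the line is not one."""
    m = ABEND_RE.search(line)
    if m is None:
        return None
    return {
        # Converted to UTC; compare against the host's local time zone yourself.
        "time_utc": datetime.fromtimestamp(float(m.group("epoch")), tz=timezone.utc),
        "pid": int(m.group("pid")),
        "comm": m.group("comm"),
        "sig": int(m.group("sig")),
    }


def find_abends(lines, comm="jsvc"):
    """Return all abnormal-termination records for the given command name."""
    records = (parse_abend(line) for line in lines)
    return [r for r in records if r is not None and r["comm"] == comm]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python abends.py <path to your audit log>
    with open(sys.argv[1], errors="replace") as fh:
        for rec in find_abends(fh):
            print(rec["time_utc"].isoformat(), "pid", rec["pid"], "sig", rec["sig"])
```

If the times cluster tightly every day, that points at something
scheduled (on the host or inside a VM) rather than at random load.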

On Mon, Feb 23, 2015 at 7:29 AM, Andrija Panic <andrija.pa...@gmail.com> wrote:
> Anybody? Before I start to cry :(
>
> On 21 February 2015 at 21:18, Andrija Panic <andrija.pa...@gmail.com> wrote:
>
>> Hi Simon,
>>
>> SELinux is disabled; I have just double-checked.
>>
>> BTW, this is what I can see in the cloudstack-agent.err log. It looks like
>> a Ceph-related issue, but I'm not sure why the agent would die...
>> If I recall correctly, this might have been happening since the Ceph update
>> from 0.80.3? to 0.87, and it looks like a crash in librados...
>>
>>
>> libust[1907/2046]: Warning: HOME environment variable not set. Disabling
>> LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)
>> libvirt:  error : name in virDomainLookupByName must not be NULL
>> libvirt:  error : name in virDomainLookupByName must not be NULL
>> libvirt:  error : name in virDomainLookupByName must not be NULL
>> libvirt:  error : name in virDomainLookupByName must not be NULL
>> libvirt: Storage Driver error : failed to remove volume
>> 'cloudstack/bd751250-de35-4d2e-a4e3-3ee4b636c2a7': Device or resource busy
>> ./log/SubsystemMap.h: In function 'bool
>> ceph::log::SubsystemMap::should_gather(unsigned int, int)' thread
>> 7f04427fc700 time 2015-02-21 06:39:38.839210
>> ./log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
>>  ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
>>  1: (()+0x1fe223) [0x7f060c932223]
>>  2: (ObjectCacher::flusher_entry()+0x155) [0x7f060c9866e5]
>>  3: (ObjectCacher::FlusherThread::entry()+0xd) [0x7f060c9976cd]
>>  4: (()+0x79d1) [0x7f06605ee9d1]
>>  5: (clone()+0x6d) [0x7f066033bb5d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>> terminate called after throwing an instance of 'ceph::FailedAssertion'
>> 21/02/2015 06:39:38 1905 jsvc.exec error: Service did not exit cleanly
>>
>> On 20 February 2015 at 21:56, Simon Weller <swel...@ena.com> wrote:
>>
>>> Andrija,
>>>
>>> What is SELinux set to on this host?
>>>
>>>
>>> - SI
>>>
>>>
>>> ________________________________________
>>> From: Andrija Panic <andrija.pa...@gmail.com>
>>> Sent: Friday, February 20, 2015 6:06 AM
>>> To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
>>> Subject: Agent dies every night/morning.... memory violation
>>>
>>> Hi,
>>>
>>> I have a misbehaving agent on one of the hosts that is killed each morning,
>>> and I found this in /var/log/audit.log:
>>>
>>> type=ANOM_ABEND msg=audit(1424321463.930:430678): auid=0 uid=0 gid=0
>>> ses=68891 pid=10831 comm="jsvc" reason="memory violation" sig=6
>>>
>>> I don't remember changing anything on the system, but this keeps happening
>>> each morning around the same time, 5:20am-5:40am.
>>>
>>> I'm wondering what the heck is happening. Any suggestions on where to start
>>> troubleshooting?
>>> I will check the logs in detail anyway...
>>>
>>> --
>>>
>>> Andrija Panić
>>>
>>
>>
>>
>> --
>>
>> Andrija Panić
>>
>
>
>
> --
>
> Andrija Panić
