Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

Lars Marowsky-Bree Mon, 20 Aug 2012 02:31:44 -0700

On 2012-08-17T18:14:18, "EXTERNAL Konold Martin (erfrakon, RtP2/TEF72)" 
<[email protected]> wrote:


> On the other hand you so far did not provide any case where SLES11 SP2 runs 
> reliably unmodified in a mission critical environment (e.g. a HA NFS server) 
> without local bugfixes.

Okay, so there's a bug in the NFS agent, point taken. I'll investigate
why it took so long to release as a real maintenance update; you're
right, that shouldn't happen. (I can already see it in the update queue
though.)

> This is exactly the simple example of a resource not working on a fully 
> updated SLES 11 SP2 HA Cluster.

Yes, conceded, but that doesn't mean that other scenarios - HA virtual
machines, OCFS2, ... - aren't working.

We didn't observe the rmtab growing that large here, and yes, it slipped
through.

> This bug has not been fixed in SLES 11 SP2 for many months. The fact that you are 
> aware of it but don't make a maintenance release for obvious bugs which are 
> triggered in the default use cases has something to say.
> 
> http://www.suse.com/support/kb/doc.php?id=7008514

A PTF is the first step once a problem has been reported by a support
customer (and is thus immediately available to other customers reporting
the same issue); it then is aggregated with other fixes, handed to QA,
and eventually released as a generic maintenance update. The last step
seems to have taken inappropriately long here, I'll prod the machinery
and figure out why.

> The fact that you don't take care that fixes are available in a timely manner 
> even though you claim that the issue was fixed upstream shows that SuSE 
> is not committed to supporting mission-critical setups.

We prioritize issues depending on how urgently customers report them.
I'd prefer to release them much more frequently and in smaller
increments, but then our customers complain about the update frequency.
It's a question of balance (that obviously hasn't worked out well
here).

> Are you actually running yourself a single instance of a SLES 11 SP2 cluster 
> in production?

Yes. We've got multiple clusters running in production and, of course,
development clusters as well.

> rt-lxcl9b:/var/log # ps uax | grep clvmd
> root      3227  0.0  0.0 149100 46404 ?        SLsl Aug10   0:24 /usr/sbin/clvmd -d0
> 
> These are about 74000 (*) messages from clvmd in about 40h.

Woah. And no, I don't see this here. Sorry. I'll investigate further.
Can you provide a log excerpt please? (Never mind, I see that you did
that below.)

> (In my case I do consulting work for a customer who wishes to evaluate if 
> migrating to SLES 11 SP2 is an option for mission critical workloads.
> I waited till SP2 was released before even starting the evaluation just to 
> find out that SP2 fails in the simple test cases with configurations 
> copied verbatim from SLES HA documentation.
> This customer buys SLES/RH/Windows licenses and support in bulk from a large 
> multi-national. It is not feasible to buy in addition an extra support 
> contract directly from SuSE just to be able to _report_ a bug or to provide a 
> patch.)

In such cases, a sales engineer would be able to help with bugs during
the evaluation phase, and make sure that for the evaluation/PoC you
already get the same priority as you'd get later.

But yes, I'm afraid that our policies don't account for bugs against SLE
being reported directly, without either involving sales or having an
active customer/partner support contract.

> In this case my customer has a "high-grade technology partner" which of 
> course has proper contracts with SuSE but my job is to

This sentence appears to be cut off? In any case, if the customer *has* such a
partner, reporting such issues via those channels would be preferable.

The PE constraint issue is one I find worrying too. I'd like to see a PE
input for that; thankfully, the PE is designed to be debuggable.

> I admit that I am unable to convince you that the fact that SLES11 SP2 fully 
> up to date does not work reliably even for the most simple use case with a 
> setup copied verbatim from SLES 11 SP2 HA documentation.

Bugs happen, even in documented cases. We've not observed that during
our testing (we actually had an external partner validate the NFS server
use case too).

> BTW: I was assuming that it was part of your job description to make sure 
> that critical upstream/community fixes get integrated into the SLES 11 HA 
> Extension in a timely manner. I guess that I am wrong.

Thanks for the personal attack, it is appreciated ;-)

It is. I'll figure out why and where it got stuck; but the fact remains
that we've not had other support customers report this yet; and if they
had, they'd have been provided the PTF (and their reports would have
raised the business priority of the fix in the maintenance queue).

You've worked for a Linux distributor in the past - you know how the
business model works.

> (*) The log fills up with the same rather useless debug output every 30 
> seconds:

For some reason, debug mode isn't being disabled in your environment.
Looking at the code, I can't immediately see why not, but I'll check it
in my environment too.


Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
