That's interesting. For us, the 7.5 version of libc was causing problems.
Either way, I'm looking forward to hearing about anything you find.

Mike

On Thu, Jan 13, 2011 at 11:47 PM, Erik Onnen <eon...@gmail.com> wrote:

> Too similar to be a coincidence I'd say:
>
> Good node (old AZ):  2.11.1-0ubuntu7.5
> Bad node (new AZ): 2.11.1-0ubuntu7.6
>
> You beat me to the punch with the test program. I was working on something
> similar to test it out and got sidetracked.
>
> I'll try the test app tomorrow and verify the versions of the AMIs used for
> provisioning.
>
> On Thu, Jan 13, 2011 at 11:31 PM, Mike Malone <m...@simplegeo.com> wrote:
>
>> Erik, the scenario you're describing is almost identical to what we've
>> been experiencing. Sounds like you've been pulling your hair out too! You're
>> also running the same distro and kernel as us. And we also run without swap.
>> Which begs the question... what version of libc6 are you running!? Here's
>> the output from one of our upgraded boxes:
>>
>> $ dpkg --list | grep libc6
>> ii  libc6                                    2.11.1-0ubuntu7.7
>>     Embedded GNU C Library: Shared libraries
>> ii  libc6-dev                                2.11.1-0ubuntu7.7
>>     Embedded GNU C Library: Development Librarie
>>
>> Before upgrading the version field showed 2.11.1-0ubuntu7.5. Wondering
>> what yours is.
>>
>> We also found ourselves in a similar situation with different regions.
>> We're using the Canonical Ubuntu AMI as the base for our systems, but
>> there appear to be small differences between the packages included in the
>> AMIs from different regions. It seems libc6 is one of the things that
>> changed. I discovered this by diff'ing `dpkg --list` on a node that was
>> good and one that was bad.
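>>
>> Roughly something like this (the host names here are just placeholders;
>> adapt to however you reach your boxes):
>>
>> $ ssh good-node 'dpkg --list' > good-node.pkgs
>> $ ssh bad-node 'dpkg --list' > bad-node.pkgs
>> $ diff good-node.pkgs bad-node.pkgs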
>>
>> The architecture hypothesis is also very interesting. If we could
>> reproduce the bug with the latest libc6 build I'd escalate it back up to
>> Amazon. But I can't repro it, so nothing to escalate.
>>
>> For what it's worth, we were able to reproduce the lockup behavior that
>> you're describing by running a tight loop that spawns threads. Here's a gist
>> of the app I used: https://gist.github.com/a4123705e67e9446f1cc -- I'd be
>> interested to know whether that locks things up on your system with a new
>> libc6.
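>>
>> If the gist isn't handy, a crude stand-in is a loop that rapidly creates
>> short-lived processes (per my earlier note, quoted at the bottom of this
>> thread, rapid process creation seems to trigger the bug as well as rapid
>> thread creation). It's not the same program, and obviously only run it on
>> a box you're willing to wedge:
>>
>> $ for i in $(seq 1 50000); do /bin/true & done; wait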
>>
>> Mike
>>
>> On Thu, Jan 13, 2011 at 10:39 PM, Erik Onnen <eon...@gmail.com> wrote:
>>
>>> May or may not be related but I thought I'd recount a similar experience
>>> we had in EC2 in hopes it helps someone else.
>>>
>>> As background, we had been running several servers in a 0.6.8 ring with
>>> no Cassandra issues (some EC2 issues, but none related to Cassandra) on
>>> multiple EC2 XL instances in a single availability zone. We decided to add
>>> several other nodes to a second AZ for reasons beyond the scope of this
>>> email. As we reached steady operational state in the new AZ, we noticed
>>> that the new nodes were repeatedly getting dropped from the ring.
>>> At first we attributed the drops to phi and expected cross-AZ latency. As we
>>> tried to pinpoint the issue, we found something very similar to what you
>>> describe - the EC2 VMs in the new AZ would become completely unresponsive.
>>> Not just the Java process hosting Cassandra, but the entire host. Shell
>>> commands would not execute for existing sessions, we could not establish new
>>> SSH sessions and tails we had on active files wouldn't show any progress. It
>>> appeared as if the machines in the new AZ would seize for several minutes,
>>> then come back to life with little rhyme or reason as to why. Tickets opened
>>> with AMZN resulted in responses of "the physical server looks normal".
>>>
>>> After digging deeper, here's what we found. We confirmed all nodes in
>>> both AZs were identical at the following levels (typical commands for
>>> checking each are sketched after the list):
>>> * Kernel (2.6.32-305-ec2 #9-Ubuntu SMP), distro (Ubuntu 10.04.1 LTS), and
>>> glibc, all on x86_64
>>> * All nodes were running identical Java distributions that we deployed
>>> ourselves, Sun 1.6.0_22-b04
>>> * Same amount of virtualized RAM visible to the guest, same RAID stripe
>>> configuration across the same size/number of ephemeral drives
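>>>
>>> (For anyone retracing this, all of the above is visible with stock
>>> commands; adjust for your own layout, e.g. the mdstat check only applies
>>> if you're striping with md:)
>>>
>>> $ uname -srv
>>> $ lsb_release -d
>>> $ dpkg-query -W libc6
>>> $ java -version
>>> $ free -m
>>> $ cat /proc/mdstat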
>>>
>>> We noticed two things that were different across the VMs in the two AZs:
>>> * The class of CPU exposed to the guest OSes across the two AZs (and,
>>> presumably, the class of physical server hosting each guest):
>>> ** On hosts in the AZ not having issues, the guest sees older
>>> Harpertown-class Intel CPUs: "model name : Intel(R) Xeon(R) CPU E5430
>>> @ 2.66GHz"
>>> ** On hosts in the AZ having issues, the guest sees newer Nehalem-class
>>> Intel CPUs: "model name : Intel(R) Xeon(R) CPU E5507 @ 2.27GHz"
>>> * Percent steal was consistently higher on the new nodes, averaging 25%,
>>> whereas the older (stable) VMs were around 9% at peak load (see the
>>> commands just below for checking both from the guest)
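>>>
>>> (Nothing exotic is needed for either check; something like:)
>>>
>>> $ grep 'model name' /proc/cpuinfo | sort | uniq -c
>>> $ vmstat 5   # the "st" column is steal time; top reports it as "st" too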
>>>
>>> Consistently in our case, we only saw this seizing behavior on guests
>>> running on the newer Nehalem architecture CPUs.
>>>
>>> In digging a bit deeper on the problem machines, we also noticed the
>>> following:
>>> * Most of the time, ParNew GC on the problematic hosts was fine,
>>> averaging around .04 "real" seconds. After spending time tuning the
>>> generations and heap size for our workload, we rarely have CMS collections
>>> and almost never have Full GCs, even during full or anti-compactions.
>>> * Rarely, and at the same time as the problematic machines would seize, a
>>> long-running ParNew collection would be recorded after the guest came back
>>> to life. Consistently this was between 180 and 220 seconds regardless of
>>> host - plenty of time for that host to be shunned from the ring.
>>> (GC-logging flags that capture these timings are sketched after this
>>> list.)
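>>>
>>> (If you want the JVM itself to record these, flags along these lines on
>>> Sun Java 6 write ParNew timings to a GC log; append them wherever your
>>> Cassandra JVM options are set, and note the log path here is just an
>>> example:)
>>>
>>> JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"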
>>>
>>> The long ParNew GCs were a mystery. They *never* happened on the hosts in
>>> the other AZ (the Harpertown class) and rarely happened on the new guests
>>> but we did observe the behavior within three hours of normal operation on
>>> each host in the new AZ.
>>>
>>> After lots of trial and error, we decided to remove ParNew collections
>>> from the equation and tried running a host in the new AZ with
>>> "-XX:-UseParNewGC"; this eliminated the long ParNew problem. The flip side
>>> is that we now do serial collections on the young generation for half our
>>> ring, which means those nodes spend about 4x more time in GC than the
>>> other nodes, but they've been stable for two weeks since the change.
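>>>
>>> (For anyone wanting to try the same workaround, it's a one-flag change
>>> wherever your Cassandra JVM options are set, e.g.:)
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:-UseParNewGC"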
>>>
>>> That's what we know for sure and we're back to operating without a hitch
>>> with the one JVM option change.
>>>
>>> <editorial>
>>> What I think is happening is more complicated; skip this part if you
>>> don't care about opinion, and some of this reasoning is surely incorrect.
>>> In talking with multiple VMware experts (I don't have much experience
>>> with Xen, but I imagine the same is true there as well), it's generally a
>>> bad idea to virtualize too many cores (two seems to be the sweet spot).
>>> The reason is that if you have a heavily multithreaded application that
>>> relies on consistent application of memory barriers across multiple cores
>>> (as Java does), the hypervisor has to wait for multiple physical cores to
>>> become available before it schedules the guest, so that each virtual core
>>> gets a consistent view of the virtual memory while scheduled. If the
>>> physical server is overcommitted, that wait time is exacerbated as the
>>> guest waits for the correct number of physical cores to become available
>>> (4 in our case). It's possible to see this in VMware via esxtop; I'm not
>>> sure about Xen. It would also be somewhat visible via %steal increases in
>>> the guest, which we saw, but that doesn't really explain a two-minute
>>> pause during garbage collection. My guess, then, is that one or more of
>>> the following are at play in this scenario:
>>>
>>> 1) a core Nehalem bug - the Nehalem architecture made a lot of changes to
>>> the way it manages TLBs, largely as a virtualization optimization. I
>>> doubt this is the case, but assuming the guest isn't seeing a different
>>> architecture than it reports, we did see this issue only on E5507
>>> processors.
>>> 2) the physical servers in the new AZ are drastically overcommitted -
>>> maybe AMZN bought into the notion that Nehalems are better at
>>> virtualization and is allowing more guests per physical server running
>>> Nehalems. I've no idea, just a hypothesis.
>>> 3) a hypervisor bug - I've deployed large JVMs to big physical Nehalem
>>> boxen running Cassandra clusters under high load and never seen behavior
>>> like the above. If I could see more of what the hypervisor was doing I'd
>>> have a pretty good idea here, but such is life in the cloud.
>>>
>>> </editorial>
>>>
>>> I should also say that I don't think any issues we had were related
>>> specifically to Cassandra. We were running fine in the first AZ, with no
>>> problems other than needing to grow capacity. Only when we saw the
>>> different architecture in the new EC2 AZ did we experience problems, and
>>> when we shackled the new-generation collector, the bad problems went away.
>>>
>>> Sorry for the long tirade. This was originally going to be a blog post,
>>> but I thought it would have more value in context here. I hope it
>>> ultimately helps someone else.
>>> -erik
>>>
>>>
>>> On Thu, Jan 13, 2011 at 5:26 PM, Mike Malone <m...@simplegeo.com> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> We've discovered an issue on Ubuntu/Lenny with libc6 2.11.1-0ubuntu7.5
>>>> (it may also affect versions between 2.11.1-0ubuntu7.1
>>>> and 2.11.1-0ubuntu7.4). The bug affects systems when a large number of
>>>> threads (or processes) are created rapidly. Once triggered, the system will
>>>> become completely unresponsive for ten to fifteen minutes. We've seen this
>>>> issue on our production Cassandra clusters under high load. Cassandra seems
>>>> particularly susceptible to this issue because of the large thread pools
>>>> that it creates. In particular, we suspect the unbounded thread pool for
>>>> connection management may be pushing some systems over the edge.
>>>>
>>>> We're still trying to narrow down what changed in libc that is causing
>>>> this issue. We also haven't tested things outside of xen, or on non-x86
>>>> architectures. But if you're seeing these symptoms, you may want to try
>>>> upgrading libc6.
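>>>>
>>>> (To see which build you're on and pull the newer one - standard apt,
>>>> assuming the updated package has reached your mirror:)
>>>>
>>>> $ dpkg-query -W libc6
>>>> $ sudo apt-get update && sudo apt-get install libc6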
>>>>
>>>> I'll send out an update if we find anything else interesting. If anyone
>>>> has any thoughts as to what the cause is, we're all ears!
>>>>
>>>> Hope this saves someone some heart-ache,
>>>>
>>>> Mike
>>>>
>>>
>>>
>>
>
