That's interesting. For us, the 7.5 version of libc was causing problems. Either way, I'm looking forward to hearing about anything you find.
Mike

On Thu, Jan 13, 2011 at 11:47 PM, Erik Onnen <eon...@gmail.com> wrote:

> Too similar to be a coincidence I'd say:
>
> Good node (old AZ): 2.11.1-0ubuntu7.5
> Bad node (new AZ): 2.11.1-0ubuntu7.6
>
> You beat me to the punch with the test program. I was working on something similar to test it out and got side tracked.
>
> I'll try the test app tomorrow and verify the versions of the AMIs used for provisioning.
>
> On Thu, Jan 13, 2011 at 11:31 PM, Mike Malone <m...@simplegeo.com> wrote:
>
>> Erik, the scenario you're describing is almost identical to what we've been experiencing. Sounds like you've been pulling your hair out too! You're also running the same distro and kernel as us. And we also run without swap. Which begs the question... what version of libc6 are you running!? Here's the output from one of our upgraded boxes:
>>
>> $ dpkg --list | grep libc6
>> ii  libc6      2.11.1-0ubuntu7.7   Embedded GNU C Library: Shared libraries
>> ii  libc6-dev  2.11.1-0ubuntu7.7   Embedded GNU C Library: Development Libraries
>>
>> Before upgrading, the version field showed 2.11.1-0ubuntu7.5. Wondering what yours is.
>>
>> We also found ourselves in a similar situation with different regions. We're using the Canonical Ubuntu AMI as the base for our systems, but there appear to be small differences between the packages included in the AMIs from different regions. It seems libc6 is one of the things that changed. I discovered this by diffing `dpkg --list` on a node that was good and one that was bad.
>>
>> The architecture hypothesis is also very interesting. If we could reproduce the bug with the latest libc6 build I'd escalate it back up to Amazon. But I can't repro it, so there's nothing to escalate.
>>
>> For what it's worth, we were able to reproduce the lockup behavior that you're describing by running a tight loop that spawns threads. Here's a gist of the app I used: https://gist.github.com/a4123705e67e9446f1cc -- I'd be interested to know whether that locks things up on your system with a new libc6.
>>
>> Mike
>>
>> On Thu, Jan 13, 2011 at 10:39 PM, Erik Onnen <eon...@gmail.com> wrote:
>>
>>> May or may not be related, but I thought I'd recount a similar experience we had in EC2 in hopes it helps someone else.
>>>
>>> As background, we had been running several servers in a 0.6.8 ring with no Cassandra issues (some EC2 issues, but none related to Cassandra) on multiple EC2 XL instances in a single availability zone. We decided to add several other nodes to a second AZ for reasons beyond the scope of this email. As we reached steady operational state in the new AZ, we noticed that the new nodes in the new AZ were repeatedly getting dropped from the ring. At first we attributed the drops to phi and expected cross-AZ latency. As we tried to pinpoint the issue, we found something very similar to what you describe - the EC2 VMs in the new AZ would become completely unresponsive. Not just the Java process hosting Cassandra, but the entire host. Shell commands would not execute for existing sessions, we could not establish new SSH sessions, and tails we had on active files wouldn't show any progress. It appeared as if the machines in the new AZ would seize for several minutes, then come back to life with little rhyme or reason as to why. Tickets opened with AMZN resulted in responses of "the physical server looks normal".
>>>
>>> After digging deeper, here's what we found.
>>> To confirm, all nodes in both AZs were identical at the following levels:
>>> * Kernel (2.6.32-305-ec2 #9-Ubuntu SMP), distro (Ubuntu 10.04.1 LTS), and glibc on x86_64
>>> * All nodes were running identical Java distributions that we deployed ourselves, Sun 1.6.0_22-b04
>>> * Same amount of virtualized RAM visible to the guest, same RAID stripe configuration across the same size/number of ephemeral drives
>>>
>>> We noticed two things that were different across the VMs in the two AZs:
>>> * The class of CPU exposed to the guest OSes across the two AZs (and presumably the same physical server above that guest).
>>> ** On hosts in the AZ not having issues, we see from the guest older Harpertown class Intel CPUs: "model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz"
>>> ** On hosts in the AZ having issues, we see from the guest newer Nehalem class Intel CPUs: "model name : Intel(R) Xeon(R) CPU E5507 @ 2.27GHz"
>>> * Percent steal was consistently higher on the new nodes, on average 25%, whereas the older (stable) VMs were around 9% at peak load
>>>
>>> Consistently in our case, we only saw this seizing behavior on guests running on the newer Nehalem architecture CPUs.
>>>
>>> In digging a bit deeper on the problem machines, we also noticed the following:
>>> * Most of the time, ParNew GC on the problematic hosts was fine, averaging around .04 "real" seconds. After spending time tuning the generations and heap size for our workload, we rarely have CMS collections and almost never have Full GCs, even during full or anti-compactions.
>>> * Rarely, and at the same time as the problematic machines would seize, a long-running ParNew collection would be recorded after the guest came back to life. Consistently this was between 180 and 220 seconds regardless of host, plenty of time for that host to be shunned from the ring.
>>>
>>> The long ParNew GCs were a mystery. They *never* happened on the hosts in the other AZ (the Harpertown class) and rarely happened on the new guests, but we did observe the behavior within three hours of normal operation on each host in the new AZ.
>>>
>>> After lots of trial and error, we decided to remove ParNew collections from the equation and tried running a host in the new AZ with "-XX:-UseParNewGC", and this eliminated the long ParNew problem. The flip side is that we now do serial collections on the young generation for half our ring, which means those nodes spend about 4x more time in GC than the other nodes, but they've been stable for two weeks since the change.
>>>
>>> That's what we know for sure, and we're back to operating without a hitch with the one JVM option change.
>>>
>>> <editorial>
>>> What I think is happening is more complicated. Skip this part if you don't care about opinion; some of this reasoning is surely incorrect. In talking with multiple VMware experts (I don't have much experience in Xen but I imagine the same is true there as well), it's generally a bad idea to virtualize too many cores (two seems to be the sweet spot). The reason is that if you have a heavily multithreaded application and that app relies on consistent application of memory barriers across multiple cores (as Java does), the hypervisor has to wait for multiple physical cores to become available before it schedules the guest, so that each virtual core gets a consistent view of the virtual memory while scheduled.
>>> If the physical server is overcommitted, that wait time is exacerbated as the guest waits for the correct number of physical cores to become available (4 in our case). It's possible to tell this in VMware via esxtop; I'm not sure about Xen. It would also be somewhat visible via %steal increases in the guest, which we saw, but that doesn't really explain a two-minute pause during garbage collection. My guess, then, is that one or more of the following are at play in this scenario:
>>>
>>> 1) A core Nehalem bug - the Nehalem architecture made a lot of changes to the way it manages TLBs for memory, largely as a virtualization optimization. I doubt this is the case, but assuming the guest isn't seeing a different architecture, we did see this issue only on E5507 processors.
>>> 2) The physical servers in the new AZ are drastically overcommitted - maybe AMZN bought into the notion that Nehalems are better at virtualization and is allowing more guests to run on physical servers running Nehalems. I've no idea, just a hypothesis.
>>> 3) A hypervisor bug - I've deployed large JVMs to big physical Nehalem boxen running Cassandra clusters under high load and never seen behavior like the above. If I could see more of what the hypervisor was doing I'd have a pretty good idea here, but such is life in the cloud.
>>> </editorial>
>>>
>>> I also should say that I don't think any issues we had were at all related specifically to Cassandra. We were running fine in the first AZ, no problems other than needing to grow capacity. Only when we saw the different architecture in the new EC2 AZ did we experience problems, and when we shackled the new generation collector, the bad problems went away.
>>>
>>> Sorry for the long tirade. This was originally going to be a blog post, but I thought it would have more value in context here. I hope ultimately it helps someone else.
>>>
>>> -erik
>>>
>>> On Thu, Jan 13, 2011 at 5:26 PM, Mike Malone <m...@simplegeo.com> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> We've discovered an issue on Ubuntu/Lenny with libc6 2.11.1-0ubuntu7.5 (it may also affect versions between 2.11.1-0ubuntu7.1 and 2.11.1-0ubuntu7.4). The bug affects systems when a large number of threads (or processes) are created rapidly. Once triggered, the system will become completely unresponsive for ten to fifteen minutes. We've seen this issue on our production Cassandra clusters under high load. Cassandra seems particularly susceptible to this issue because of the large thread pools that it creates. In particular, we suspect the unbounded thread pool for connection management may be pushing some systems over the edge.
>>>>
>>>> We're still trying to narrow down what changed in libc that is causing this issue. We also haven't tested things outside of Xen, or on non-x86 architectures. But if you're seeing these symptoms, you may want to try upgrading libc6.
>>>>
>>>> I'll send out an update if we find anything else interesting. If anyone has any thoughts as to what the cause is, we're all ears!
>>>>
>>>> Hope this saves someone some heartache,
>>>>
>>>> Mike
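
The gist Mike links above isn't reproduced in the thread, but the kind of reproducer he describes is straightforward: a tight loop that creates and tears down short-lived threads as fast as possible. The sketch below is only an illustration of that idea, not the contents of the gist; the class name, batch size, and default thread count are all invented here. On a box running an affected libc6 this sort of loop is reportedly enough to wedge the whole machine, so it should only be run on a disposable instance.

// ThreadSpawnStress.java -- a hypothetical stand-in for the reproducer gist
// linked above (the actual gist contents are not shown in this thread).
// It creates and tears down short-lived threads as fast as it can, which is
// the pattern suspected of triggering the libc6 lockup.
public class ThreadSpawnStress {

    // Number of threads started per batch before joining them; arbitrary choice.
    private static final int BATCH = 100;

    public static void main(String[] args) throws InterruptedException {
        // Total number of threads to spawn; 1,000,000 is an arbitrary default.
        long total = args.length > 0 ? Long.parseLong(args[0]) : 1000000L;

        long started = 0;
        long begin = System.currentTimeMillis();

        while (started < total) {
            Thread[] batch = new Thread[BATCH];
            for (int i = 0; i < BATCH; i++) {
                batch[i] = new Thread(new Runnable() {
                    public void run() {
                        // Trivial work so the thread exits almost immediately;
                        // the interesting part is thread creation and teardown.
                        long x = 0;
                        for (int j = 0; j < 1000; j++) {
                            x += j;
                        }
                    }
                });
                batch[i].start();
            }
            for (Thread t : batch) {
                t.join();
            }
            started += BATCH;

            // Print progress periodically so a multi-minute stall shows up
            // as an obvious gap between consecutive lines of output.
            if (started % 10000 == 0) {
                long elapsed = System.currentTimeMillis() - begin;
                System.out.println(started + " threads in " + elapsed + " ms");
            }
        }
    }
}

Compiled with javac and run under the same JVM build the cluster uses (Sun 1.6.0_22 in the thread above), a lockup of the kind described would show up as a long pause between progress lines, or as the whole host going unresponsive.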