On Wed, Jun 3, 2015 at 12:01 PM, Dan <danharve...@gmail.com> wrote:

> A general point we've found with higher network load on some of the newer
> ec2 instances.
>
> The MTU defaults to be larger than usual for jumbo packet support on the
> faster network interfaces, but with a bug in the kernel / xen this causes
> kernel panic's. So we've set the MTU to be 1500 which fixed that issue
> until the kernel support for that improves.
>
> We're using CoreOS and run datastores (including kafka) with m3/r3/c3
> instance types.
>
> We found this towards the start of this year, I've not checked recently if
> this has been fixed yet.
>
Until a few weeks ago, we were seeing random networking problems that
sometimes showed up after restarting a cluster (tcpdump showed that an scp
between nodes never successfully sent packets larger than a few hundred
bytes). Based on discussions like this one, we reduced the default MTU to
1492 a few weeks ago and have not seen the problem since (running
Ubuntu 14.04).

Also, in separate experiments on machines running an Ubuntu userland in
Docker on top of CentOS 6.6, we never saw this problem (with the default
8K jumbo frames).

Peter

>
> - Dan
>
>
> On 3 June 2015 at 05:46, Theo Hultberg <t...@iconara.net> wrote:
>
> > Henry: We run Kafka on the old and trusty m1.xlarge. We avoid EBS
> > completely, it's network storage that pretends to be local and when the
> > network, which is AWS' weak spot, acts up EBS is a big liability. It's also
> > slow and expensive.
> >
> > Others: Thanks for sharing your experience with the d2's. We have been
> > considering them for Kafka, but now it sounds like we should wait with that
> > until they're fixed.
> >
> > T#
> >
> > On Wed, Jun 3, 2015 at 1:26 AM, Henry Cai <h...@pinterest.com.invalid>
> > wrote:
> >
> > > Steven,
> > >
> > > Do you have the AWS case # (or the Ubuntu bug/case #) when you hit that
> > > kernel panic issue?
> > >
> > > Our company will still be running on AMI image 12.04 for a while, I will
> > > see whether the fix was also ported onto Ubuntu 12.04
> > >
> > > On Tue, Jun 2, 2015 at 2:53 PM, Steven Wu <stevenz...@gmail.com> wrote:
> > >
> > > > now I remember we had same kernel panic issue in the first week of D2
> > > > rolling-out. then AWS fixed it and we haven't seen any issue since. try
> > > > Ubuntu 14.04 and see if it resolves your remaining kernel/instability
> > > > issue.
> > > >
> > > > On Tue, Jun 2, 2015 at 2:30 PM, Wes Chow <w...@chartbeat.com> wrote:
> > > >
> > > >>
> > > >> Daniel Nelson <daniel.nel...@vungle.com>
> > > >> June 2, 2015 at 4:39 PM
> > > >>
> > > >> On Jun 2, 2015, at 1:22 PM, Steven Wu <stevenz...@gmail.com>
> > > >> <stevenz...@gmail.com> wrote:
> > > >>
> > > >> can you elaborate what kind of instability you have encountered?
> > > >>
> > > >> We have seen the nodes become completely non-responsive. Usually they
> > > >> get rebooted automatically after 10-20 minutes, but occasionally they
> > > >> get stuck for days in a state where they cannot be rebooted via the
> > > >> Amazon APIs.
> > > >>
> > > >>
> > > >> Same here. It was worse right after d2 launch. We had 6 out of 9 servers
> > > >> die within 10 hours after spinning them up. Amazon rolled out a fix, but
> > > >> we're still seeing similar issues, though not nearly as bad. The first fix
> > > >> was for something network related, and apparently sending lots of data
> > > >> through the instances caused a kernel panic on the host. We have no
> > > >> information yet about the current issue.
> > > >>
> > > >> Wes
> > > >>
> > > >> Steven Wu <stevenz...@gmail.com>
> > > >> June 2, 2015 at 4:22 PM
> > > >> Wes/Daniel,
> > > >>
> > > >> can you elaborate what kind of instability you have encountered?
> > > >>
> > > >> we are on Ubuntu 14.04.2 and haven't encountered any issues so far. in
> > > >> the announcement, they did mention using Ubuntu 14.04 for better disk
> > > >> throughput. not sure whether 14.04 also addresses any instability issue
> > > >> you encountered or not.
> > > >>
> > > >> Thanks,
> > > >> Steven
> > > >>
> > > >> In order to ensure the best disk throughput performance from your D2
> > > >> instances on Linux, we recommend that you use the most recent version of
> > > >> the Amazon Linux AMI, or another Linux AMI with a kernel version of 3.8
> > > >> or later.
> > > >> The D2 instances provide the best disk performance when you use a Linux
> > > >> kernel that supports Persistent Grants – an extension to the Xen block
> > > >> ring protocol that significantly improves disk throughput and
> > > >> scalability. The following Linux AMIs support this feature:
> > > >>
> > > >> - Amazon Linux AMI 2015.03 (HVM)
> > > >> - Ubuntu Server 14.04 LTS (HVM)
> > > >> - Red Hat Enterprise Linux 7.1 (HVM)
> > > >> - SUSE Linux Enterprise Server 12 (HVM)
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> Daniel Nelson <daniel.nel...@vungle.com>
> > > >> June 2, 2015 at 2:42 PM
> > > >>
> > > >> Do you have any workarounds for the d2 issues? We’ve been using them for
> > > >> our Kafkas too, and ran into the instability. We’re on Ubuntu 12.04 and
> > > >> plan to try on 14.04 with the latest HWE to see if that helps any.
> > > >>
> > > >> Thanks!
> > > >> Wes Chow <w...@chartbeat.com>
> > > >> June 2, 2015 at 1:39 PM
> > > >>
> > > >> We have run d2 instances with Kafka. They're currently unstable -- Amazon
> > > >> confirmed a host issue with d2 instances that gets tickled by a Kafka
> > > >> workload yesterday. Otherwise, it seems the d2 instance type is ideal as
> > > >> it gets an enormous amount of disk throughput and you'll likely be
> > > >> network bottlenecked.
> > > >>
> > > >> Wes
> > > >>
> > > >>
> > > >> Steven Wu <stevenz...@gmail.com>
> > > >> June 2, 2015 at 1:07 PM
> > > >> EBS (network attached storage) has got a lot better over the last a few
> > > >> years. we don't quite trust it for kafka workload.
> > > >>
> > > >> At Netflix, we were going with the new d2 instance type (HDD). our
> > > >> perf/load testing shows it satisfy our workload. SSD is better in latency
> > > >> curve but pretty comparable in terms of throughput. we can use the extra
> > > >> space from HDD for longer retention period.
> > > >>
> > > >> On Tue, Jun 2, 2015 at 9:37 AM, Henry Cai <h...@pinterest.com.invalid>
> > > >> <h...@pinterest.com.invalid>
> > > >>
> > > >>
> > > >
> > >
> >
>

--
Peter Vandenabeele
http://www.allthingsdata.io
http://www.linkedin.com/in/petervandenabeele
https://twitter.com/peter_v
gsm: +32-478-27.40.69
e-mail: pe...@vandenabeele.com
skype: peter_v_be
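
For anyone wanting to try the MTU workaround mentioned at the top of this thread (1500 in Dan's case, 1492 in Peter's), here is a rough Python sketch, not something anyone in the thread posted. It assumes a Linux host where the interface MTU is readable at /sys/class/net/<interface>/mtu and where iproute2's `ip link set dev <interface> mtu <value>` is available. The function names and the interface name "eth0" are made up for illustration. The sketch only builds the command; it does not run it.

```python
from pathlib import Path

# Target MTU: 1500 is the value Dan reports using; Peter reports 1492.
SAFE_MTU = 1500


def read_mtu(interface, sysfs_root="/sys/class/net"):
    """Return the current MTU of a Linux interface, as read from sysfs."""
    return int((Path(sysfs_root) / interface / "mtu").read_text().strip())


def mtu_fix_command(interface, current_mtu, target_mtu=SAFE_MTU):
    """Return the `ip link` argv that lowers the MTU, or None if no change is needed."""
    if current_mtu <= target_mtu:
        return None
    return ["ip", "link", "set", "dev", interface, "mtu", str(target_mtu)]
```

A call such as `mtu_fix_command("eth0", read_mtu("eth0"))` yields the argument list to run as root (for example via `subprocess.run`), or None when the interface is already at or below the target. A change made this way does not survive a reboot, so it would also need to go into whatever network configuration the distribution uses.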