We fixed two serious Snappy bugs for 0.8.2.2, so you may want to check that.
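Cliff's workaround below boils down to changing the producer's compression setting from snappy to gzip. Purely as an illustration (not code from the thread), a minimal sketch with the Java producer that ships with 0.8.2 could look like the following; the broker list and topic name are placeholders, and note that the older Scala producer takes "compression.codec" rather than "compression.type":

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class GzipProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder broker list; replace with the real cluster.
            props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
            // Cliff's workaround: compress with gzip instead of snappy.
            props.put("compression.type", "gzip");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            // "test-topic" is a placeholder topic name.
            producer.send(new ProducerRecord<String, String>("test-topic", "hello"));
            producer.close();
        }
    }

Changing only the compression type keeps the rest of the producer behaviour identical, which makes it easy to compare disk and CPU usage before and after, as Cliff did.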
On Thu, Jan 21, 2016 at 8:16 AM, Cliff Rhyne <crh...@signal.co> wrote:
> Hi Leo,
>
> I'm not sure if this is the issue you're encountering, but this is what we
> found when we went from 0.8.1.1 to 0.8.2.1.
>
> Snappy compression didn't work as expected. Something in the library broke
> the compression of message batches, so each message was compressed
> individually (which for us caused a lot of overhead). Disk usage went way
> up and CPU usage went up incrementally (still under 1%). I didn't monitor
> latency; it was well within the tolerances of our system. We resolved the
> issue by switching our compression to gzip.
>
> This issue is supposedly fixed in 0.9.0.0, but we haven't verified it yet.
>
> Cliff
>
> On Thu, Jan 21, 2016 at 4:04 AM, Clelio De Souza <cleli...@gmail.com> wrote:
> > Hi all,
> >
> > We are using Kafka in production and we have been facing some performance
> > degradation of the cluster, apparently once the cluster gets a bit "old".
> >
> > Our production cluster has been up and running since 31/12/2015. Our
> > performance tests measure a full round trip of TCP packets plus Kafka
> > production and consumption of the data (3 hops in total for every single
> > TCP packet being sent, persisted and consumed at the other end). The
> > results for the production cluster show a latency of ~130ms to 200ms.
> >
> > In our Test environment we have the very same software and specification
> > on AWS instances, i.e. the Test environment is a mirror of Prod. The Kafka
> > cluster has been running in Test since 18/12/2015, and the same
> > performance tests (as described above) show an increase in latency to
> > ~800ms to 1000ms.
> >
> > We have just recently set up a fresh Kafka cluster (on 18/01/2016) to try
> > to get to the bottom of this performance degradation problem, and on the
> > new Kafka cluster deployed in Test, replacing the original Test Kafka
> > cluster, we found a very small latency of ~10ms to 15ms.
> >
> > We are using Kafka version 0.8.2.1 in all the environments mentioned
> > above, and the same cluster configuration has been set up on all of them:
> > 3 brokers on m3.xlarge AWS instances. The amount of data and the Kafka
> > topics are roughly the same across those environments, so the performance
> > degradation does not seem to be directly related to the amount of data in
> > the cluster. We suspect something running inside the Kafka cluster, such
> > as repartitioning or log retention (even though our topics are set up to
> > retain data for ~2 years and nowhere near that time has elapsed), may be
> > the cause.
> >
> > The Kafka broker config is below. If anyone could shed some light on what
> > may be causing the performance degradation of our Kafka cluster, it would
> > be great and very much appreciated.
> >
> > Thanks,
> > Leo
> >
> > --------------------
> >
> > # Licensed to the Apache Software Foundation (ASF) under one or more
> > # contributor license agreements. See the NOTICE file distributed with
> > # this work for additional information regarding copyright ownership.
> > # The ASF licenses this file to You under the Apache License, Version 2.0
> > # (the "License"); you may not use this file except in compliance with
> > # the License.
> > # You may obtain a copy of the License at
> > #
> > # http://www.apache.org/licenses/LICENSE-2.0
> > #
> > # Unless required by applicable law or agreed to in writing, software
> > # distributed under the License is distributed on an "AS IS" BASIS,
> > # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > # See the License for the specific language governing permissions and
> > # limitations under the License.
> >
> > # see kafka.server.KafkaConfig for additional details and defaults
> >
> > ############################# Server Basics #############################
> >
> > # The id of the broker. This must be set to a unique integer for each
> > # broker.
> > broker.id=<broker_num>
> >
> > ############################# Socket Server Settings #############################
> >
> > # The port the socket server listens on
> > port=9092
> >
> > # Hostname the broker will bind to. If not set, the server will bind to
> > # all interfaces
> > #host.name=localhost
> >
> > # Hostname the broker will advertise to producers and consumers. If not
> > # set, it uses the value for "host.name" if configured. Otherwise, it will
> > # use the value returned from java.net.InetAddress.getCanonicalHostName().
> > #advertised.host.name=<hostname routable by clients>
> >
> > # The port to publish to ZooKeeper for clients to use. If this is not set,
> > # it will publish the same port that the broker binds to.
> > #advertised.port=<port accessible by clients>
> >
> > # The number of threads handling network requests
> > num.network.threads=3
> >
> > # The number of threads doing disk I/O
> > num.io.threads=8
> >
> > # The send buffer (SO_SNDBUF) used by the socket server
> > socket.send.buffer.bytes=102400
> >
> > # The receive buffer (SO_RCVBUF) used by the socket server
> > socket.receive.buffer.bytes=102400
> >
> > # The maximum size of a request that the socket server will accept
> > # (protection against OOM)
> > socket.request.max.bytes=104857600
> >
> > ############################# Log Basics #############################
> >
> > # A comma separated list of directories under which to store log files
> > log.dirs=/data/kafka/logs
> >
> > # The default number of log partitions per topic. More partitions allow
> > # greater parallelism for consumption, but this will also result in more
> > # files across the brokers.
> > num.partitions=8
> >
> > # The number of threads per data directory to be used for log recovery at
> > # startup and flushing at shutdown. This value is recommended to be
> > # increased for installations with data dirs located in a RAID array.
> > num.recovery.threads.per.data.dir=1
> >
> > ############################# Log Flush Policy #############################
> >
> > # Messages are immediately written to the filesystem but by default we
> > # only fsync() to sync the OS cache lazily. The following configurations
> > # control the flush of data to disk. There are a few important trade-offs
> > # here:
> > #    1. Durability: Unflushed data may be lost if you are not using
> > #       replication.
> > #    2. Latency: Very large flush intervals may lead to latency spikes
> > #       when the flush does occur as there will be a lot of data to flush.
> > #    3. Throughput: The flush is generally the most expensive operation,
> > #       and a small flush interval may lead to excessive seeks.
> > # The settings below allow one to configure the flush policy to flush data
> > # after a period of time or every N messages (or both).
> > # This can be done globally and overridden on a per-topic basis.
> >
> > # The number of messages to accept before forcing a flush of data to disk
> > #log.flush.interval.messages=10000
> >
> > # The maximum amount of time a message can sit in a log before we force a
> > # flush
> > #log.flush.interval.ms=1000
> >
> > ############################# Log Retention Policy #############################
> >
> > # The following configurations control the disposal of log segments. The
> > # policy can be set to delete segments after a period of time, or after a
> > # given size has accumulated. A segment will be deleted whenever *either*
> > # of these criteria are met. Deletion always happens from the end of the
> > # log.
> >
> > # The minimum age of a log file to be eligible for deletion.
> > # Failsafe is we don't lose any messages for 20+ years; topics should
> > # be configured individually.
> > log.retention.hours=200000
> >
> > # A size-based retention policy for logs. Segments are pruned from the log
> > # as long as the remaining segments don't drop below log.retention.bytes.
> > #log.retention.bytes=1073741824
> >
> > # The maximum size of a log segment file. When this size is reached a new
> > # log segment will be created.
> > log.segment.bytes=1073741824
> >
> > # The interval at which log segments are checked to see if they can be
> > # deleted according to the retention policies
> > log.retention.check.interval.ms=300000
> >
> > # By default the log cleaner is disabled and the log retention policy will
> > # default to just delete segments after their retention expires.
> > # If log.cleaner.enable=true is set the cleaner will be enabled and
> > # individual logs can then be marked for log compaction.
> > log.cleaner.enable=false
> >
> > default.replication.factor=3
> >
> > auto.create.topics.enable=true
> >
> > controlled.shutdown.enable=true
> >
> > delete.topic.enable=true
> >
> > ############################# Zookeeper #############################
> >
> > # Zookeeper connection string (see zookeeper docs for details).
> > # This is a comma separated list of host:port pairs, each corresponding to
> > # a zk server, e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
> > # You can also append an optional chroot string to the urls to specify the
> > # root directory for all kafka znodes.
> > zookeeper.connect=<zk1-address>:2181,<zk2-address>:2181,<zk3-address>:2181
> >
> > # Timeout in ms for connecting to zookeeper
> > zookeeper.connection.timeout.ms=6000
>
> --
> Cliff Rhyne
> Software Engineering Lead
> e: crh...@signal.co
> signal.co
> ________________________
>
> Cut Through the Noise
>
> This e-mail and any files transmitted with it are for the sole use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized use of this email is strictly prohibited.
> ©2015 Signal. All rights reserved.
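Leo's latency figures come from timing a full produce-and-consume round trip through the cluster. Purely as an illustration (not code from the thread), a self-contained sketch of that kind of measurement against an 0.8.2 cluster might look like the following; the topic name, broker list, ZooKeeper address and group id are placeholders, and the old high-level consumer is used because the new Java consumer only arrives in 0.9:

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class RoundTripLatencySketch {
        public static void main(String[] args) throws Exception {
            String topic = "latency-check"; // placeholder topic

            // Start the consumer first so it is subscribed before the message is produced.
            Properties cprops = new Properties();
            cprops.put("zookeeper.connect", "zk1:2181");   // placeholder ZooKeeper address
            cprops.put("group.id", "latency-check-group"); // placeholder group id
            ConsumerConnector connector =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(cprops));
            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                    connector.createMessageStreams(Collections.singletonMap(topic, 1));
            ConsumerIterator<byte[], byte[]> it = streams.get(topic).get(0).iterator();
            Thread.sleep(5000); // crude: give the consumer time to finish rebalancing

            // Producer configured like the application under test (compression, acks, etc.).
            Properties pprops = new Properties();
            pprops.put("bootstrap.servers", "broker1:9092"); // placeholder broker
            pprops.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            pprops.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pprops);

            long start = System.nanoTime();
            producer.send(new ProducerRecord<byte[], byte[]>(topic, "ping".getBytes())).get();
            it.next(); // blocks until the message comes back from the broker
            long elapsedMs = (System.nanoTime() - start) / 1000000L;
            System.out.println("produce -> consume round trip: " + elapsedMs + " ms");

            producer.close();
            connector.shutdown();
        }
    }

Running the same sketch against the degraded Test cluster and the freshly built one should make the kind of comparison Leo describes (~800ms to 1000ms versus ~10ms to 15ms) repeatable with everything else held constant.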