Thanks Jay, that's exactly what I was looking for.
On 25 June 2014 04:18, Jay Kreps <jay.kr...@gmail.com> wrote: > The primary relevant factor here is the fsync interval. Kafka's replication > guarantees do not require fsyncing every message, so the reason for doing > so is to handle correlated power loss (a pretty uncommon failure in a real > data center). Replication will handle most other failure modes with much > much lower overhead. > > With a lax fsync interval you can have LOTS of partitions with very little > impact. The OS will do a good job of merging physical writes and scheduling > them in an order that is convenient. > > On the other hand if you fsync on every message you will see a pretty big > drop off with more than one partition per disk as you immediately incur a > 5-10ms seek on each write. > > For a policy in between "on every message" and "when the os feels like it" > you will see performance somewhere in between. > > -Jay > > > On Tue, Jun 24, 2014 at 4:00 AM, Paul Mackles <pmack...@adobe.com> wrote: > > > Its probably best to run some tests that simulate your usage patterns. I > > think a lot of it will be determined by how effectively you are able to > > utilize the OS file cache in which case you could have many more > > partitions. Its a delicate balance but you definitely want to err on the > > side of having more partitions. Keep in mind that you are only able to > > parallelize down to the partition level so if you have only have 2 > > partitions, you can only have 2 consumers. Depending on your volume, that > > might not be enough. > > > > On 6/24/14 6:44 AM, "Daniel Compton" <d...@danielcompton.net> wrote: > > > > >Good point. We've only got two disks per node and two topics so I was > > >planning to have one disk/partition. > > > > > >Our workload is very write heavy so I'm mostly concerned about write > > >throughput. Will we get write speed improvements by sticking to 1 > > >partition/disk or will the difference between 1 and 3 partitions/node be > > >negligible? > > > > > >> On 24/06/2014, at 9:42 pm, Paul Mackles <pmack...@adobe.com> wrote: > > >> > > >> You'll want to account for the number of disks per node. Normally, > > >> partitions are spread across multiple disks. Even more important, the > OS > > >> file cache reduces the amount of seeking provided that you are reading > > >> mostly sequentially and your consumers are keeping up. > > >> > > >>> On 6/24/14 3:58 AM, "Daniel Compton" <d...@danielcompton.net> wrote: > > >>> > > >>> I¹ve been reading the Kafka docs and one thing that I¹m having > trouble > > >>> understanding is how partitions affect sequential disk IO. One of the > > >>> reasons Kafka is so fast is that you can do lots of sequential IO > with > > >>> read-ahead cache and all of that goodness. However, if your broker is > > >>> responsible for say 20 partitions, then won¹t the disk be seeking to > 20 > > >>> different spots for its writes and reads? I thought that maybe > letting > > >>> the OS handle fsync would make this less of an issue but it still > seems > > >>> like it could be a problem. > > >>> > > >>> In our particular situation, we are going to have 6 brokers, 3 in > each > > >>> DC, with mirror maker replication from the secondary DC to the > primary > > >>> DC. We aren¹t likely to need to add more nodes for a while so would > it > > >>>be > > >>> faster to have 1 partition/node than say 3-4/node to minimise the > seek > > >>> times on disk? > > >>> > > >>> Are my assumptions correct or is this not an issue in practice? There > > >>>are > > >>> some nice things about having more partitions like rebalancing more > > >>> evenly if we lose a broker but we don¹t want to make things > > >>>significantly > > >>> slower to get this. > > >>> > > >>> Thanks, Daniel. > > >> > > > > >