Thanks @Jay for suggesting changes to batch.size and linger.ms. I tried them out. It appears one can do better than the default batch.size for this synchronous batch mode with flush().
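For anyone unfamiliar with the synchronous batch mode mentioned above, here is a minimal Java sketch of the pattern as I understand it. Records are sent asynchronously, and flush() is called after every flush interval's worth of bytes, so each flush acts as a synchronous batch boundary. BatchSender and RecordingSender are hypothetical stand-ins, not the Kafka producer API. They let the bookkeeping run without a broker. With a real producer you would call its send() and flush() instead.

```java
import java.util.ArrayList;
import java.util.List;

public class SyncBatchDemo {
    // HYPOTHETICAL stand-in for a producer with an asynchronous send()
    // and a blocking flush(); this is not the Kafka client API.
    interface BatchSender {
        void send(byte[] payload); // enqueue one record
        void flush();              // wait until everything enqueued so far is acknowledged
    }

    // In-memory stand-in that only records how many bytes each flush covered.
    static class RecordingSender implements BatchSender {
        final List<Long> bytesPerFlush = new ArrayList<>();
        private long pending = 0;

        @Override
        public void send(byte[] payload) { pending += payload.length; }

        @Override
        public void flush() {
            bytesPerFlush.add(pending);
            pending = 0;
        }
    }

    // Sends `count` records of `recordSize` bytes and calls flush() whenever at
    // least `flushIntervalBytes` have been sent since the previous flush.
    static void run(BatchSender sender, int count, int recordSize, long flushIntervalBytes) {
        byte[] record = new byte[recordSize];
        long sinceFlush = 0;
        for (int i = 0; i < count; i++) {
            sender.send(record);
            sinceFlush += recordSize;
            if (sinceFlush >= flushIntervalBytes) {
                sender.flush(); // the "synchronous" point: wait for the whole batch
                sinceFlush = 0;
            }
        }
        if (sinceFlush > 0) {
            sender.flush(); // flush the tail that did not fill a whole interval
        }
    }

    public static void main(String[] args) {
        RecordingSender sender = new RecordingSender();
        // 10,000 records of 1 KB with a 4 MB (4 * 1024 * 1024 bytes) flush interval
        run(sender, 10_000, 1024, 4L * 1024 * 1024);
        System.out.println(sender.bytesPerFlush); // prints [4194304, 4194304, 1851392]
    }
}
```

The flush interval in the measurements below corresponds to flushIntervalBytes here. It is the second knob, alongside batch.size.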
These new measurements give more "rational" numbers, which I can reason about and use to infer some rules of thumb (for batch-sync mode using flush()). Here are my observations:
- The new producer API does much better than the older one for a *single-threaded* producer. The best I saw with the old API was ~68 MB/s; with the new one it was ~140 MB/s.
- A higher linger.ms sometimes helps performance and at other times hurts it. There is no simple rule here. It is best to try it out and decide whether the default is good for your case.
- For a single-threaded producer, to get the most throughput, set batch.size = (total bytes between flushes / partition count).
- Running more single-threaded producer processes helped, up to about 3-4 processes.
- One producer writing to a single partition is faster than one producer writing to multiple partitions.
- The number of bytes between two explicit flushes (i.e. the flush interval) had a much smaller impact than batch.size. There is something to be learnt here. My speculation is that this might change with smaller flush intervals. Having two knobs (batch.size and the flush interval) is a bit confusing for end users trying to tune the producer. It would be good to find out whether some simple guidance is feasible.
- Apart from some inconveniences mentioned previously, I feel flush() could be used as a way to simulate sync-batch behavior.

Producer Limits:
- The producer is able to exceed 1 Gigabit Ethernet capacity, but not 10 Gigabit Ethernet. It does not appear to go beyond ~460 MB/s. I verified that my test machines are able to achieve 1 GB/s.

Todo:
- Need to try a multi-threaded producer.
- I did some testing of the consumer APIs as well, using the 0.8.1 consumer-perf tool. I wasn't able to push it beyond 30 MB/s. When producers ran in parallel, it fell to under 10 MB/s. I need to dig deeper and will report back. Suggestions welcome.

Measurements:
- See attachment.
- Also available on pastebin: http://pastebin.com/p3kSAjy6
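Here is a minimal Java sketch of the batch.size rule of thumb above. The helper names are hypothetical. acks and batch.size are real producer config keys, but a working producer also needs bootstrap.servers and serializer settings, which are left out here. The sketch assumes 1 MB = 1024 * 1024 bytes, which may not match exactly how the flush intervals were counted in my tests. The rule only held for the single-threaded, flush()-based setup. With 4 or 8 partitions, setting batch.size to the full flush interval (or to 10-20 MB) dropped throughput to single-digit MB/s.

```java
import java.util.Properties;

public class BatchSizeRule {
    // Rule of thumb (single-threaded producer, flush()-based sync batching):
    // batch.size = bytes between flushes / number of partitions written to.
    static int batchSizeFor(long flushIntervalBytes, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be > 0");
        }
        long size = flushIntervalBytes / partitionCount;
        return (int) Math.min(size, Integer.MAX_VALUE); // batch.size is an int config
    }

    // Partial config mirroring the test settings (acks=1, linger.ms left at default).
    // bootstrap.servers and serializers still have to be added by the caller.
    static Properties producerProps(long flushIntervalBytes, int partitionCount) {
        Properties props = new Properties();
        props.put("acks", "1");
        props.put("batch.size", Integer.toString(batchSizeFor(flushIntervalBytes, partitionCount)));
        return props;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024; // assumption: MB means 2^20 bytes
        for (int parts : new int[] {1, 4, 8}) {
            System.out.println(parts + " partition(s), 4MB flush interval -> batch.size="
                + batchSizeFor(4 * mb, parts));
        }
    }
}
```

For a 4 MB flush interval this gives 4194304, 1048576 and 524288 bytes for 1, 4 and 8 partitions respectively.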
Settings: acks=1, single broker, single-threaded producer (new API)
Machines: 32 cores, 256 GB RAM, 10 GigE, 6x 15,000 rpm disks
Throughput in MB/s. "default" = producer default; flushInt = bytes between flush() calls.

1 partition
linger.ms  batch.size             FlushInt=4MB  FlushInt=8MB  FlushInt=16MB
default    default                57            54            52
1s         default                57            61            59
default    flushInt/#partitions   136           125           116
1s         flushInt/#partitions   92            77            56
default    = flushInt             140           123           124
default    10MB                   140           123           124
default    20MB                   31            30            42

4 partitions
linger.ms  batch.size             FlushInt=4MB  FlushInt=8MB  FlushInt=16MB
default    default                95            82            80
1s         default                85            83            85
default    flushInt/#partitions   127           133           90
1s         flushInt/#partitions   94            100           101
default    = flushInt             60            8             6
default    10MB                   7             7             7
default    20MB                   6             6             5

8 partitions
linger.ms  batch.size             FlushInt=4MB  FlushInt=8MB  FlushInt=16MB
default    default                100           89            96
1s         default                105           97            98
default    flushInt/#partitions   114           128           78
1s         flushInt/#partitions   95            94            102
default    = flushInt             7             8             8
default    10MB                   7             8             7
default    20MB                   6             6             6

With multiple producers (each single-threaded), aggregate throughput in MB/s:
Partitions  1 process  3 processes  4 processes
1           136        344          290
4           127        345          372
8           128        304          460
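To put numbers on the scaling observation, here is a small Java sketch. It derives per-process throughput and the speedup over a single process from the aggregate MB/s figures in the multi-producer table above. The figures are copied from the measurements. The helper names are hypothetical.

```java
public class ProducerScaling {
    // Speedup of N processes relative to one process: total / single-process total.
    static double speedup(double totalMBps, double singleMBps) {
        return totalMBps / singleMBps;
    }

    // Average throughput contributed by each process: total / N.
    static double perProcess(double totalMBps, int processes) {
        return totalMBps / processes;
    }

    public static void main(String[] args) {
        int[] partitions = {1, 4, 8};
        int[] processes = {1, 3, 4};
        // Aggregate MB/s from the multi-producer measurements (rows = partitions).
        double[][] totals = {
            {136, 344, 290},
            {127, 345, 372},
            {128, 304, 460},
        };
        for (int p = 0; p < partitions.length; p++) {
            for (int i = 0; i < processes.length; i++) {
                System.out.printf("%d partition(s), %d process(es): %.0f MB/s total, %.1f MB/s each, %.2fx%n",
                    partitions[p], processes[i], totals[p][i],
                    perProcess(totals[p][i], processes[i]),
                    speedup(totals[p][i], totals[p][0]));
            }
        }
    }
}
```

With 1 partition, 3 processes give about a 2.5x speedup (344/136), but 4 processes drop back to about 2.1x (290/136). With 8 partitions, 4 processes reach about 3.6x (460/128).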