I am using 0.8, and I figured out the issue: I was using only one thread in each producer. That is why performance degraded so sharply as the number of topics and partitions increased. The number of producer threads turns out to be a very important parameter, but it is not mentioned in the documentation.
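
As a rough illustration of the idea (a sketch, not the code used in this test), the snippet below spreads sends over several worker threads instead of one. The names `send_all`, `fake_send`, and `num_threads` are hypothetical; `send` stands in for whatever call hands a single message to the producer, and it is assumed to be thread-safe.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def send_all(messages, send, num_threads=4):
    """Send (topic, payload) pairs using num_threads worker threads.

    `send` is a hypothetical stand-in for the real producer call and
    is assumed to be safe to call from several threads at once.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # list() waits for every send and re-raises the first worker error.
        list(pool.map(lambda m: send(*m), messages))


# Stand-in for a real producer: records what was "sent".
sent = []
sent_lock = threading.Lock()


def fake_send(topic, payload):
    with sent_lock:
        sent.append((topic, payload))


# 30 messages spread over 3 topics, sent by 4 threads.
messages = [("topic-%d" % (i % 3), "msg-%d" % i) for i in range(30)]
send_all(messages, fake_send, num_threads=4)
```

With more than one sending thread, messages may reach the producer in a different order than they appear in `messages`, so this only fits workloads that do not depend on strict send order.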
Regards, Libo