We have a Hadoop cluster with 32 data nodes, three of which also act as Flume agents and receive all incoming Flume data. We’re using round-robin DNS entries to spread incoming Flume data from various external systems across those three agents.
Historically, the three data nodes that run the Flume agents have always held many more blocks than the other data nodes, so I’m wondering what the best approach is for placing Flume agents within a cluster. Should every data node in the cluster run a Flume agent, or should the agents be placed on a name node or some other non-data node? Thanks for any guidance.
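
For what it's worth, my working assumption about the cause is that the default HDFS placement policy writes the first replica of each block to the local data node when the writing client runs on a data node, so the three agent nodes would receive a copy of every block they ingest. Below is a rough Python sketch of that assumption, not a model of our actual cluster: the function name, the node numbering, the block count, and the replication factor of 3 are all made up for illustration, and rack awareness is ignored (the remaining replicas are simply placed at random).

```python
import random
from collections import Counter


def simulate_block_placement(num_nodes=32, agents=(0, 1, 2),
                             num_blocks=30000, replication=3, seed=0):
    """Count replicas per node under a simplified writer-local policy.

    Hypothetical model: the first replica goes to the writing agent's own
    node, the remaining replicas go to other nodes chosen at random.
    """
    rng = random.Random(seed)
    counts = Counter()
    nodes = list(range(num_nodes))
    for i in range(num_blocks):
        # Round-robin DNS is approximated by cycling through the agents.
        writer = agents[i % len(agents)]
        others = [n for n in nodes if n != writer]
        replicas = [writer] + rng.sample(others, replication - 1)
        for node in replicas:
            counts[node] += 1
    return counts


counts = simulate_block_placement()
agent_nodes = (0, 1, 2)
other_nodes = [n for n in range(32) if n not in agent_nodes]
agent_avg = sum(counts[n] for n in agent_nodes) / len(agent_nodes)
other_avg = sum(counts[n] for n in other_nodes) / len(other_nodes)
print(f"average replicas per agent node: {agent_avg:.0f}")
print(f"average replicas per other node: {other_avg:.0f}")
```

If that assumption holds, running the agents on non-data nodes should let HDFS choose all replica locations itself, while running an agent on every data node should spread the local first replicas evenly.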