[ https://issues.apache.org/jira/browse/HADOOP-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeff Lord resolved HADOOP-9198.
-------------------------------
    Resolution: Fixed

This is a Flume issue and will be moved to that JIRA accordingly.

> Update Flume Wiki and User Guide to provide clearer explanation of BatchSize,
> ChannelCapacity and ChannelTransactionCapacity properties.
> ----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-9198
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9198
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: documentation
>            Reporter: Jeff Lord
>
> It would be good if we refined our wiki and user guide to explain the following more clearly:
>
> 1) Batch Size
> 1.a) When configured by client code using the flume-core-sdk to send events to a Flume Avro source:
> The Flume client SDK has an appendBatch method, which takes a list of events and sends them to the source as a batch. The batch size is the number of events passed to the source at one time (see the Java sketch at the end of this description).
> 1.b) When set as a parameter on the HDFS sink (or other sinks that support a BatchSize parameter):
> This is the number of events written to a file before it is flushed to HDFS.
>
> 2) Channel settings
> 2.a) Channel Capacity:
> This is the maximum number of events the channel can hold.
> 2.b) Channel Transaction Capacity:
> This is the maximum number of events the channel stores per transaction.
>
> How will setting these parameters to different values affect throughput and latency in the event flow?
> In general you will see better throughput using the memory channel as opposed to the file channel, at the cost of durability.
> The channel capacity needs to be sized large enough to hold as many events as upstream agents will add to it. In an ideal flow the sink drains events from the channel faster than the source adds them.
> The channel transaction capacity must be smaller than the channel capacity.
> e.g. If your channel capacity is set to 10000, then the channel transaction capacity should be set to something like 100 (see the sample agent configuration at the end of this description).
>
> Specifically, if we have clients generating events at varying rates, i.e. some generating thousands of events/sec while others generate at a much slower rate, what effect will different values of these params have on those clients?
> The transaction capacity is what throttles, or limits, how many events the source can put into the channel per transaction. The right value will vary depending on how many tiers of agents/collectors you have set up, but in general it should equal whatever batch size you have set in your client.
> With regard to the HDFS batch size, the larger your batch size, the better the performance will be. However, keep in mind that if a transaction fails, the entire transaction will be replayed, which can result in duplicate events downstream.
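>
> To make 1.a concrete, here is a minimal sketch of client-side batching with the Flume client SDK (RpcClientFactory and RpcClient.appendBatch). The host, port, event bodies, and batch size of 100 are illustrative assumptions, not values from this issue:
>
>     import java.nio.charset.StandardCharsets;
>     import java.util.ArrayList;
>     import java.util.List;
>     import org.apache.flume.Event;
>     import org.apache.flume.EventDeliveryException;
>     import org.apache.flume.api.RpcClient;
>     import org.apache.flume.api.RpcClientFactory;
>     import org.apache.flume.event.EventBuilder;
>
>     public class BatchingClient {
>       public static void main(String[] args) throws EventDeliveryException {
>         // Connect to a Flume Avro source; host and port are assumptions for this sketch.
>         RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
>         try {
>           List<Event> batch = new ArrayList<Event>();
>           for (int i = 0; i < 1000; i++) {
>             batch.add(EventBuilder.withBody("event " + i, StandardCharsets.UTF_8));
>             if (batch.size() == 100) {
>               // appendBatch hands the whole list to the source in one call;
>               // the list size is the client-side batch size described in 1.a.
>               client.appendBatch(batch);
>               batch.clear();
>             }
>           }
>           if (!batch.isEmpty()) {
>             client.appendBatch(batch); // flush the remainder
>           }
>         } finally {
>           client.close();
>         }
>       }
>     }
>
> Matching the list size passed to appendBatch to the downstream channel transaction capacity keeps each source put within a single channel transaction, per the guidance above.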
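>
> And a sample agent configuration tying these settings together. The agent and component names (a1, r1, c1, k1), the port, and the HDFS path are placeholders; the property keys (capacity, transactionCapacity, hdfs.batchSize) are the standard Flume memory-channel and HDFS-sink properties:
>
>     a1.sources = r1
>     a1.channels = c1
>     a1.sinks = k1
>
>     a1.sources.r1.type = avro
>     a1.sources.r1.bind = 0.0.0.0
>     a1.sources.r1.port = 41414
>     a1.sources.r1.channels = c1
>
>     # Capacity must exceed transaction capacity (10000 vs 100, as in the e.g. above);
>     # transactionCapacity should match the client batch size.
>     a1.channels.c1.type = memory
>     a1.channels.c1.capacity = 10000
>     a1.channels.c1.transactionCapacity = 100
>
>     a1.sinks.k1.type = hdfs
>     a1.sinks.k1.channel = c1
>     a1.sinks.k1.hdfs.path = /flume/events
>     # Number of events written to a file before it is flushed to HDFS (1.b above).
>     a1.sinks.k1.hdfs.batchSize = 100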