I think we're talking at cross purposes here. First, let me re-state the 
problem at a high level. Hopefully this will be clearer:

I want to have two or more Flume agents running on different hosts reading 
events from the same source (RabbitMQ) but only writing the event to the final 
sink (hdfs) once. Thus, if someone kicks out a cable and one of the Flume hosts 
dies, the other(s) should take over seamlessly without the loss of any events. 
I thought that failover and load-balancer sinkgroups were created to solve this 
kind of problem, but the documentation only covers the case where all the sinks 
in the sinkgroup are on one host, and this does not give me the redundancy I 
need.

This is how I tried to solve the problem. I set up two hosts running Flume with 
the same configuration. Thinking that a failover sinkgroup was the right 
approach to tackling my problem, I created a sinkgroup and put two hdfs sinks 
in it, hdfsSink-1 and hdfsSink-2. The idea was that hdfsSink-1 would be on 
Flume host A and hdfsSink-2 would be on Flume host B. Then, events arrive on 
both host A and host B, and host A would write the event to hdfsSink-2, sending 
it across the network to Flume Host B, while host B would write the same event 
to hdfsSink-2 which is local to it.  Both agents should, in theory, write the 
event to the same sink on Flume Host B. Yes, I know that this would still 
duplicates the events, but I figured I would worry about that later. However, 
this has not worked as I expected. Flume Host A does not pass anything to Flume 
Host B. 

I need to know if the approach I've adopted, sinkgroups and failover, can 
achieve the kind of multi-host redundancy I want. If they can, are there 
examples of such configurations that I can see? If they cannot, what kind of 
configuration would be suitable for what I want to achieve? 

From: Hari Shreedharan [mailto:[email protected]] 
Sent: Wednesday, 20 February 2013 5:39 p.m.
To: [email protected]
Subject: Re: Architecting Flume for failover

Also, as Jeff said, sink-2 has a higher priority (the absolute value of the 
priority being higher, that sink is picked up). 




-- 
Hari Shreedharan

On Tuesday, February 19, 2013 at 8:37 PM, Hari Shreedharan wrote:
No, it does not mean that. To talk to different HDFS clusters you must specify 
the hdfs.path as hdfs://namenode:port/<path>. You don't need to specify the 
bind etc. 

Hope this helps.

Hari

-- 
Hari Shreedharan

On Tuesday, February 19, 2013 at 8:18 PM, Noel Duffy wrote:
Hari Shreedharan [mailto:[email protected]] wrote:

The "bind" configuration param does not really exist for HDFS Sink (it is only 
for the IPC sources). 

Does this mean that failover for sinks on different hosts cannot work for HDFS 
sinks at all? Does it require Avro sinks, which seem to have a hostname 
parameter?


Reply via email to