Sink-groups are a concept local to a single agent. What is happening to
you is that both your agents have the same sink-group configuration, and
each is independently selecting its own highest-priority sink.
Agents are independent and communicate with each other via sink-source
couplings such as avro-sink to avro-source (or thrift-sink to thrift-source).
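For instance, a minimal sketch of such a coupling (the agent names, hostnames, ports and channel names here are hypothetical) might look like:

```properties
# On host A: an avro sink that forwards events to host B
agentA.sinks.avroSink.type = avro
agentA.sinks.avroSink.hostname = hostB.example.com
agentA.sinks.avroSink.port = 4545
agentA.sinks.avroSink.channel = ch1

# On host B: an avro source that receives those events
agentB.sources.avroSrc.type = avro
agentB.sources.avroSrc.bind = 0.0.0.0
agentB.sources.avroSrc.port = 4545
agentB.sources.avroSrc.channels = ch1
```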
Fail-over and load balancing in sink groups are intended to be
mechanisms that redirect traffic when a particular sink fails. So if,
say, you have two HDFS clusters and one fails, you can reroute your data
to the secondary one until it recovers.
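A sketch of that setup, all within a single agent (sink names, namenode hostnames and paths are made up for illustration):

```properties
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = hdfsSink-1 hdfsSink-2
agent1.sinkgroups.g1.processor.type = failover
# The higher-priority sink gets the traffic while healthy;
# on failure, events are redirected to the lower-priority one
agent1.sinkgroups.g1.processor.priority.hdfsSink-1 = 10
agent1.sinkgroups.g1.processor.priority.hdfsSink-2 = 5
agent1.sinkgroups.g1.processor.maxpenalty = 10000

# Two sinks pointing at two different HDFS clusters
agent1.sinks.hdfsSink-1.type = hdfs
agent1.sinks.hdfsSink-1.hdfs.path = hdfs://nn-primary.example.com:8020/flume/events
agent1.sinks.hdfsSink-1.channel = ch1
agent1.sinks.hdfsSink-2.type = hdfs
agent1.sinks.hdfsSink-2.hdfs.path = hdfs://nn-backup.example.com:8020/flume/events
agent1.sinks.hdfsSink-2.channel = ch1
```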
If you want to fail over between agents, what you likely want to do to
guarantee delivery is just set up two agents that both direct the data
to your HDFS cluster, then filter out duplicates when processing the
data.
If this isn't workable, a more complex approach might involve adding
sequence numbers to the data and running another Flume agent with an
interceptor that filters out duplicates based on those sequence numbers.
This would be rather complex, and isn't something built into Flume.
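A real interceptor would be written in Java against Flume's Interceptor API, but the filtering logic itself is simple. Here is a language-neutral sketch in Python, assuming (hypothetically) that each event carries a "seq" header assigned once at the point of origin:

```python
def dedupe_by_sequence(events, seen=None):
    """Drop events whose sequence number has already been seen.

    Each event is assumed to be a dict with a 'seq' key holding a
    sequence number assigned once at the point of origin, so copies
    produced by redundant agents share the same 'seq' value.
    """
    if seen is None:
        seen = set()
    unique = []
    for event in events:
        seq = event["seq"]
        if seq not in seen:
            seen.add(seq)
            unique.append(event)
    return unique

# Events duplicated by two redundant agents collapse to one copy each.
incoming = [{"seq": 1, "body": "a"},
            {"seq": 2, "body": "b"},
            {"seq": 1, "body": "a"}]
print(dedupe_by_sequence(incoming))
```

Note that in a real deployment the `seen` set would have to be bounded (e.g. a sliding window keyed on recent sequence numbers), since it otherwise grows without limit.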
On 02/20/2013 03:07 PM, Noel Duffy wrote:
I think we're talking at cross purposes here. First, let me re-state the
problem at a high level. Hopefully this will be clearer:
I want to have two or more Flume agents running on different hosts reading
events from the same source (RabbitMQ) but writing each event to the final
sink (HDFS) only once. Thus, if someone kicks out a cable and one of the Flume hosts
dies, the other(s) should take over seamlessly without the loss of any events.
I thought that failover and load-balancer sinkgroups were created to solve this
kind of problem, but the documentation only covers the case where all the sinks
in the sinkgroup are on one host, and this does not give me the redundancy I
need.
This is how I tried to solve the problem. I set up two hosts running Flume with
the same configuration. Thinking that a failover sinkgroup was the right
approach to tackling my problem, I created a sinkgroup and put two hdfs sinks
in it, hdfsSink-1 and hdfsSink-2. The idea was that hdfsSink-1 would be on
Flume host A and hdfsSink-2 would be on Flume host B. Then, events arrive on
both host A and host B, and host A would write the event to hdfsSink-2, sending
it across the network to Flume Host B, while host B would write the same event
to hdfsSink-2 which is local to it. Both agents should, in theory, write the
event to the same sink on Flume Host B. Yes, I know that this would still
duplicate the events, but I figured I would worry about that later. However,
this has not worked as I expected. Flume Host A does not pass anything to Flume
Host B.
I need to know if the approach I've adopted, sinkgroups and failover, can
achieve the kind of multi-host redundancy I want. If they can, are there
examples of such configurations that I can see? If they cannot, what kind of
configuration would be suitable for what I want to achieve?
From: Hari Shreedharan [mailto:[email protected]]
Sent: Wednesday, 20 February 2013 5:39 p.m.
To: [email protected]
Subject: Re: Architecting Flume for failover
Also, as Jeff said, sink-2 has a higher priority (the sink whose
priority has the higher absolute value is the one picked).
--
Hari Shreedharan
On Tuesday, February 19, 2013 at 8:37 PM, Hari Shreedharan wrote:
No, it does not mean that. To talk to different HDFS clusters you must specify the
hdfs.path as hdfs://namenode:port/<path>. You don't need to specify the bind
etc.
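For example (the namenode hostnames and paths here are hypothetical), each sink would point at its own cluster directly:

```properties
agent.sinks.hdfsSink-1.hdfs.path = hdfs://namenode-a.example.com:8020/flume/events
agent.sinks.hdfsSink-2.hdfs.path = hdfs://namenode-b.example.com:8020/flume/events
```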
Hope this helps.
Hari
On Tuesday, February 19, 2013 at 8:18 PM, Noel Duffy wrote:
Hari Shreedharan [mailto:[email protected]] wrote:
The "bind" configuration param does not really exist for HDFS Sink (it is only
for the IPC sources).
Does this mean that failover for sinks on different hosts cannot work for HDFS
sinks at all? Does it require Avro sinks, which seem to have a hostname
parameter?