We (www.plumbee.co.uk) have been using Flume NG in combination with S3
successfully for about 10 months now without any major issues. Our whole
tech stack is hosted in AWS and on average we process 450 million events
per day (approximately 120 GB), all of which is collected via Flume,
aggregated using the FileChannel backed by EBS volumes, and uploaded to S3
using the HDFS event sink.
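
For anyone curious what that pipeline looks like, here is a minimal sketch
of such an agent configuration. The names, paths, port and roll settings
are illustrative, not our production values:

    # One agent: source -> FileChannel on EBS -> HDFS sink writing to S3
    agent.sources  = avroSrc
    agent.channels = fileCh
    agent.sinks    = s3Sink

    agent.sources.avroSrc.type     = avro
    agent.sources.avroSrc.bind     = 0.0.0.0
    agent.sources.avroSrc.port     = 4141
    agent.sources.avroSrc.channels = fileCh

    # FileChannel with checkpoint and data directories on an EBS volume
    agent.channels.fileCh.type          = file
    agent.channels.fileCh.checkpointDir = /mnt/ebs/flume/checkpoint
    agent.channels.fileCh.dataDirs      = /mnt/ebs/flume/data

    # HDFS event sink pointed at S3 via the s3n:// filesystem
    # (the %Y-%m-%d escapes require a timestamp header on the events)
    agent.sinks.s3Sink.type              = hdfs
    agent.sinks.s3Sink.channel           = fileCh
    agent.sinks.s3Sink.hdfs.path         = s3n://your-bucket/events/%Y-%m-%d
    agent.sinks.s3Sink.hdfs.fileType     = DataStream
    agent.sinks.s3Sink.hdfs.rollInterval = 3600
    agent.sinks.s3Sink.hdfs.rollSize     = 0
    agent.sinks.s3Sink.hdfs.rollCount    = 0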

The data we collect represents analytics events from our gaming platform
which cannot be recovered if lost, so reliability and durability are very
important to us. Although Flume has a great transactional model to achieve
this, the S3 filesystem implementation provided by the Hadoop project has
several issues which led us to modify it heavily.

One such problem is that the syncFs() method of the filesystem, which
should force any system buffers to be written out, actually does nothing in
the context of S3. So while Flume believes the data is safe and removes it
from the channel, you have no guarantee that it actually is.
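
You can demonstrate this with a few lines against the Hadoop FileSystem
API. The sketch below uses the stream-level sync() call that syncFs() ends
up delegating to; the bucket name and credentials are placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3SyncDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Credentials for the s3n:// scheme; bucket and key are placeholders.
            conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
            conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

            FileSystem fs = FileSystem.get(URI.create("s3n://your-bucket/"), conf);
            Path path = new Path("s3n://your-bucket/flume/sync-test");

            FSDataOutputStream out = fs.create(path);
            out.write("some event data".getBytes("UTF-8"));
            out.sync(); // Hadoop 1.x Syncable call (hflush() in later releases)

            // Nothing has been uploaded yet -- the object is invisible in S3.
            System.out.println("exists after sync():  " + fs.exists(path));

            out.close(); // only now is the locally buffered file pushed to S3

            System.out.println("exists after close(): " + fs.exists(path));
        }
    }

The first exists() check prints false and the second prints true, and that
gap is exactly the window in which Flume has already committed its
transaction and dropped the events from the channel.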

Also, the S3 filesystem buffers data locally on disk first, and only on
close() are the contents of the file uploaded to S3. If for whatever reason
Flume crashes or the box dies, the contents of those files are simply
orphaned on the local filesystem and you have to recover them manually
(assuming they aren't also corrupted).
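
Those buffer files live wherever fs.s3.buffer.dir points, which defaults to
${hadoop.tmp.dir}/s3, i.e. /tmp/hadoop-<user>/s3 unless you have overridden
it. Something along these lines is enough to spot leftovers after a crash:

    import java.io.File;

    public class FindOrphanedS3Buffers {
        public static void main(String[] args) {
            // Assumes the default fs.s3.buffer.dir of ${hadoop.tmp.dir}/s3;
            // pass the real directory as an argument if it has been overridden.
            File bufferDir = new File(args.length > 0 ? args[0]
                    : "/tmp/hadoop-" + System.getProperty("user.name") + "/s3");
            File[] leftovers = bufferDir.listFiles();
            if (leftovers == null || leftovers.length == 0) {
                System.out.println("No buffered files under " + bufferDir);
                return;
            }
            for (File f : leftovers) {
                System.out.printf("%s\t%d bytes\tlast modified %tc%n",
                        f.getPath(), f.length(), f.lastModified());
            }
        }
    }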

If you have any other questions about our setup just ask!

Cheers,
Dennis
