Hi Saravana,

Flume checks the file's size and last-modified time when it starts reading the 
file and again when it has finished reading. If either value differs between 
the two checks, Flume will fail noisily. This means that you must move a fully 
written file into the directory or it will not be ingested into your workflow. 
If you're running on a Unix system, you can't use cp to drop the file into the 
directory, because cp writes the file incrementally, whereas mv moves the file 
in one go (provided the source and destination are on the same filesystem; 
across filesystems mv falls back to copying).
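
As a rough illustration of the two points above, here is a minimal Python 
sketch. It is not Flume's actual Java code: the function names snapshot, 
read_unchanged and drop_into_spool are made up for this example, and the 
staging directory is an assumption about how you might lay things out.

```python
import os
import shutil


def snapshot(path):
    """Return (size, mtime_ns): the two values compared before and after a read."""
    st = os.stat(path)
    return st.st_size, st.st_mtime_ns


def read_unchanged(path):
    """Read a file and raise if its size or mtime changed while it was being read."""
    before = snapshot(path)
    with open(path, "rb") as f:
        data = f.read()
    if snapshot(path) != before:
        raise RuntimeError(f"{path} was modified while being read")
    return data


def drop_into_spool(src, staging_dir, spool_dir):
    """Copy src into staging_dir, then rename the finished copy into spool_dir.

    Assumes staging_dir and spool_dir are on the same filesystem, so the
    rename is a single step and a reader never sees a half-written file.
    """
    name = os.path.basename(src)
    staged = os.path.join(staging_dir, name)
    shutil.copyfile(src, staged)   # the slow, incremental part happens here
    final = os.path.join(spool_dir, name)
    os.rename(staged, final)       # the file appears in spool_dir all at once
    return final
```

The slow copy happens outside the spooling directory, so the only thing the 
reader ever sees there is a complete file.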



Regards,
Guy Needham | Data Discovery
Virgin Media | Enterprise Data, Design & Management
Bartley Wood Business Park, Hook, Hampshire RG27 9UP
D 01256 75 3362
I welcome VSRE emails. Learn more at http://vsre.info/



________________________________
From: SaravanaKumar TR [mailto:saran0081...@gmail.com]
Sent: 23 July 2014 06:38
To: user@flume.apache.org
Subject: Re: how spooling directory source identifies the complete file

Thanks Ashish, I had already referred to this info.

But I couldn't find any explanation in the Flume user guide of how Flume 
differentiates between a file whose copy is still in progress and a fully 
copied file.


On Wed, Jul 23, 2014 at 10:59 AM, Ashish <paliwalash...@gmail.com> wrote:
This is specified in Flume's User Guide

"Unlike the Exec source, this source is reliable and will not miss data, even 
if Flume is restarted or killed. In exchange for this reliability, only 
immutable, uniquely-named files must be dropped into the spooling directory. 
Flume tries to detect these problem conditions and will fail loudly if they are 
violated:

  1.  If a file is written to after being placed into the spooling directory, 
Flume will print an error to its log file and stop processing.
  2.  If a file name is reused at a later time, Flume will print an error to 
its log file and stop processing.

To avoid the above issues, it may be useful to add a unique identifier (such as 
a timestamp) to log file names when they are moved into the spooling directory."
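
The timestamp suggestion at the end of that passage could be sketched in 
Python roughly as follows. This is only an illustration, not part of Flume or 
its User Guide: the helper names unique_spool_name and move_with_unique_name 
and the millisecond naming scheme are assumptions made for this example.

```python
import os
import time


def unique_spool_name(filename, now=None):
    """Return filename with a millisecond timestamp inserted before the extension."""
    stem, ext = os.path.splitext(filename)
    millis = int((time.time() if now is None else now) * 1000)
    return f"{stem}.{millis}{ext}"


def move_with_unique_name(src, spool_dir):
    """Rename src into spool_dir under a timestamped name, refusing to reuse a name."""
    target = os.path.join(spool_dir, unique_spool_name(os.path.basename(src)))
    if os.path.exists(target):
        raise FileExistsError(target)
    os.rename(src, target)  # assumes src and spool_dir share a filesystem
    return target
```

With this scheme a file called app.log would land in the spooling directory 
as app.<milliseconds>.log, so a later rotation of app.log gets a new name.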


On Wed, Jul 23, 2014 at 10:17 AM, SaravanaKumar TR <saran0081...@gmail.com> wrote:
Hi Jeff,

Thanks for your comments. But what I am really looking for is this: suppose we 
are copying a 1 GB file to the spool directory and the copy is still in 
progress. How does Flume recognize that the complete file has been copied into 
the spool directory and that the file is ready for processing?

How does Flume make sure it doesn't start processing the partially copied file?


On Tue, Jul 22, 2014 at 11:15 PM, Jeff Lord <jl...@cloudera.com> wrote:
I believe the way this works is that flume creates a meta directory to track 
which file is being read.
In the event of a restart of the agent the entire file will be re-read which 
will create some duplicate events.

https://github.com/apache/flume/blob/flume-1.5/flume-ng-core/src/main/java/org/apache/flume/client/avro/ReliableSpoolingFileEventReader.java#L474


On Tue, Jul 22, 2014 at 6:15 AM, SaravanaKumar TR <saran0081...@gmail.com> wrote:
Hi,

I am planning to use the spooling directory source to move log files to an 
HDFS sink.

I would like to know how Flume identifies whether a file we are moving to the 
spool directory is complete, or is partial with its move still in progress.

Suppose a file is large and we have started moving it to the spool directory: 
how does Flume identify whether the complete file has been transferred or the 
transfer is still in progress?

Please help me out here.

Thanks,
saravana





--
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal


