Re: how spooling directory source identifies the complete file

SaravanaKumar TR Tue, 22 Jul 2014 22:39:07 -0700

Thanks Ashish, I had already referred to this info.

But I couldn't find any explanation in the Flume User Guide of how Flume
differentiates between a file whose copy is still in progress and a fully
copied file.
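
For later readers: the user guide text quoted below only says that files
must be immutable once they are in the spooling directory. It does not
describe any check for a copy that is still in progress. One common way to
meet that requirement is to copy the file to a staging directory on the same
filesystem and then rename it into the spooling directory, so the file only
appears there once it is complete. Below is a minimal Java sketch of that
idea. It is not taken from Flume; the class name SpoolDropper, the method
dropIntoSpool and the staging directory layout are hypothetical, and it
assumes both directories are on one filesystem so that an atomic move is
supported.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Flume: stages a file outside the
// spooling directory and renames it in only after the copy has finished.
public class SpoolDropper {

    /**
     * Copies source into stagingDir, then moves the finished copy into
     * spoolDir in a single rename. Assumes stagingDir and spoolDir are on
     * the same filesystem; otherwise ATOMIC_MOVE fails with
     * AtomicMoveNotSupportedException.
     */
    public static Path dropIntoSpool(Path source, Path stagingDir, Path spoolDir)
            throws IOException {
        Files.createDirectories(stagingDir);
        Files.createDirectories(spoolDir);

        // Timestamp prefix so that file names are not reused, as the user
        // guide suggests.
        String uniqueName = System.currentTimeMillis() + "-" + source.getFileName();

        // The slow part: the copy happens where the source is not watching.
        Path staged = stagingDir.resolve(uniqueName);
        Files.copy(source, staged, StandardCopyOption.REPLACE_EXISTING);

        // The fast part: the complete file appears in the spool directory
        // in one step.
        Path target = spoolDir.resolve(uniqueName);
        return Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: SpoolDropper <source> <stagingDir> <spoolDir>");
            System.exit(1);
        }
        Path dropped = dropIntoSpool(
                Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
        System.out.println("Dropped " + dropped);
    }
}
```

The staging directory must not be the spooling directory itself or a place
the source is watching. If your Flume version has the ignorePattern property
on the spooling directory source, another option may be to write the file
under a temporary suffix that the pattern skips and rename it when the copy
is done; check the user guide for your version before relying on that.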



On Wed, Jul 23, 2014 at 10:59 AM, Ashish <paliwalash...@gmail.com> wrote:

> This is specified in Flume's User Guide
>
> "Unlike the Exec source, this source is reliable and will not miss data,
> even if Flume is restarted or killed. In exchange for this reliability,
> only immutable, uniquely-named files must be dropped into the spooling
> directory. Flume tries to detect these problem conditions and will fail
> loudly if they are violated:
>
>    1. If a file is written to after being placed into the spooling
>    directory, Flume will print an error to its log file and stop processing.
>    2. If a file name is reused at a later time, Flume will print an error
>    to its log file and stop processing.
>
> To avoid the above issues, it may be useful to add a unique identifier
> (such as a timestamp) to log file names when they are moved into the
> spooling directory."
>
>
> On Wed, Jul 23, 2014 at 10:17 AM, SaravanaKumar TR <saran0081...@gmail.com
> > wrote:
>
>> Hi Jeff,
>>
>> Thanks for your comments. What I am really looking for is this: suppose
>> we are copying a 1 GB file into the spool directory and the copy is still
>> in progress. How does Flume recognize that the complete file has been
>> copied into the spool directory and is ready for processing?
>>
>> How does Flume make sure it doesn't start processing a partially copied file?
>>
>>
>> On Tue, Jul 22, 2014 at 11:15 PM, Jeff Lord <jl...@cloudera.com> wrote:
>>
>>> I believe the way this works is that Flume creates a meta directory to
>>> track which file is being read.
>>> In the event of a restart of the agent, the entire file will be re-read,
>>> which will create some duplicate events.
>>>
>>>
>>> https://github.com/apache/flume/blob/flume-1.5/flume-ng-core/src/main/java/org/apache/flume/client/avro/ReliableSpoolingFileEventReader.java#L474
>>>
>>>
>>> On Tue, Jul 22, 2014 at 6:15 AM, SaravanaKumar TR <
>>> saran0081...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am planning to use the spooling directory source to move log files
>>>> to an HDFS sink.
>>>>
>>>> I would like to know how Flume identifies whether a file we are moving
>>>> into the spool directory is complete, or partial with its move still in
>>>> progress.
>>>>
>>>> Suppose a file is large and we have started moving it to the spooling
>>>> directory: how does Flume identify whether the complete file has been
>>>> transferred or the transfer is still in progress?
>>>>
>>>> Please help me out here.
>>>>
>>>> Thanks,
>>>> saravana
>>>>
>>>
>>>
>>
>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>
