Hi Maciek,

I agree with you that 1ms is often too long :P
This is the reason why I will open a discussion, to have all the ideas / requirements / shortcomings in a single place. This way the community can track and influence what is coming next. Hopefully I will do it in the afternoon, and I will send you the discussion thread. A rough sketch of what the notifyCheckpointComplete() approach could look like is at the end of this mail.

Cheers,
Kostas

> On Oct 18, 2016, at 11:43 AM, Maciek Próchniak <m...@touk.pl> wrote:
>
> Hi Kostas,
>
> thanks for the quick answer.
>
> I wouldn't dare to delete files in the InputFormat if they were split and
> processed in parallel...
>
> As for using notifyCheckpointComplete - thanks for the suggestion, it looks
> pretty interesting, I'll try it out. Although I wonder a bit whether relying
> only on the modification timestamp is enough - many things may happen in
> one ms :)
>
> thanks,
>
> maciek
>
>
> On 18/10/2016 11:14, Kostas Kloudas wrote:
>> Hi Maciek,
>>
>> Just a follow-up on the previous email: given that splits are read in
>> parallel, when the ContinuousFileMonitoringFunction forwards the last
>> split, it does not mean that the final split is going to be processed
>> last. If the node it gets assigned to is fast enough, it may finish
>> before the others.
>>
>> This assumption only holds if you have a parallelism of 1.
>>
>> Cheers,
>> Kostas
>>
>>> On Oct 18, 2016, at 11:05 AM, Kostas Kloudas <k.klou...@data-artisans.com>
>>> wrote:
>>>
>>> Hi Maciek,
>>>
>>> Currently this functionality is not supported, but it seems like a good
>>> addition. Actually, given that the feature is rather new, we were
>>> thinking of opening a discussion on the dev mailing list in order to
>>>
>>> i) discuss some current limitations of the Continuous File Processing source
>>> ii) see how people use it and adjust our features accordingly
>>>
>>> I will let you know as soon as I open this thread.
>>>
>>> By the way, for your use case we should probably have a callback in
>>> notifyCheckpointComplete() that informs the source that a given
>>> checkpoint was successfully performed, so that we can then purge the
>>> already processed files. This could be a good solution.
>>>
>>> Thanks,
>>> Kostas
>>>
>>>> On Oct 18, 2016, at 9:40 AM, Maciek Próchniak <m...@touk.pl> wrote:
>>>>
>>>> Hi,
>>>>
>>>> we want to monitor an hdfs (or local) directory, read the csv files
>>>> that appear, and after successful processing delete them (mainly so as
>>>> not to run out of disk space...)
>>>>
>>>> I'm not quite sure how to achieve this with the current implementation.
>>>> Previously, when we read binary data (unsplittable files), we made a
>>>> small hack and deleted them in our FileInputFormat - but now we want to
>>>> use splits, and detecting which split is 'the last one' is no longer so
>>>> obvious - of course it's also problematic when it comes to
>>>> checkpointing...
>>>>
>>>> So my question is - is there an idiomatic way of deleting processed files?
>>>>
>>>> thanks,
>>>>
>>>> maciek
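P.S. To make the idea concrete, here is a minimal sketch of how a custom source could use notifyCheckpointComplete() to purge files. This is not the current API of the continuous file source - the class name PendingFileSource, the onSnapshot() hook, and the bookkeeping are illustrative assumptions; only CheckpointListener, RichSourceFunction, and Flink's FileSystem/Path come from the actual API:

// Illustrative sketch only: a source that remembers which files were fully
// forwarded before each checkpoint and deletes them once the checkpoint is
// confirmed. Synchronization with the reader thread is elided for brevity.
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PendingFileSource extends RichSourceFunction<String>
        implements CheckpointListener {

    private volatile boolean running = true;

    // Files fully forwarded since the last checkpoint barrier.
    private final List<Path> forwardedSinceLastCheckpoint = new ArrayList<>();

    // checkpointId -> files that may be deleted once that checkpoint completes.
    private final TreeMap<Long, List<Path>> pendingByCheckpoint = new TreeMap<>();

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            // ... monitor the directory, read each file and, under the
            // checkpoint lock, emit its records and record the file:
            // synchronized (ctx.getCheckpointLock()) {
            //     ctx.collect(record);
            //     forwardedSinceLastCheckpoint.add(file);
            // }
            Thread.sleep(1000);
        }
    }

    // Hypothetical hook: would be called from the snapshotting logic while
    // holding the checkpoint lock, to bind the forwarded files to this
    // checkpoint id.
    public void onSnapshot(long checkpointId) {
        pendingByCheckpoint.put(checkpointId,
                new ArrayList<>(forwardedSinceLastCheckpoint));
        forwardedSinceLastCheckpoint.clear();
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        // Everything forwarded before this checkpoint has been safely
        // checkpointed downstream, so the corresponding files can be purged.
        Map<Long, List<Path>> done =
                pendingByCheckpoint.headMap(checkpointId, true);
        for (List<Path> files : done.values()) {
            for (Path file : files) {
                FileSystem.get(file.toUri()).delete(file, false);
            }
        }
        done.clear();
    }

    @Override
    public void cancel() {
        running = false;
    }
}

Note that this is exactly why tying deletion to checkpoint completion matters: with a parallelism greater than 1, the "last" split is not necessarily processed last, so "last split seen" is not a safe deletion trigger, while a completed checkpoint is.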